Data Strategy Archives - Tiger Analytics

Solving Merchant Identity Extraction in Finance: Snowpark’s Data Engineering Solution
Published: Fri, 26 Jul 2024 | https://www.tigeranalytics.com/perspectives/blog/solving-merchant-identity-extraction-in-finance-snowparks-data-engineering-solution/

Learn how a fintech leader solved merchant identification challenges using Snowpark and local testing. This case study showcases Tiger Analytics’ approach to complex data transformations, automated testing, and efficient development in financial data processing. Discover how these solutions enhanced fraud detection and revenue potential.

In the high-stakes world of financial technology, data is king. But what happens when that data becomes a labyrinth of inconsistencies? This was the challenge faced by a senior data engineer at a leading fintech company.

“Our merchant identification system is failing us. We’re losing millions in potential revenue and our fraud detection is compromised. We need a solution, fast.”

The issue was clear but daunting. Every day, their system processed millions of transactions, each tied to a merchant. But these merchant names were anything but consistent. For instance, a well-known retail chain might be listed under its official name, a common abbreviation, a location-specific identifier, or simply a generic category. This inconsistency was wreaking havoc across the business.

Initial Approach: Snowflake SQL Procedures

Initially, the data engineer and his team developed Snowflake SQL procedures to handle this complex data transformation. While these procedures worked, the team wanted to add automated testing pipelines and quickly ran into limitations. “We need more robust regression and automated testing capabilities. And we need to implement these tests without constantly connecting to a Snowflake account.” This capability wasn’t possible with traditional Snowflake SQL procedures, pushing them to seek external expertise.

Enter Tiger Analytics: A New Approach with Snowpark and Local Testing Framework

After understanding the challenges, the Tiger team proposed a solution: leveraging Snowpark for complex data transformations and introducing a local testing framework. This approach aimed to solve the merchant identity issue and improve the entire data pipeline process.

To meet these requirements, the team turned to Snowpark. Snowpark enabled them to perform complex data transformations and manipulations within Snowflake, leveraging the power of Snowflake’s computational engine. However, the most crucial part was the Snowpark Python Local Testing Framework. This framework allowed the team to develop and test their Snowpark DataFrames, stored procedures, and UDFs locally, fulfilling the need for regression testing and automated testing without connecting to a Snowflake account.

Key Benefits

  • Local Development: The team could develop and test their Snowpark Python code without a Snowflake account. This reduced the barrier to entry and sped up their iteration cycles.
  • Efficient Testing: By utilizing familiar testing frameworks like PyTest, the team integrated their tests seamlessly into existing development workflows.
  • Enhanced Productivity: The team quickly iterated on their code with local feedback, enabling rapid prototyping and troubleshooting before deploying to their Snowflake environment.

Overcoming Traditional Unit Testing Limitations

In the traditional sense of unit testing, Snowpark does not support a fully isolated environment independent of a Snowflake instance. Typically, unit tests would mock a database object, but Snowpark lacks a local context for such mocks. Even using the create_dataframe method requires Snowflake connectivity.

The Solution with Local Testing Framework

Despite these limitations, the Snowpark Python Local Testing Framework enabled the team to create and manipulate DataFrames, stored procedures, and UDFs locally, which was pivotal for this use case. Here’s how the Tiger team did it:

Setting Up the Environment

First, set up a Python environment:

pip install "snowflake-snowpark-python[localtest]"
pip install pytest

Next, create a local testing session:

from snowflake.snowpark import Session
session = Session.builder.config('local_testing', True).create()

Creating Local DataFrames

The Tiger team created DataFrames from local data sources and operated on them:

table = 'example'
session.create_dataframe([[1, 2], [3, 4]], ['a', 'b']).write.save_as_table(table)

Operating on these DataFrames was straightforward:

from snowflake.snowpark.functions import col

df = session.create_dataframe([[1, 2], [3, 4]], ['a', 'b'])
res = df.select(col('a')).where(col('b') > 2).collect()
print(res)

Creating UDFs and Stored Procedures

The framework allowed the team to create and call UDFs and stored procedures locally:

from snowflake.snowpark.functions import udf, sproc, call_udf, col
from snowflake.snowpark.types import IntegerType, StringType

@udf(name='example_udf', return_type=IntegerType(), input_types=[IntegerType(), IntegerType()])
def example_udf(a, b):
    return a + b

@sproc(name='example_proc', return_type=IntegerType(), input_types=[StringType()])
def example_proc(session, table_name):
    return session.table(table_name)\
        .with_column('c', call_udf('example_udf', col('a'), col('b')))\
        .count()

# Call the stored procedure by name
output = session.call('example_proc', table)

Using PyTest for Efficient Testing

The team leveraged PyTest for efficient unit and integration testing:

PyTest Fixture

In the conftest.py file, the team created a PyTest fixture for the Session object:

import pytest
from snowflake.snowpark.session import Session

def pytest_addoption(parser):
    parser.addoption("--snowflake-session", action="store", default="live")

@pytest.fixture(scope='module')
def session(request) -> Session:
    if request.config.getoption('--snowflake-session') == 'local':
        return Session.builder.configs({'local_testing': True}).create()
    else:
        snowflake_credentials = {} # Specify Snowflake account credentials here
        return Session.builder.configs(snowflake_credentials).create()
Using the Fixture in Test Cases

from project.sproc import my_stored_proc

def test_create_fact_tables(session):
    expected_output = ...
    actual_output = my_stored_proc(session)
    assert expected_output == actual_output

Running Tests

To run the test suite locally:

pytest --snowflake-session local

To run the test suite against your Snowflake account:

pytest

Addressing Unsupported Functions

Some functions were not supported in the local testing framework. For these, the team used patch functions and MagicMock:

Patch Functions

For unsupported functions like upper(), the team used patch functions:

from unittest.mock import patch
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

session = Session.builder.config('local_testing', True).create()

@patch('snowflake.snowpark.functions.upper')
def test_to_uppercase(mock_upper):
    # Substitute upper() with a simple stand-in expression for the test
    mock_upper.side_effect = lambda c: F.concat(c, F.lit('_MOCKED'))
    df = session.create_dataframe([('Alice',), ('Bob',)], ['name'])
    # Call through the module reference so the patched attribute is picked up
    result = df.select(F.upper(df['name']))
    collected = result.collect()
    assert collected == [('Alice_MOCKED',), ('Bob_MOCKED',)]
  
MagicMock

For more complex behaviors like explode(), the team used MagicMock:

from unittest.mock import MagicMock, patch
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

session = Session.builder.config('local_testing', True).create()

def test_explode_df():
    mock_explode = MagicMock()
    # Return a constant marker expression instead of actually exploding the array
    mock_explode.side_effect = lambda c: F.lit('MOCKED_EXPLODE')
    df = session.create_dataframe([([1, 2, 3],), ([4, 5, 6],)], ['data'])
    with patch('snowflake.snowpark.functions.explode', mock_explode):
        # Call through the module reference so the patched attribute is used
        result = df.select(F.explode(F.col('data')))
        collected = [row[0] for row in result.collect()]
        assert collected == ['MOCKED_EXPLODE', 'MOCKED_EXPLODE']

test_explode_df()

Scheduling Procedures Limitations

While implementing these solutions, the Tiger team faced issues with scheduling procedures using serverless tasks, so they used a task attached to a warehouse instead, creating a Snowpark-optimized warehouse for the purpose. The team noted that serverless tasks cannot invoke certain object types and functions, specifically:

  • UDFs (user-defined functions) that contain Java or Python code.
  • Stored procedures written in Scala (using Snowpark), or procedures that call UDFs containing Java or Python code.

Turning Data Challenges into Business Insights

The journey from the initial challenge of extracting merchant identities from inconsistent transaction data to a streamlined, efficient process demonstrates the power of advanced data solutions. The Tiger team leveraged Snowpark and its Python Local Testing Framework, not only solving the immediate problem but also enhancing their overall approach to data pipeline development and testing. The combination of regex-based, narration-pattern-based, and ML-based methods enabled them to tackle the complexity of unstructured bank statement data effectively.
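
To make the connection back to the merchant-identity problem concrete, here is a minimal sketch of what a regex-based normalization step could look like as a locally testable Snowpark UDF. The patterns, merchant names, and sample narrations are illustrative assumptions, not the client’s actual rules, and a real pipeline would combine this with the pattern-based and ML-based methods mentioned above.

import re
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import StringType

# Local session, so the example runs without a Snowflake account
session = Session.builder.config('local_testing', True).create()

# Hypothetical patterns -- real mappings would come from the merchant master data
MERCHANT_PATTERNS = [
    (r'(?i)^wal[\s\-]?mart.*', 'WALMART'),
    (r'(?i)^(amzn|amazon).*', 'AMAZON'),
]

@udf(name='normalize_merchant', return_type=StringType(), input_types=[StringType()])
def normalize_merchant(narration: str) -> str:
    for pattern, canonical in MERCHANT_PATTERNS:
        if re.match(pattern, narration or ''):
            return canonical
    return 'UNKNOWN'

df = session.create_dataframe([['WAL-MART #1234 TX'], ['AMZN Mktp US']], ['narration'])
df.select(normalize_merchant(col('narration')).alias('merchant')).show()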

This project’s success extends beyond merchant identification, showcasing how the right tools and methodologies can transform raw data into meaningful insights. For data engineers facing similar challenges, this case study highlights how Snowpark and local testing frameworks can significantly improve data application development, leading to more efficient, accurate, and impactful solutions.

A Comprehensive Guide to Pricing and Licensing on Microsoft Fabric
Published: Mon, 01 Jul 2024 | https://www.tigeranalytics.com/perspectives/blog/a-comprehensive-guide-to-pricing-and-licensing-on-microsoft-fabric/

This comprehensive guide explores Microsoft Fabric’s pricing strategies, including capacity tiers, SKUs, and tenant hierarchy, helping organizations optimize their data management costs. It breaks down the differences between reserved and pay-as-you-go models, explaining Capacity Units (CUs) and providing detailed pricing information. By understanding these pricing intricacies, businesses can make informed decisions to fully leverage their data across various functions, leading to more efficient operations and better customer experiences.

Organizations often face challenges in effectively leveraging data to streamline operations and enhance customer satisfaction. Siloed data, complexities associated with ingesting, processing, and storing data at scale, and limited collaboration across departments can hinder a company’s ability to make informed, data-driven decisions. This can result in missed opportunities, inefficiencies, and suboptimal customer experiences.

Here’s where Microsoft’s new SaaS platform “Microsoft Fabric” can give organizations a much-needed boost. By integrating data across various functions, including data science (DS), data engineering (DE), data analytics (DA), and business intelligence (BI), Microsoft Fabric enables companies to harness the full potential of their data. The goal is to enable seamless sharing of data across the organization while simplifying all the key functions of Data Engineering, Data Science, and Data Analytics to facilitate quicker and better-informed decision-making at scale.

For enterprises looking to utilize Microsoft Fabric’s full capabilities, understanding the platform’s pricing and licensing intricacies is crucial, impacting several key financial aspects of the organization:

1. Reserved vs Pay-as-you-go: Understanding pay-as-you-go versus reserved pricing helps in precise budgeting and can affect both initial and long-term operational costs.
2. Capacity Tiers: Clear knowledge of capacity tiers allows for predictable scaling of operations, facilitating smooth expansions without unexpected costs.
3. Fabric Tenant Hierarchy: It is important to understand the tenant hierarchy, as it determines how an organization needs to provision capacities for its unique requirements.
4. Existing Power BI Licenses: Customers with existing Power BI licenses (Free/Pro/Premium) should understand how those licenses can be utilized and how they tie in with Fabric SKUs.

At Tiger Analytics, our team of seasoned SMEs has helped clients navigate the intricacies of licensing and pricing models for robust platforms like Microsoft Fabric based on their specific needs.

In this blog, we will provide insights into Microsoft Fabric’s pricing strategies to help organizations make more informed decisions when considering this platform.

Overview of Microsoft Fabric:

Microsoft Fabric offers a unified and simplified cloud SaaS platform designed around the following ‘Experiences’:

  • Data Ingestion – Data Factory
  • Data Engineering – Synapse DE
  • Data Science – Synapse DS
  • Data Warehousing – Synapse DW
  • Real-Time Analytics – Synapse RTA
  • Business Intelligence – Power BI
  • Unified storage – OneLake

A Simplified Pricing Structure

Unlike Azure, where each tool has separate pricing, Microsoft Fabric simplifies this by focusing on two primary cost factors:

1. Compute Capacity: A single compute capacity can support all functionalities concurrently and can be shared across multiple projects and users, with no limit on the number of workspaces utilizing it. You do not need to select capacities individually for Data Factory, Synapse Data Warehousing, and other Fabric experiences.

2. Storage: Storage costs are separate yet simplified, making choices easier for the end customer.


Understanding Fabric’s Capacity Structure

To effectively navigate the pricing and licensing of Microsoft Fabric, it is crucial to understand how a Fabric capacity is associated with tenants and workspaces. Together, these three concepts organize the resources within an organization and help manage costs and operational efficiency.

1. Tenant: This represents the highest organizational level within Microsoft Fabric, and is associated with a single Microsoft Entra ID. An organization could also have multiple tenants.

2. Capacity: Under each tenant, there are one or more capacities. These represent pools of compute and storage resources that power the various Microsoft Fabric services and provide the capabilities for workload execution. They are analogous to horsepower in a car engine: the more capacity you provision, the more workloads you can run, and the faster they complete.

3. Workspace: Workspaces are environments where specific projects and workflows are executed. Workspaces are assigned a capacity, which represents the computing resources it can utilize. Multiple workspaces can share the resources of a single capacity, making it a flexible way to manage different projects or departmental needs without the necessity of allocating additional resources for each new project/ department.

Figure: the tenant hierarchy in Fabric and how different organizations can provision capacities based on their requirements.

Understanding Capacity Levels, SKUs, and Pricing in Microsoft Fabric

Microsoft Fabric capacities are defined by a Stock Keeping Unit (SKU) that corresponds to a specific amount of compute power, measured in Capacity Units (CUs). A CU is a unit that quantifies the amount of compute power available.

Capacity Units (CUs) = Compute Power

As shown in the table below, each SKU (Fn) is associated with a number of CUs. For example, F4 has double the capacity of F2 but half that of F8.

The breakdown below shows the SKUs available for the West Europe region, showing both Pay As You Go and Reserved (1-year) pricing options:


Comparative table showing Fabric SKUs, CUs, associated PBI SKU, Pay-as-you-Go and Reserved pricing for a region.
1 CU pay-as-you-go price in the West Europe region = $0.22/hour
1 CU pay-as-you-go monthly rate: $0.22 x 730 hours = $160.60; F2 (2 CUs) = $160.60 x 2 = $321.20
1 CU reserved (1-year) monthly rate: $0.22 x (1 - 0.405) x 730 ≈ $95.56; F2 reserved = $95.56 x 2 ≈ $191.11
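
As a quick sanity check on the rate math above, the same calculation can be expressed in a few lines of Python. The $0.22/CU-hour rate and the ~40.5% reservation discount are the West Europe figures quoted above and are used purely for illustration; always confirm current prices on the Azure pricing pages.

HOURLY_RATE_PER_CU = 0.22      # pay-as-you-go, $ per CU per hour (West Europe)
HOURS_PER_MONTH = 730
RESERVATION_DISCOUNT = 0.405   # ~40.5% off for the 1-year reservation

def monthly_cost(sku_cus, reserved=False):
    # Approximate monthly compute cost for an Fn SKU with `sku_cus` Capacity Units
    discount = RESERVATION_DISCOUNT if reserved else 0.0
    return round(HOURLY_RATE_PER_CU * (1 - discount) * HOURS_PER_MONTH * sku_cus, 2)

for cus in (2, 4, 8, 64):
    print(f"F{cus}: pay-as-you-go ${monthly_cost(cus):,.2f}/month, "
          f"reserved ${monthly_cost(cus, reserved=True):,.2f}/month")
# F2 works out to $321.20/month pay-as-you-go and about $191.11/month reserved,
# matching the figures above.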

Pricing Models Explained:

Pay As You Go: This flexible model allows you to pay monthly based on the SKU you select, making it ideal when workload demands are uncertain. You can purchase more capacity, upgrade or downgrade your capacity, and even pause capacities to save costs.

Reserved (1 year): You commit to a one-year reservation and pay a discounted rate monthly, which yields savings of around 40% compared with pay-as-you-go. Reserved capacity cannot be paused and is billed every month regardless of usage.

Storage Costs in Microsoft Fabric (OneLake)

In Microsoft Fabric, compute capacity does not include data storage costs. This means that businesses need to budget separately for storage expenses.

  • Storage costs need to be paid for separately.
  • Storage costs in Fabric (OneLake) are similar to ADLS (Azure Data Lake Storage).
  • BCDR (business continuity and disaster recovery) charges are also included. These come into play when workspaces are deleted but some data still needs to be extracted from them.
  • Beyond this, there are costs for cache storage (for KQL databases).
  • There are also costs for transferring data between regions, known as bandwidth pricing. More details are available in Microsoft’s bandwidth pricing documentation.

Optimizing Resource Use in Microsoft Fabric: Understanding Bursting and Smoothing Techniques

Even after purchasing a capacity, your workloads may occasionally demand more resources than it provides.

For this, Fabric offers two mechanisms: bursting, for faster execution, and smoothing, which flattens usage over time to keep costs optimal.

  • Bursting: Bursting enables the use of additional compute resources beyond your existing capacity to accelerate workload execution. For instance, if a task normally takes 60 seconds using 64 CUs, bursting can allocate 256 CUs to complete the same task in just 15 seconds.
  • Smoothing: Smoothing is applied automatically in Fabric across all capacities to manage brief spikes in resource usage. This method distributes the compute demand more evenly over time, which helps in avoiding extra costs that could occur with sudden increases in resource use.

Understanding Consumption: Where do your Computation Units (CUs) go?

(Image credit: Microsoft)

The following components in Fabric consume or utilize the CU (Capacity Units)

  • Data Factory Pipelines
  • Data Flow Gen2
  • Synapse Warehouse
  • Spark Compute
  • Event Stream
  • KQL Database
  • OneLake
  • Copilot
  • VNet Data Gateway
  • Data Activator (Reflex)
  • PowerBI

The CU consumption depends on the solution implemented for functionality. Here’s an example for better understanding:

Business Requirement: Ingest data from an on-prem data source and use it for Power BI reporting.

Solution Implemented: Data Factory pipelines with Notebooks to perform DQ checks on the ingested data. PowerBI reports were created pointing to the data in One Lake.

How are CU’s consumed:

CUs would be consumed every time the data factory pipeline executes and further invokes the Notebook (Spark Compute) to perform data quality checks.

Further, CU’s would get consumed whenever the data refreshes on the dashboard.

Microsoft Fabric Pricing Calculator:

Microsoft has streamlined the pricing calculation with its online calculator. By selecting your region, currency, and billing frequency (hourly or monthly), you can quickly view the pay-as-you-go rates for all SKUs. This gives you an immediate estimate of the monthly compute and storage costs for your chosen region. Additionally, links for reserved pricing and bandwidth charges are also available.

For more detailed and specific pricing analysis, Microsoft offers an advanced Fabric SKU Calculator tool through partner organizations.

Understanding Fabric Licensing: Types and Strategic Considerations

Licensing in Microsoft Fabric is essential because it legally permits and enables the use of its services within your organizational framework, ensuring compliance and tailored access to various functionalities. Licensing is distinct from pricing, as licensing outlines the terms and types of access granted, whereas pricing involves the costs associated with these licenses.

There are two types of licensing in Fabric:

  • Capacity-Based Licensing: This licensing model is required for operating Fabric’s services, where Capacity Units (CUs) define the extent of compute resources available to your organization. Different Stock Keeping Units (SKUs) are designed to accommodate varying workload demands, ranging from F2 to F2048. This flexibility allows businesses to scale their operations up or down based on their specific needs.
  • Per-User Licensing: User-based licensing carries over from Power BI unchanged (for compatibility). The user license types include:
    • Free
    • Pro
    • Premium Per User (PPU)

Each tailored to specific sets of capabilities as seen in the table below:

Image Credit: Microsoft (https://learn.microsoft.com/en-us/fabric/enterprise/licenses)

Understanding Licensing Scenarios

To optimally select the right Fabric licensing options and understand how they can be applied in real-world scenarios, it’s helpful to look at specific use cases within an organization. These scenarios highlight the practical benefits of choosing the right license type based on individual and organizational needs.

Scenario 1: When do you merely require a Power BI Pro License?

Consider the case of Sarah, a data analyst whose role involves creating and managing Power BI dashboards used organization-wide. These dashboards are critical for providing the leadership with the data needed to make informed decisions. In such a scenario, a Pro License is best because it allows Sarah to:

  • Create and manage Power BI dashboards within a dedicated workspace.
  • Set sharing permissions to control who can access the dashboards.
  • Enable colleagues to build their visualizations and reports from her Power BI datasets, fostering a collaborative work environment.

In the above scenario, a Pro license would suffice (based on the requirements listed above).

Scenario 2: What are the Licensing Options for Small Businesses?

Consider a small business with about 60 users that wants to leverage premium Power BI features (please refer to the comparison table above, which shows the capabilities of the Free, Pro, and PPU (Premium Per User) licenses) to enhance its data analysis capabilities. The company has two primary licensing options within Microsoft Fabric to accommodate its needs, each with different cost implications and service access levels.

Option 1: Premium Per User (PPU) Licensing

  • This option involves purchasing a Premium Per User license for each of the 60 users.
  • Cost Calculation: 60 users x $20 per month = $1,200 per month.
  • Note: This option does not include any Fabric services or capacities; it only covers the Power BI Premium features.

Option 2: Combining F4 Capacity with Power BI Pro Licenses

  • Alternatively, the company can opt for a combination of an F4 Fabric capacity and 60 Power BI Pro licenses.
  • Cost Calculation: F4 capacity at $525 per month + (60 Power BI Pro licenses x $10 = $600) = $1,125 per month. Additional storage and other service costs may apply.
  • Benefits: This option is not only more cost-effective compared to Option 1, but it also provides access to broader Fabric services beyond just Power BI, enhancing the organization’s overall data management capabilities.

Option 2 offers a more economical and service-inclusive approach. Furthermore, it opens up opportunities to scale up using higher Fabric capacities with reserved (1-year) pricing for even greater efficiency and cost savings in the long run.
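
The arithmetic behind the two options is simple enough to script, which can be handy when the user count or list prices change. The per-user and F4 prices below are the ones quoted in this post and may differ by region or over time; storage and other consumption costs are excluded.

USERS = 60
PPU_PER_USER = 20     # Premium Per User, $/user/month
PRO_PER_USER = 10     # Power BI Pro, $/user/month
F4_CAPACITY = 525     # F4 pay-as-you-go capacity, $/month

option_1 = USERS * PPU_PER_USER                # PPU licenses only
option_2 = F4_CAPACITY + USERS * PRO_PER_USER  # F4 capacity + Pro licenses

print(f"Option 1 (PPU only): ${option_1:,}/month")      # $1,200/month
print(f"Option 2 (F4 + Pro): ${option_2:,}/month")      # $1,125/month
print(f"Monthly saving with Option 2: ${option_1 - option_2:,}")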

Table: Fabric SKU and Power BI SKUs for reference calculations and comparisons

Scenario 3: A medium-sized business is looking to implement analytics solutions using Fabric services and reporting using Power BI. They also want to share Power BI content for collaborative decision-making. What are the licensing options in Fabric?

Considerations:

1. Since the organization wants to share Power BI content, it will need Power BI Premium or equivalent Fabric capacities (F64 and above).
2. Microsoft is transitioning Power BI Premium capacities to automatically become Fabric capacities, which brings more flexibility for organizations while keeping costs the same (when compared with PPU licenses).
3. It would be wise to start with F64 Pay-As-You-Go, check performance and other factors such as bursting in the monthly bills, and then decide on the final Fabric capacity with reserved pricing to realize up to 40% in savings.

Scenario 4: An organization is looking to use Co-Pilot extensively across the platform. What Fabric capacity can they start with?

Considerations: A minimum of F64 SKU is required to be able to use Co-Pilot.

The table above provides a reference for understanding how different SKUs align with specific user needs and organizational goals, helping to further clarify the most effective licensing strategies for various roles within a company.

Key Considerations for selecting the right Fabric SKU and License

Now that we have seen some practical scenarios related to making licensing decisions, let us list out the key considerations for selecting the optimal Fabric SKU and license:

  • Organization Size & Usage Patterns:
    • A large organization with diverse data needs will likely require a higher-capacity SKU and more user licenses. Consider a mix of per-user and capacity licenses – analyze which teams work heavily in Fabric vs. those who are light consumers.
    • If your organization already uses Power BI extensively, or it’s central to your use of Fabric, having at least one Pro or PPU license is essential.
  • Workload Types and Frequency:
    • Batch vs. real-time processing: One-time bulk data migrations might benefit from short-term bursts, while continuous streaming needs consistent capacity.
    • Complexity of transformations: Resource-intensive tasks like complex data modeling, machine learning, or large-scale Spark jobs will consume more CUs than simple data movement.
    • Frequency of Power BI Use: Frequent dataset refreshes and report queries in Power BI significantly increase compute resource consumption.
    • Content Sharing/Copilot usage: To share Power BI content freely across the organization or to use Copilot, you must be on an F64 or higher SKU.
  • Operational Time:
    • Pay-as-you-go vs. Reserved (1-year) pricing: Reserved capacity locks in a price for consistent usage, while pay-as-you-go is better for sporadic workloads. Reserved licensing provides roughly 40% savings over pay-as-you-go.
    • Pausing: You can pause your existing pay-as-you-go license when the capacity is not in use, resulting in cost savings.
    • Development vs. production: Dev environments can often use lower tiers or be paused when idle to reduce costs.
  • Region:
    • Costs vary by Azure region. Align your Fabric deployment with your primary user location to minimize data egress charges.
  • Power BI Premium: While Power BI licenses have not changed in Fabric, it is important to consider that the Power BI premium license would be merged with Fabric (F) licenses. The Free and Pro licenses would not be impacted.
  • Mixed Use: You may need to consider purchasing both Fabric (capacity) and Power BI licenses for sharing content across the organization.

How to Bring These Factors into Your Planning

Before beginning the Fabric deployment, consider these steps to ensure you choose the right SKU and licensing options:

  • Start with Baselining: Before scaling up, run pilot workloads to understand your capacity unit (CU) consumption patterns. This helps in accurately predicting resource needs and avoiding unexpected costs.
  • Estimate Growth: Project future data volumes, user counts, and evolving analytics needs. This foresight ensures that your chosen capacity can handle future demands without frequent upgrades.
  • Right-size, Don’t Overprovision: Initially, select an SKU that slightly exceeds your current needs. Microsoft Fabric’s flexibility allows you to scale up as necessary, preventing unnecessary spending on excess capacity.
  • Regularly Monitor Usage: Utilize the Capacity Metrics App to track resource usage and identify trends. This ongoing monitoring allows for timely adjustments and optimization of your resource allocation, ensuring cost-effectiveness.

Power BI Capacity Metrics App: Your Cost Control Center in Fabric

The Power BI Capacity Metrics App is an essential tool for understanding how different Microsoft Fabric components consume resources. It provides:

  • Detailed reports and visualizations on compute and storage usage.
  • Visibility into cost trends, potential overages, and optimization opportunities.
  • Help in staying within your budget.


Microsoft Fabric has streamlined licensing and pricing options, offering significant benefits at both capacity and storage levels:

Capacity Benefits (image credits: Microsoft)

Storage Benefits (image credits: Microsoft)

In this blog, we’ve explored the intricacies of Microsoft Fabric’s pricing and licensing, along with practical considerations for making informed purchase decisions. If you want to integrate Fabric into your business, you can purchase capacities and licenses from the Azure Portal, or reach out to us to discuss your use case.

Advanced Data Strategies in Power BI: A Guide to Integrating Custom Partitions with Incremental Refresh
Published: Fri, 03 May 2024 | https://www.tigeranalytics.com/perspectives/blog/advanced-data-strategies-in-power-bi-a-guide-to-integrating-custom-partitions-with-incremental-refresh/

Explore advanced data management strategies in Power BI through a detailed examination of integrating Custom Partitions with Incremental Refresh to efficiently handle large datasets. Key benefits such as improved query performance, more efficient data refresh, and better data organization are outlined, along with a practical guide on implementing these strategies in Power BI environments.

D, a data engineer with a knack for solving complex problems, recently faced a challenging task. A client needed a smart way to manage their data in Power BI, especially after acquiring new companies. This meant separating newly acquired third-party data from their existing internal data, while also ensuring that historical data remained intact and accessible. The challenge? This process involved refreshing large data sets, sometimes as many as 25 million rows for a single year, just to incorporate a few thousand new entries. This task was not just time-consuming but would also put a strain on computational resources.

At first glance, Power BI’s Custom Partitions seemed like a promising solution. It would allow D to organize data neatly, separating third-party data from internal data as the client wanted. However, Power BI typically partitions data by date, not by the source or type of data, which made combining Custom Partitions with Incremental Refresh—a method that updates only recent changes rather than the entire dataset—a bit of a puzzle.

Limitations of Custom Partition and Incremental Refresh in Power BI

Custom Partitions offer the advantage of dividing the table into different parts based on the conditions defined, enabling selective loading of partitions during refreshes. However, Power BI’s built-in Incremental Refresh feature, while automated and convenient, has limitations in terms of customization. It primarily works on date columns, making it challenging to partition the table based on non-date columns like ‘business region’.


Incremental Refresh Pros:

  • Partition creation is automated, and partitions are updated based on date automatically; no manual intervention is needed.

Incremental Refresh Cons:

  • Cannot define two separate partitioning logics for the same table (e.g., based on a flag column).
  • Does not support the movement of data using the Power BI Pipeline feature.

Custom Partitions Pros:

  • Partitions can be created based on our own logic.
  • Supports the movement of data using the Power BI Pipeline feature.

Custom Partitions Cons:

  • All partition creation and maintenance must be done manually.

To tackle these challenges, D came up with another solution. By using custom C# scripts and Azure Functions, D found a way to integrate Custom Partitions with an Incremental Refresh in the Power BI model. This solution not only allowed for efficient management of third-party and internal data but also streamlined the refresh process. Additionally, D utilized Azure Data Factory to automate the refresh process based on specific policies, ensuring that data remained up-to-date without unnecessary manual effort.

This is how we at Tiger Analytics solved our client’s problem and separated third-party data from internal data. In this blog, we’ll explore the benefits of combining Custom Partitions with Incremental Refresh and, based on our experience, how this combination can enhance data management in Power BI and provide a more efficient and streamlined approach to data processing.

Benefits of combining Incremental Refresh with Custom Partitions in Power BI

Merging the capabilities of Incremental Refresh with Custom Partitions in Power BI offers a powerful solution to overcome the inherent limitations of each approach individually. This fusion enables businesses to fine-tune their data management processes, ensuring more efficient use of resources and a tailored fit to their specific data scenarios.

Leveraging tools like Azure Function Apps, the Table Object Model (TOM) library, and Power BI’s XMLA endpoints, automating the creation and management of Custom Partitions becomes feasible. This automation grants the flexibility to design data partitions that meet precise business needs while enjoying the streamlined management and automated updates provided by Power BI.


Optimizing Query Performance:

  • Custom Partitions improve query performance by dividing data into logical segments based on specific criteria, such as a flag column.
  • When combined with an Incremental Refresh, only the partitioned data that has been modified or updated needs to be processed during queries.
  • This combined approach reduces the amount of data accessed, leading to faster query response times and improved overall performance.

Efficient Data Refresh:

  • Incremental Refresh allows Power BI to refresh only the recently modified or added data, reducing the time and resources required for data refreshes.
  • When paired with Custom Partitions, the refresh process can be targeted to specific partitions, rather than refreshing the entire dataset.
  • This targeted approach ensures that only the necessary partitions are refreshed, minimizing processing time and optimizing resource utilization.

Enhanced Data Organization and Analysis:

  • Custom Partitions provide a logical division of data, improving data organization and making it easier to navigate and analyze within the data model.
  • With Incremental Refresh, analysts can focus on the most recent data changes, allowing for more accurate and up-to-date analysis.
  • The combination of Custom Partitions and Incremental Refresh enables more efficient data exploration and enhances the overall data analysis process.

Scalability for Large Datasets:

  • Large datasets can benefit significantly from combining Custom Partitions and Incremental Refresh.
  • Custom Partitions allow for efficient querying of specific data segments, reducing the strain on system resources when dealing with large volumes of data.
  • Incremental Refresh enables faster and more manageable updates to large datasets by focusing on the incremental changes, rather than refreshing the entire dataset.

Implementation Considerations:

  • Combining Custom Partitions and Incremental Refresh may require a workaround, such as using calculated tables and parameters.
  • Careful planning is necessary to establish relationships between the partition table, data tables, and Incremental Refresh configuration.
  • Proper documentation and communication of the combined approach are essential to ensure understanding and maintainability of the solution.

How to implement Incremental Refresh and Custom Partitions: A step-by-step guide

Prerequisites:

Power BI Premium Capacity or PPU License: The use of XMLA endpoints, which are necessary for managing Custom Partitions, is limited to Power BI Premium capacities. Alternatively, you can utilize Power BI premium-per-user (PPU) licensing to access these capabilities.
PPU: https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-per-user-faq
Xmla Reference: https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-connect-tools

Dataset Published to Premium Workspace: The dataset for implementing Custom Partitions and Incremental Refresh should be published to a Power BI Premium workspace.

Permissions for Azure Functions and Power BI Admin Portal: To automate the creation and management of Custom Partitions, you need the appropriate permissions. This includes the ability to create and manage Azure Functions and the necessary rights to modify settings in Power BI’s Admin portal.

  • In the Function App, navigate to Settings -> Identity and turn on the system-assigned managed identity.
  • Next, create a security group in Azure and add the Function App as a member.
  • In Power BI, navigate to the Admin portal and add the security group to the admin API setting that allows service principals to use Power BI APIs.
  • In the workspace, go to Access and add the Function App as a member of the workspace.

Check Incremental Refresh Policy: The incremental refresh policy must be set to false (disabled) on the table before partitions can be created through code.


Fulfilling these prerequisites will enable effective utilization of Custom Partitions and Incremental Refresh in Power BI.

Implementation at a glance:

Create an Azure Function with .NET as the Runtime Stack: Begin by adding the necessary DLL files for Power BI model creation and modification to the Azure Function console.

Connect to the Power BI Server Using C# Code: Establish a connection by passing the required connection parameters, such as the connection string and the table name where partitions need to be implemented. (C# code and additional details are available in the GitHub link provided in the note section).

Develop Code for Creating Partitions: Utilize the inbuilt functions from the imported DLL files to create partitions within the Power BI server.

Implement Code for Multiple Partitions: Use a combination of for-loop and if-conditions to develop code capable of handling multiple partitions.

There are two types of data partitions to consider based on the Flag value:

  • Flag Value ‘Y’ will be stored in a single partition, referred to as the ABC Partition.
  • Flag Value ‘N’ will be partitioned based on the date column, adhering to the incremental logic implemented. (Examples of partition naming include 2020, 2021, 2022, 202301, 202302, 202303, etc., up to 202312, 202401, 202402).

Check and Create ABC Partition if it does not exist: The initial step in the logic involves verifying the existence of the ABC Partition. If it does not exist, the system should create it.

Implement Logic Within the Looping Block:

  • The first action is to verify the existence of yearly partitions for the last three years. If any are missing, they should be created.
  • Next, combine all partitions from the previous year into a single-year partition.
  • Subsequently, create new partitions for the upcoming year until April.
  • Any partitions outside the required date range should be deleted.
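
To make the windowing above concrete, here is a small Python sketch of what the target partition list could look like for a given run date. This is only an illustration of the naming scheme; the actual partition management is done in the C# Azure Function referenced later, and the exact window (three yearly partitions, monthly partitions through April of the next year) is an assumption inferred from the example names above.

from datetime import date

def expected_partitions(run_date: date, history_years: int = 3) -> list:
    # Flag = 'Y' rows always live in the single ABC partition
    parts = ['ABC']
    # Yearly partitions for the previous `history_years` years, e.g. 2021, 2022, 2023
    parts += [str(run_date.year - i) for i in range(history_years, 0, -1)]
    # Monthly partitions from January of the current year through April of the next year
    months = [(run_date.year, m) for m in range(1, 13)]
    months += [(run_date.year + 1, m) for m in range(1, 5)]
    parts += [f"{y}{m:02d}" for y, m in months]
    return parts

print(expected_partitions(date(2024, 4, 1)))
# ['ABC', '2021', '2022', '2023', '202401', ..., '202504']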

Automate Partition Refresh with a Pipeline: Establish a pipeline designed to trigger the Azure function on the 1st of April of every year, aligning with the business logic.

Partition Logical flow: flag ‘Y’ data goes to the single ABC partition, while flag ‘N’ data is partitioned by date (yearly and monthly) as described above.

Step-by-Step implementation:

  • From the home dashboard, search for and select the Function App service. Enter all necessary details, review the configuration, and click ‘Create’.

  • Configure the function’s runtime settings.

  • Check that the required DLLs are present.

  • Navigate to Development Tools -> Advanced Tools -> Go to open the Kudu console.

  • In the Kudu console, go to CMD -> site -> wwwroot -> <function name> -> bin and paste all the DLLs.

  • The primary coding work, including the creation of partitions, is done in the run.csx file. This is where you’ll write the C# code.

  • The resulting partitions should match the scheme described above: a single ABC partition plus the yearly and monthly date partitions.
    Input body to the function:

    {
        "connectionstring":"Connection details",
        "datasource": "datasource name",
        "workspace": "workspace name",
        "dataset": "dataset name",
        "table": "Table name",
        "partition": "Y",
        "sourceschema": "Schema name",
        "sourceobject": "Source object Name table or view name",
        "partitionstatement": "Year",
        "history": "2"
    }
    

Refresh the selected partition using Azure Pipeline:

  • Create an Azure pipeline that uses a Web activity to call the REST API refresh method on the Power BI model.
  • The first step is to have an app registered with access to the Power BI workspace and model.
  • Using the app, obtain an AAD token for authentication.
  • With the AAD token, use the built-in refresh POST method of the REST API to refresh the required table and partitions (a Python sketch of these REST calls appears after the activity descriptions below).
  • To make the pipeline wait until the refresh is complete, use the built-in refreshes GET method of the REST API. Polling the GET method within the pipeline to monitor the refresh status ensures the process completes successfully.
  • The pipeline is built in a modular way: the workspace ID, dataset ID, table name, and partition name are passed as parameters.
  • The pipeline can refresh any model, as long as the app used in the pipeline has access to that model and workspace.

    What does each activity in the pipeline mean:
    • Get Secret from AKV: This block accepts the key vault URL and the name of the secret that holds the secret for the app used to access Power BI. Its output is the secret value.
    • Get AAD Token: This block accepts the tenant ID, app ID, and the output of Get Secret from AKV, and returns a token that enables access to the Power BI model.
    • Get Dataset Refresh: This block accepts the workspace ID, dataset ID, request body, and the token from the previous block, and triggers the refresh of the tables and partitions passed in the body. It uses the POST method.

    Until Refresh Complete:

    • Wait: To ensure the refresh completes, this block checks every 10 seconds.
    • Get Dataset: This block takes the workspace ID, dataset ID, and request body and follows the GET method. The output is the list of ongoing refreshes on the model.
    • Set Dataset: Assigns the output of the previous block to a variable. The Until loop runs until the variable is no longer equal to ‘Unknown’.
    • If Condition: This step checks whether the refresh has failed. If so, the pipeline’s execution is considered unsuccessful.
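
For readers who want to see what those POST and GET calls look like outside of ADF, here is a minimal Python sketch of the same flow against the Power BI REST API (enhanced refresh). It assumes the service principal already has workspace access as described in the prerequisites; the tenant/app IDs, workspace and dataset IDs, and the FactSale table/partition names are placeholders to substitute with your own values.

import time
import requests

# Placeholders -- replace with your own values
TENANT_ID, CLIENT_ID, CLIENT_SECRET = "<tenant-id>", "<app-id>", "<app-secret>"
WORKSPACE_ID, DATASET_ID = "<workspace-id>", "<dataset-id>"

# 1. Get an AAD token for the service principal (the 'Get AAD Token' step)
token = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "https://analysis.windows.net/powerbi/api/.default",
    },
).json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}
base = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}"

# 2. Trigger a refresh scoped to specific tables/partitions (the POST step)
body = {"type": "full",
        "objects": [{"table": "FactSale", "partition": "202402"},
                    {"table": "FactSale", "partition": "202403"}]}
requests.post(f"{base}/refreshes", headers=headers, json=body).raise_for_status()

# 3. Poll the refresh history until the latest refresh is no longer 'Unknown' (the GET loop)
while True:
    latest = requests.get(f"{base}/refreshes?$top=1", headers=headers).json()["value"][0]
    if latest["status"] != "Unknown":  # 'Completed' or 'Failed'
        break
    time.sleep(10)
print("Refresh finished with status:", latest["status"])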

Refresh the selected partition using Azure Function:

  • Follow the same steps as above (steps 1-6) to create the Azure Function used for refresh.
  • In the Code + Test pane, add the C# code shared in the GitHub repository.


Input body to the function:

{
    "workspace": "Workspace Name",
    "dataset": "Semantic Model Name",
    "tables": [
        {
            "name": "Table1RefreshallPartitions",
            "refreshPartitions": false
        },
        {
            "name": "Table2Refreshselectedpartitions",
            "refreshPartitions": true,
            "partitions": [
                "202402",
                "202403",
                "202404"
            ]
        }
    ]
}

Both Incremental Refresh and Custom Partitions in Power BI are essential for efficiently managing data susceptible to change within a large fact table. They allow you to optimize resource utilization, reduce unnecessary processing, and maintain control over partition design to align with your business needs. By combining these features, you can overcome the limitations of each approach and ensure a streamlined and effective data management solution.

References:

https://www.tackytech.blog/how-to-automate-the-management-of-custom-partitions-for-power-bi-datasets/

Note: The following GitHub link contains the Azure Function code, the request bodies passed to the functions, and the pipeline JSON files. Copy the JSON file inside the pipeline folder into an ADF pipeline (renaming the pipeline as mentioned in the file) to recreate the pipeline.

https://github.com/TirumalaBabu2000/Incremental_Refresh_and_Custom_Partition_Pipeline.git

Decoding the Dilemma: AI-Driven Analytics Products – To Build or To Buy?
Published: Tue, 06 Oct 2020 | https://www.tigeranalytics.com/perspectives/blog/decoding-the-dilemma-ai-driven-analytics-products-to-build-or-to-buy/

Understand the pros and cons of creating vs. purchasing AI-based analytics solutions based on factors like cost, customization, time-to-market, etc. Know when to develop in-house capabilities and when to purchase ready-made solutions with real-world and strategic considerations.

Data fuels digital transformations. These days there’s a start-up on every corner touting AI and Big Data solutions. Large product companies are expanding their offerings to include insight solutions. Every consulting company is developing its own product to increase client stickiness. Product sales and subscription revenue attract preposterously high valuations, a mouth-watering prospect indeed for any business, regardless of size.

It goes without saying that companies offering these analytics products have a lot to gain. But is it the right business decision to invest in AI/Big Data/Advanced Analytics products for your company? If your industry rivals get on the bandwagon, what happens to your competitive advantage? If data is the new gold, will the small clause about Intellectual Property (IP), lurking unobtrusively in your contract, give away the keys to your kingdom?

Off-the-shelf analytics products and the lure of omniscience

If data is ore, insights are the precious metal within and every modern organization is sitting on a fortune. Most organizations recognize the strategic value of their data and are building in-house analytics teams.

Uncovering value at speed is vital to competitive advantage and teams need time to scale, so every business function is in search of a solution that gets quick results.

It’s an archetype of cartoons, cinema, and TV: serious-looking men and women key in the problem, and after some completely unnecessary beeps and boops, the omniscient supercomputer spits out the answer. It’s tantalizing to think of your company owning a ready-to-use AI-driven platform that magically solves all your business problems.

Whether it’s forecasting; revenue optimization; strategic pricing; supplier analytics; or market mix modeling, product companies claim they’ve incorporated other players’ learnings into their own offerings, saving you, the buyer, time you would otherwise have spent in experimentation.

What’s wrong with this picture?

Buying off the shelf sounds attractively simple, but as is so often the case, the devil is in the details:

  • A standardized product isn’t a particularly effective strategic differentiator, because your competitors can buy it too.
  • Each product company has its own product roadmap, and the differentiating capabilities your business is looking for may be quite far down the road. You may end up waiting a long time for a solution that really meets your needs – which themselves will have evolved while you were waiting.
  • Let’s say your service provider builds the features you want into their product, and let’s also say those features are built just the way you want them: with your business experience, use cases, and expertise feeding the algorithms. This is great until you realize that you’ve shared all your tacit knowledge to improve the solution, but the fine print of your contract says it’s your service provider who owns the IP – which is now available to all your competitors!
  • Most niche product companies are built on the promise of a high valuation that will ultimately attract the acquisitive attention of a larger player. This invariably results in the dissolution of the acquired entity and a change in priorities, or worse, the retirement of the product you pinned your hopes on.
  • Most analytics products solve problems in silos using data at hand today. In the future, there may be new data sources, business models, and technologies available. Your business may need cross-functional perspectives that a readymade solution can’t support.

Consider these scenarios from the CPG industry:

  • You procured a forecasting solution all ready to use. Your cloud provider has developed a new algorithm that looks promising, but you’re forced to wait for your solution provider to upgrade so you can try it.
  • You bought one solution for trade optimization and another for market mix modeling. Now your company wants to optimize spending across both, but the platforms can’t cross-pollinate.
  • You bought a tool for strategic pricing which included a volume transfer matrix. You are rationalizing SKUs and need to estimate demand transferability. You will either need to develop this capability from scratch or pay top dollar to the provider because the tool itself is a black box.

In a nutshell, buying ready-to-use means committing your company to a suboptimal solution that gives you no sustained advantage and that will only drift further from your needs over time.

So bespoke has to be better, right?

Not quite.

Building a customized AI-driven solution from the ground up is fraught with its own risks:

  • Data acquisition, quality, and harmonization are bigger challenges – and far more common – than you would think.
  • You will have to experiment with, create, and train the models – all necessary but time-consuming activities that delay value realization from your data.
  • You may not have the right talent to develop and scale your solution.

This sounds like a classic Build vs. Buy problem!

The times, they are a-changin’

In the ‘traditional’ software world, most large companies implementing ERP solutions gravitated towards analytics products that met their transactional needs. This made sense, because:

  • There’s limited differentiation in running transactional processes.
  • Core processes don’t change drastically over time.
  • The huge maintenance cost precludes the development of custom solutions.

Insight solutions are different. The underlying technologies are evolving so fast that solutions need to change all the time just to keep up.

For example, we developed a forecasting solution originally using an ARIMAX model. One year later, there was Facebook Prophet and then there were Deep State Models, which are more accurate and easy to maintain though not explainable. The solution needed to evolve to keep up with the available options.

Your business’s competitive advantage comes from continually evolving and improving the performance of your insight models. You can hire service providers to do this for you or employ data scientists yourself, as do a growing number of companies these days.

Open IP: the winning balance

We believe that if insight is a competitive differentiator, a business should never tie itself down with a product whose IP is a black box. Your company’s insight solution should be maintainable, expandable, and upgradable independent of the service provider.

That’s why Tiger Analytics takes an Open IP approach: we invest in developing business solutions and accelerators for clients like you where you keep access to the IP and source code.

By complementing your business expertise, this unique approach reduces the time to value by leveraging our IP and prowess with data to get your insight solution operational and scaled up much quicker.

Proprietary analytics products and solutions can never fully give your business the lead it needs. By adopting an Open IP approach, your company will retain the ability to add competitive differentiation without losing your competitive edge.

What do you think of the Open IP approach? Is Build vs. Buy really a game with no winners? Is there another way?

Decoding the 80/20 Rule: Data Science and the Pareto Principle
Published: Mon, 18 Mar 2019 | https://www.tigeranalytics.com/perspectives/blog/decoding-the-80-20-rule-data-science-and-the-pareto-principle/

See how the Pareto Principle, or the 80-20 rule, can transform your data science practices and leverage this powerful tool for project prioritization, problem scoping, data planning, analysis, modeling, and business communication. Get information on how focusing on a few can optimize efforts and drive impactful results in analytics projects.

More than a century ago, Vilfredo Pareto, a professor of Political Economy, published the results of his research on the distribution of wealth in society. The dramatic inequalities he observed, e.g., 20% of the people owned 80% of the wealth, surprised economists, sociologists, and political scientists. Over the last century, several pioneers in varied fields observed this disproportionate distribution in several situations, including business. The theory that a vital few inputs/causes (e.g. 20% of inputs) directly influence a significant majority of the outputs/effects (e.g. 80% of outputs) came to be known as the Pareto Principle – also referred to as the 80-20 rule.

Pareto Principle

It is a very simple yet extremely powerful management tool. Business executives have long used it for strategic planning and decision making. Observations such as 20% of the stores generate 80% of the revenue, 20% of software bugs cause 80% of the system crashes, 20% of the product features drive 80% of the sales, etc., are popular, and analytically savvy businesses try to find such Paretos in their worlds. This way they are able to plan and prioritize their actions. In fact, today, data science plays a big role in sifting through tons of complex data to help identify future Paretos.
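
As a tiny illustration of how a data science team might surface such a Pareto from data, the snippet below ranks stores by revenue and keeps the ‘vital few’ that account for roughly 80% of the total. The figures are invented purely to show the mechanics.

import pandas as pd

# Toy data: which stores make up ~80% of revenue?
sales = pd.DataFrame({
    "store": [f"S{i}" for i in range(1, 11)],
    "revenue": [520, 410, 300, 90, 70, 60, 50, 40, 35, 25],
})
sales = sales.sort_values("revenue", ascending=False)
sales["cum_share"] = sales["revenue"].cumsum() / sales["revenue"].sum()

# The 'vital few': stores whose cumulative share stays within ~80% of total revenue
vital_few = sales[sales["cum_share"] <= 0.80]
print(vital_few[["store", "revenue", "cum_share"]])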

While data science is helping predict new Paretos for businesses, data science can benefit from taking a look internally, searching for Paretos within. Exploiting these can make data science significantly more efficient and effective. In this article, I’ll share a few ways in which we, as data scientists, can use the power of the Pareto Principle to guide our day-to-day activities.

Project Prioritization

If you are a data science leader/manager, you’d inevitably need to help develop the analytics strategy for your organization. While different business leaders can share their needs, you have to articulate all these organizational (or business unit) needs and prioritize them into an analytic roadmap. A simple approach is to quantify the value of solving each analytic need and sort them in the decreasing order of value. You’ll often notice that the top few problems/use-cases are disproportionately valuable (Pareto Principle), and should be prioritized above the others. In fact, a better approach would be to quantify the complexity of solving/implementing each problem/use-case, and prioritizing them based on a trade-off between value and complexity (e.g. by laying them on a plot with value on the y-axis and complexity on the x-axis).

Problem Scoping

Business problems tend to be vague and unstructured, and a data scientist’s job involves identifying the right scope. Scoping often requires keeping the focus on the most important aspects of the problem and deprioritizing aspects that are of less value. To start with, looking at the distribution of outputs/effects over inputs/causes will help us understand if high-level Paretos exist in the problem space. Subsequently, we can choose to look at only certain inputs/outputs or causes/effects. For example, if 20% of stores generate 80% of sales, we can group the rest of the stores into a cluster and do the analysis instead of evaluating them individually.

Scoping also involves evaluating risks – deeper evaluation will often tell us that the top items pose a significantly higher risk while the bottom ones have a very remote chance of occurring (the Pareto Principle). Rather than address all risks, we can possibly prioritize our time and efforts towards a few of the key risks.

Data Planning

Complex business problems require data beyond what is readily available in analytic data marts. We need to request access, purchase, fetch, scrape, parse, process, and integrate data from internal/external sources. These come in different shapes, sizes, health, complexity, costs etc. Waiting for the entire data plan to fall in place can cause project delays that are not in our control. One simple approach could be to categorize these data needs based on their value to the end solution, e.g. Absolute must have, Good to have, and Optional (the Pareto Principle). This will help us focus on the Absolute must-haves, and not get distracted or delayed by the Optional items. In addition to value, considering aspects of cost, time, and effort of data acquisition will help us better prioritize our data planning exercise.

Analysis

It’s anecdotally said that a craftsman completes 80% of their work using only 20% of their tools. This holds true for us data scientists as well. We tend to use a few analyses and models for a significant part of our work (the Pareto Principle), while the other techniques get used much less frequently. Typical examples during exploratory analysis include variable distributions, anomaly detection, missing value imputation, correlation matrices etc. Similarly, examples during the modeling phase include k-fold cross-validation, actual vs. predicted plots, misclassification tables, analyses for hyperparameter tuning etc. Building mini automation (e.g. libraries, code snippets, executables, UIs) to use/access/implement these analyses can bring significant efficiencies in the analytic process.

Modeling

During the modeling phase, it doesn’t take us long to arrive at a reasonable working model early in the process. The majority of the accuracy gains have been made by now (the Pareto Principle). The rest of the process is about fine-tuning the models and pushing for incremental accuracy gains. Sometimes, these incremental gains are required to make the solution viable for the business. On other occasions, the fine-tuning doesn’t add much value to the eventual insight/proposition. As data scientists, we need to be cognizant of these situations, so that we know where to draw the line accordingly.

Business Communication

Today’s data science ecosystem is very multi-disciplinary. Teams include business analysts, machine learning scientists, big data engineers, software developers, and multiple business stakeholders. A key driver of the success of such teams is communication. As someone who is working hard, you might be tempted to communicate all the work – challenges, analyses, models, insights etc. However, in today’s world of information overload, taking such an approach will not help. We will need to realize that there are ‘useful many but a vital few’ (the Pareto Principle) and use this understanding to simplify the amount of information we communicate. Similarly, what we present and highlight needs to be customized based on the target audience (business stakeholders vs. data scientists).

The Pareto Principle is a powerful tool in our arsenal. Used the right way, it can help us declutter and optimize our activities.

First published on www.kdnuggets.com/2016/06/identify-right-data-your-team.html
