Business Intelligence Archives - Tiger Analytics

What is Data Observability Used For?
https://www.tigeranalytics.com/perspectives/blog/what-is-data-observability-used-for/ (Fri, 27 Sep 2024 10:35:54 +0000)

Learn how Data Observability can enhance your business by detecting crucial data anomalies early. Explore its applications in improving data quality and model reliability, and discover Tiger Analytics' solution. Understand why this technology is attracting major investments and how it can enhance your operational efficiency and reduce costs.

Imagine you’re managing a department that handles account openings in a bank. All services seem fine, and the infrastructure seems to be working smoothly. But one day, it becomes clear that no new account has been opened in the last 24 hours. On investigation, you find that this is because one of the microservices involved in the account opening process is taking a very long time to respond.

In this case, a data analyst investigating the problem could use traces with triggers based on processing time, but there should be an easier way to spot such anomalies.

Traditional monitoring records the performance of infrastructure and applications. Data observability, by contrast, lets you track your data flows and find faults in them (and may even extend to business processes). While traditional tools analyze infrastructure and applications using metrics, logs, and traces, data observability applies data analysis in a broader sense.

So how does data observability help with the case of no new accounts created in 24 hours? Instead of hand-crafted, trace-based triggers, an observability tool that tracks the account-creation metric would flag the sudden drop automatically. Consider another scenario.

A machine learning model is used to predict future events, such as the volume of future sales, by utilizing regularly updated historical data. However, because the input data may not always be of perfect quality, the model can sometimes produce inaccurate forecasts. These inaccuracies can lead to either excess inventory for the retailer or, worse, out-of-stock situations when there is consumer demand.

Classifying and Addressing Unplanned Events

The point of data observability is to identify so-called data downtime: a sudden, unplanned event in your business, infrastructure, or code that leads to an abrupt change in the data. In other words, data observability is the process of finding anomalies in data.

How can you classify these events?

  • Exceeding a given metric value, or an abnormal jump in a given metric. This type is the simplest. Imagine that you add 80-120 clients every day (a confidence interval with some probability), and then one day you add only 20. Something may have caused the sudden drop, and it's worth looking into (a minimal check of this kind is sketched just after this list).
  • An abrupt change in data structure. Let's return to the earlier example with clients. Everything was fine, but one day the contact information field began to receive empty values. Something may have broken in your data pipeline, and it's better to check.
  • The occurrence of a certain condition or deviation from it. Just as GPS coordinates should not show a truck in the ocean, banking transactions should not suddenly appear in unexpected locations or in unusual amounts that deviate significantly from the norm.
  • Statistical anomalies. During a routine check, the bank’s analysts notice that on a particular day, the average ATM withdrawal per customer spiked to $500, which is significantly higher than the historical average.
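
To make the first two categories concrete, here is a minimal, self-contained sketch of the kind of check an observability tool might run behind the scenes. The thresholds, field names, and sample values are purely illustrative assumptions.

from statistics import mean, stdev

# Illustrative history: number of new clients added per day
daily_new_clients = [95, 102, 88, 110, 97, 105, 92, 101, 99, 108]

def is_metric_anomaly(history, today_value, z_threshold=3.0):
    """Flag today's value if it falls outside mean +/- z_threshold * stddev."""
    mu, sigma = mean(history), stdev(history)
    return abs(today_value - mu) > z_threshold * sigma

def null_rate(records, field):
    """Share of records where a given field is empty - a simple structure check."""
    missing = sum(1 for r in records if not r.get(field))
    return missing / len(records) if records else 0.0

print(is_metric_anomaly(daily_new_clients, 20))   # True - only 20 clients today
print(null_rate([{"contact": "a@b.com"}, {"contact": ""}], "contact"))  # 0.5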

On the one hand, there is nothing fundamentally new in classifying abnormal events and taking the necessary remedial action. On the other hand, until recently there were no comprehensive, specialized tools for these tasks.

Data Observability is Essential for Ensuring Fresh, Accurate, and Smooth Data Flow

Data observability serves as a checkup for your systems. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.

Persona | Why questions | Observability use case | Business outcome

Business User
  • WHY are data quality metrics in Amber/Red?
  • WHY is my dataset/report not accurate?
  • WHY do I see a sudden demand for my product, and what is the root cause?
Observability use case: Data Quality, Anomaly Detection, and RCA
Business outcomes:
  • Improve the quality of insights
  • Boost trust and confidence in decision-making

Data Engineers / Data Reliability Engineers
  • WHY is there data downtime?
  • WHY did the pipeline fail?
  • WHY is there an SLA breach in data freshness?
Observability use case: Data Pipeline Observability, Troubleshooting, and RCA
Business outcomes:
  • Better productivity
  • Faster MTTR
  • Enhanced pipeline efficiency
  • Intelligent triaging

Data Scientists
  • WHY are the model predictions not accurate?
Observability use case: Data Quality Model
Business outcome:
  • Improve model reliability

Tiger Analytics’ Continuous Observability Solution

The solution provides continuous monitoring and alerting of potential issues (gathered from various sources) before a customer or operations team reports them. It consists of a set of tools, patterns, and practices for building data observability components for your big data workloads on a cloud platform, with the goal of reducing data downtime.

Select examples of our experience in Data observability and Quality

Tools and Technology

[Image: Data Observability]

Tiger Analytics' Data Observability is a set of tools, patterns, and best practices to:

  • Ingest MELT (Metrics, Events, Logs, Traces) data
  • Enrich and store MELT data to derive insights on event and log correlations, data anomalies, pipeline failures, and performance metrics
  • Configure data quality rules using a self-service UI
  • Monitor operational metrics such as data quality, pipeline health, and SLAs
  • Alert the business team when there is data downtime
  • Perform root cause analysis
  • Fix broken pipelines and data quality issues

Which will help:

  • Minimize data downtime using automated data quality checks
  • Discover data problems before they impact the business KPIs
  • Accelerate Troubleshooting and Root Cause Analysis
  • Boost productivity and reduce operational cost
  • Improve Operational Excellence, QoS, Uptime
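
As a simplified illustration of the configure-monitor-alert loop in the capability list above, here is a minimal sketch of how declarative data quality rules might be evaluated against table statistics. The rule format, table and column names, and the alert function are hypothetical stand-ins, not part of the actual solution.

from datetime import datetime, timezone

# Hypothetical, self-service-style rule definitions
rules = [
    {"table": "accounts", "check": "freshness_minutes", "threshold": 60},
    {"table": "accounts", "check": "max_null_rate", "column": "contact_info", "threshold": 0.05},
]

def evaluate(rule, stats):
    """Return True if the rule passes, given precomputed table statistics."""
    if rule["check"] == "freshness_minutes":
        age = (datetime.now(timezone.utc) - stats["last_loaded_at"]).total_seconds() / 60
        return age <= rule["threshold"]
    if rule["check"] == "max_null_rate":
        return stats["null_rates"].get(rule["column"], 0.0) <= rule["threshold"]
    return True

def alert(rule):
    # Stand-in for a real notification channel (email, Slack, pager, ...)
    print(f"DATA DOWNTIME alert: rule failed for table {rule['table']}: {rule['check']}")

stats = {
    "last_loaded_at": datetime.now(timezone.utc),
    "null_rates": {"contact_info": 0.20},  # 20% empty values - should trip the rule
}
for rule in rules:
    if not evaluate(rule, stats):
        alert(rule)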

Data observability and Generative AI (GenAI) can play crucial roles in enhancing data-driven decision-making and machine learning (ML) model performance.

Data observability instills confidence by keeping data high-quality and always available, which forms the foundation of any data-driven initiative, while GenAI helps realize what is achievable with that data, opening up new avenues to simulate, generate, and innovate. Organizations can use both to improve their data capabilities, decision-making processes, and innovation across different areas.

Investor interest reflects this: Monte Carlo, a company that produces a data monitoring tool, has raised $135 million, Observe has raised $112 million, and Acceldata $100 million - all building on strong technology positions in the data observability space.

To summarize

Data observability is an approach to identifying anomalies in business processes and in the operation of applications and infrastructure, allowing users to respond quickly to emerging incidents. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.

Even if the underlying technology is not especially novel, there is certainly novelty in the approach, the tools, and the new terms, which make it easier to convince investors and clients. The next few years will show how successful the new players in this market will be.


The post What is Data Observability Used For? appeared first on Tiger Analytics.

Implementing Context Graphs: A 5-Point Framework for Transformative Business Insights
https://www.tigeranalytics.com/perspectives/blog/implementing-context-graphs-a-5-point-framework-for-transformative-business-insights/ (Wed, 04 Sep 2024 05:49:00 +0000)

This comprehensive guide outlines three phases: establishing a Knowledge Graph, developing a Connected Context Graph, and integrating AI for auto-answers. Learn how this framework enables businesses to connect data points, discover patterns, and optimize processes. The article also presents a detailed roadmap for graph implementation and discusses the integration of Large Language Models with Knowledge Graphs.

Remember E, the product manager who used Context Graphs to unravel a complex web of customer complaints? Her success story inspired a company-wide shift in data-driven decision-making.

“This approach could change everything,” her CEO remarked during her presentation. “How do we implement it across our entire operation?”

E’s answer lay in a comprehensive framework designed to unlock the full potential of their data. In this article, we’ll explore Tiger Analytics’ innovative 5-point Graph Value framework – a roadmap that guides businesses from establishing a foundational Knowledge Graph to leveraging advanced AI capabilities for deeper insights.

The 5-Point Graph Value

At Tiger Analytics, we have identified a connected 5-point Graph Value framework that enables businesses to unlock the true potential of their data through a phased approach, leading to transformative insights and decision-making. The 5-point Graph Value framework consists of three distinct phases, each building upon the previous one to create a comprehensive and powerful solution for data-driven insights.

[Image: Five-Point Graph Values]

Phase 1: Knowledge Graph (Base)

The first phase focuses on establishing a solid foundation with the Knowledge Graph. This graph serves as the base, connecting all the relevant data points and creating a unified view of the business ecosystem. By integrating data from various sources and establishing relationships between entities, the Knowledge Graph enables businesses to gain a holistic understanding of their operations.

In this phase, two key scenarios demonstrate the power of the Knowledge Graph:

1. Connect All Dots
Role-based Universal View: Gaining a Holistic Understanding of the Business
A business user needs to see a connected view of Product, Plant, Material, Quantity, Inspection, Results, Vendor, PO, and Customer complaints. With a Knowledge Graph, this becomes a reality. By integrating data from various sources and establishing relationships between entities, the Knowledge Graph provides a comprehensive, unified view of the product ecosystem. This enables business users to gain a holistic understanding of the factors influencing product performance and customer satisfaction, leading to context-based insights for unbiased actions.

2. Trace & Traverse
Trace ‘Where Things’: Context-based Insights for R&D Lead
An R&D Lead wants to check Package material types and their headspace combination patterns with dry chicken batches processed in the last 3 months. With a Knowledge Graph, this information can be easily traced and traversed. The graph structure allows for efficient navigation and exploration of the interconnected data, enabling the R&D Lead to identify patterns and insights that would otherwise be hidden in traditional data silos. This trace and traverse capability empowers the R&D Lead to make informed decisions based on a comprehensive understanding of the data landscape.
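
To give a concrete, purely illustrative flavor of such a trace-and-traverse question, the sketch below expresses it with the Neo4j Python driver and Cypher. The connection details, node labels (Batch, PackageMaterial), relationship types, and property names are assumptions made for this example rather than the actual graph model.

from neo4j import GraphDatabase

# Hypothetical connection details and schema
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (b:Batch)-[:USES_PACKAGING]->(p:PackageMaterial)
WHERE b.product_type = 'dry chicken'
  AND b.processed_at >= date() - duration('P3M')
RETURN p.material_type AS material_type,
       p.headspace     AS headspace,
       count(b)        AS batch_count
ORDER BY batch_count DESC
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["material_type"], record["headspace"], record["batch_count"])

driver.close()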

Phase 2: Connected Context Graph

Building upon the Knowledge Graph, the second phase introduces the Connected Context Graph. This graph incorporates the temporal aspect of data, allowing businesses to discover patterns, track changes over time, and identify influential entities within their network.

Two scenarios showcase the value of the Connected Context Graph:

3. Discover more Paths & Patterns
Uncover Patterns: Change History and its weighted impacts for an Audit
An auditor wants to see all the changes that happened for a given product between 2021 and 2023. With a Connected Context Graph, this becomes possible. The graph captures the temporal aspect of data, allowing for the discovery of patterns and changes over time. This enables the auditor to identify significant modifications, track the evolution of the product, and uncover potential areas of concern. The Connected Context Graph provides valuable insights into the change history and its weighted impacts, empowering the auditor to make informed decisions and take necessary actions.

4. Community Network
Network Community: Identifying Influencers and Optimizing Processes
A business user wants to perform self-discovery on the Manufacturer and Vendor network for a specific Plant, Products, and Material categories within a specific time window. The Connected Context Graph enables the identification of community networks, revealing the relationships and interdependencies between various entities. This allows the business user to identify key influencers, critical suppliers, and potential risk factors within the network. By understanding the influential entities and their impact on the supply chain, businesses can optimize their processes and make strategic decisions to mitigate risks and improve overall performance.
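
For a feel of what community detection and influence scoring look like in code, here is a minimal sketch using NetworkX on a toy manufacturer-vendor network. The entities and edges are made up for illustration; in practice such algorithms would typically run inside the graph platform on the full Connected Context Graph.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy plant/vendor/material network - purely illustrative
G = nx.Graph()
G.add_edges_from([
    ("Plant A", "Vendor 1"), ("Plant A", "Vendor 2"), ("Plant A", "Material X"),
    ("Vendor 1", "Material X"), ("Vendor 2", "Material X"),
    ("Plant B", "Vendor 3"), ("Plant B", "Vendor 4"),
    ("Vendor 3", "Material Y"), ("Vendor 4", "Material Y"),
    ("Vendor 2", "Plant B"),   # weak cross-link between the two clusters
])

# Detect communities (tightly connected groups of plants, vendors, and materials)
for i, community in enumerate(greedy_modularity_communities(G), start=1):
    print(f"Community {i}: {sorted(community)}")

# Degree centrality as a rough proxy for 'influence' within the network
influencers = sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True)
print("Most connected entities:", influencers[:3])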

Phase 3: Auto-Answers with AI

The final phase of the 5-point Graph Value framework takes the insights derived from the Knowledge Graph and Connected Context Graph to the next level by augmenting them with AI capabilities. This phase focuses on leveraging AI algorithms to identify critical paths, optimize supply chain efficiency, and provide automated answers to complex business questions.

The scenario in this phase illustrates the power of AI integration:

5. Augment with AI
Optimizing Supply Chain Critical Paths and Efficiency
A Transformation Lead wants to identify all the critical paths across the supply chain to improve green scores and avoid unplanned plant shutdowns. By augmenting the Knowledge Graph with AI capabilities, this becomes achievable. AI algorithms can analyze the graph structure, identify critical paths, and provide recommendations for optimization. This enables the Transformation Lead to make data-driven decisions, minimize risks, and improve overall operational efficiency. The integration of AI with the Knowledge Graph opens up new possibilities for business process optimization, workflow streamlining, and value creation, empowering organizations to stay ahead in today’s competitive landscape.
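
To sketch how critical-path identification can be expressed on a graph, the snippet below runs NetworkX's longest-path computation over a small, made-up supply-chain DAG weighted by lead time. The AI-augmented approach described above would layer learned weights, predictions, and recommendations on top of this kind of structural analysis.

import networkx as nx

# Illustrative supply-chain DAG: edges weighted by lead time in days
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("Supplier", "Raw Material", 5),
    ("Raw Material", "Plant 1", 2),
    ("Raw Material", "Plant 2", 7),
    ("Plant 1", "Packaging", 3),
    ("Plant 2", "Packaging", 1),
    ("Packaging", "Distribution", 4),
])

# The critical path is the longest weighted path through the DAG
critical_path = nx.dag_longest_path(G, weight="weight")
length = nx.dag_longest_path_length(G, weight="weight")
print("Critical path:", " -> ".join(critical_path), f"({length} days)")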

A 360-Degree View of Your Product with Context Graphs

By leveraging Knowledge Graphs, businesses can unlock a complete 360-degree view of their products, encompassing every aspect from raw materials to customer feedback. Graph capabilities enable organizations to explore the intricate relationships between entities, uncover hidden patterns, and gain a deeper understanding of the factors influencing product performance. From context-based search using natural language to visual outlier detection and link prediction, graph capabilities empower businesses to ask complex questions, simulate scenarios, and make data-driven decisions with confidence. In the table below, we will delve into the various graph capabilities that can enhance the way you manage and optimize your products.

[Image: Five Steps for Graph Values]

Use Cases of Context Graphs Across Your Product

[Image: Use Cases of Context Graphs]

Graph Implementation Roadmap

The adoption of Context Graphs follows a structured roadmap, encompassing various levels of data integration and analysis:

  • Connected View (Level 1): The foundational step involves creating a Knowledge Graph (KG) that links disparate enterprise data sources, enabling traceability from customer complaints to specific deviations in materials or processes.
  • Deep View (Level 2): This level delves deeper into the data, uncovering hidden insights and implicit relationships through pattern matching and sequence analysis.
  • Global View (Level 3): The focus expands to a global perspective, identifying overarching patterns and predictive insights across the entire network structure.
  • ML View (Level 4): Leveraging machine learning, this level enhances predictive capabilities by identifying key features and relationships that may not be immediately apparent.
  • AI View (Level 5): The pinnacle of the roadmap integrates AI for unbiased, explainable insights, using natural language processing to facilitate self-discovery and proactive decision-making.

[Image: Graph Implementation Roadmap]

Leveraging LLMs and KGs

A significant advancement in Context Graphs is the integration of Large Language Models (LLMs) with Knowledge Graphs (KGs), addressing challenges such as knowledge cutoffs, data privacy, and the need for domain-specific insights. This synergy enhances the accuracy of insights generated, enabling more intelligent search capabilities, self-service analytics, and the construction of KGs from unstructured data.
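
One common pattern behind this synergy is text-to-query: an LLM translates a natural-language question into a graph query, which is then executed against the Knowledge Graph so that answers stay grounded in governed enterprise data. The sketch below is illustrative only - generate_cypher() is a hard-coded stand-in for whatever LLM service would be used, and the connection details and schema are hypothetical.

from neo4j import GraphDatabase

def generate_cypher(question: str) -> str:
    # Stand-in for an LLM call that translates a natural-language question
    # into Cypher constrained to the graph's schema. Hard-coded here.
    return (
        "MATCH (c:Complaint)-[:ABOUT]->(p:Product)<-[:PRODUCED]-(b:Batch) "
        "WHERE c.year = 2022 "
        "RETURN p.name AS product, count(c) AS complaints "
        "ORDER BY complaints DESC LIMIT 5"
    )

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    cypher = generate_cypher("Which products drove the most complaints in 2022?")
    for record in session.run(cypher):
        print(record["product"], record["complaints"])
driver.close()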

Context Graph queries are revolutionizing our machine learning and AI systems, enabling them to make informed and nuanced decisions swiftly. With these tools, we can preemptively identify and analyze similar patterns or paths in raw material lots even before they enter the manufacturing process.

This need to understand the connections between disparate data points is reshaping how we store, connect, and interpret data, equipping us with the context needed for more proactive and real-time decision-making. The evolution in how we handle data is paving the way for a future where immediate, context-aware decision-making becomes a practical reality.

The post Implementing Context Graphs: A 5-Point Framework for Transformative Business Insights appeared first on Tiger Analytics.

Connected Context: Introducing Product Knowledge Graphs for Smarter Business Decisions
https://www.tigeranalytics.com/perspectives/blog/connected-context-introducing-product-knowledge-graphs-for-smarter-business-decisions/ (Wed, 04 Sep 2024 05:38:52 +0000)

Explore how Product Knowledge Graphs, powered by Neo4j, are reshaping data analytics and decision-making in complex business environments. This article introduces the concept of Connected Context and illustrates how businesses can harness graph technology to gain deeper insights, improve predictive analytics, and drive smarter strategies across various functions.

E, a seasoned product manager at a thriving consumer goods company, was suddenly in the throes of a crisis. The year 2022 began with an alarming spike in customer complaints, a stark contrast to the relatively calm waters of 2021. The complaints were not limited to one product or region; they were widespread, painting a complex picture that E knew she had to decode.

The company’s traditional methods of analysis, rooted in linear data-crunching, were proving to be insufficient. They pointed to various potential causes: a shipment of substandard raw materials, a series of human errors, unexpected deviations in manufacturing processes, mismatches in component ratios, and even inconsistent additives in packaging materials. The list was exhaustive, but the connections were elusive.

The issue was complex-no single factor was the culprit. E needed to trace and compare the key influencers and their patterns, not just within a single time frame but across the tumultuous period between 2021 and 2022. The domino effect of one small issue escalating into a full-blown crisis was becoming a daunting reality.

To trace the key influencers and their patterns across the tumultuous period between 2021 and 2022, E needed a tool that could capture and analyze the intricate relationships within the data. At Tiger Analytics, we recognized the limitations of conventional approaches and introduced the concept of the Product Knowledge Graph, powered by Neo4j, along with the Context Graph, a term we coined to describe a specialized graph-based data structure. This specialized sub-graph, drawn from the Master Graph, emphasized the contextual information and intricate connections specific to the issue at hand. It provided a visual and analytical representation that weighted different factors and their interrelations.

[Image: Why Graph]

The Context Graph illuminated the crucial 20% of factors that were contributing to 80% of the problems—the Pareto Principle in action. By mapping out the entire journey from raw material to customer feedback, the Context Graph enabled E to pinpoint the specific combinations of factors that were causing the majority of the complaints. With this clarity, E implemented targeted solutions to the most impactful issues.

What is a Context Graph and Why Do We Need It?

In today’s complex business landscape, traditional databases often fall short in revealing crucial relationships within data. Context Graphs address this limitation by connecting diverse data points, offering a comprehensive view of your business ecosystem.

“The term Context Graph refers to a graph-based data structure (a sub-graph of the Master Graph) used to represent the contextual information, relationships, or connections between data entities, events, and processes at specific points in time. It might be used in various applications, such as enhancing natural language understanding, recommendation systems, or improving the contextual awareness of artificial intelligence.”

At Tiger Analytics, we combine graph technology with Large Language Models to build Product Knowledge Graphs, unifying various data silos like Customer, Batch, Material, and more. The power of Context Graphs lies in their ability to facilitate efficient search and analysis from any starting point. Users can easily query the graph to uncover hidden insights, enhance predictive analytics, and improve decision-making across various business functions.

By embracing Context Graphs, businesses gain a deeper understanding of their operations and customer interactions, paving the way for more informed strategies and improved outcomes.

[Image: Connected Context Graph]

This comprehensive approach is set to redefine the landscape of data-driven decision-making, paving the way for enhanced predictive analytics, risk management, and customer experience.

6 Ways Graphs Enhance Data Analytics

[Image: Why Graph DB]

1. Making Connections Clear: If data is like a bunch of dots, by itself, each dot doesn’t tell you much. A Context Graph connects these dots to show how they’re related. This is like drawing lines between the dots to make a clear picture.

2. Understanding the Big Picture: In complex situations, just knowing the facts (like numbers and dates) isn’t enough. You need to understand how these facts affect each other. Context Graphs show these relationships, helping you see the whole story.

3. Finding Hidden Patterns: Sometimes, important insights are hidden in the way different pieces of data are connected. Context Graphs can reveal these patterns. For example, in a business, you might discover that when more people visit your website (one piece of data), sales in a certain region go up (another piece of data). Without seeing the connection, you might miss this insight.

4. Quick Problem-Solving: When something goes wrong, like a drop in product quality, a Context Graph can quickly show where the problem might be coming from. It connects data from different parts of the process (like raw material quality, production dates, and supplier information) to help find the source of the issue (a small illustrative trace of this kind is sketched just after this list).

5. Better Predictions and Decisions: By understanding how different pieces of data are connected, businesses can make smarter predictions and decisions. For example, they can forecast which product combo will be popular in the future or decide where to invest their resources for the best results.

6. Enhancing Artificial Intelligence and Machine Learning: Context Graphs feed AI and machine learning systems with rich, connected data. This helps these systems make more accurate and context-aware decisions, like identifying fraud in financial transactions or personalizing recommendations for customers.
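
To make points 1 and 4 a little more tangible, here is a small, self-contained sketch that links made-up records from three separate systems (complaints, batches, raw material lots) into a graph and traces a complaint back to a candidate source - the kind of connect-the-dots traversal a Context Graph performs at enterprise scale. All names and records are illustrative.

import networkx as nx

# Illustrative records from three otherwise disconnected systems
complaints = [{"id": "C1", "product": "Soup Mix", "batch": "B7"}]
batches = [{"id": "B7", "plant": "Plant A", "raw_lot": "RL-42"}]
raw_lots = [{"id": "RL-42", "supplier": "Vendor 2"}]

G = nx.DiGraph()
for c in complaints:
    G.add_edge(c["id"], c["batch"], relation="ABOUT_BATCH")
for b in batches:
    G.add_edge(b["id"], b["raw_lot"], relation="USED_LOT")
    G.add_edge(b["id"], b["plant"], relation="MADE_AT")
for lot in raw_lots:
    G.add_edge(lot["id"], lot["supplier"], relation="SUPPLIED_BY")

# Trace the complaint back to the supplier of the raw material lot
path = nx.shortest_path(G, source="C1", target="Vendor 2")
print(" -> ".join(path))   # C1 -> B7 -> RL-42 -> Vendor 2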

The power of Context Graphs in solving complex business problems is clear. By illuminating hidden connections and patterns in data, these graph-based structures offer a new approach to decision-making and problem-solving. From E’s product quality crisis to broader applications in predictive analytics and AI, Context Graphs are changing how businesses understand and utilize their data.

In Part 2 of this series, we’ll delve deeper into the practical aspects, exploring a framework approach to implementing these powerful graph structures in your organization.

The post Connected Context: Introducing Product Knowledge Graphs for Smarter Business Decisions appeared first on Tiger Analytics.

Solving Merchant Identity Extraction in Finance: Snowpark's Data Engineering Solution
https://www.tigeranalytics.com/perspectives/blog/solving-merchant-identity-extraction-in-finance-snowparks-data-engineering-solution/ (Fri, 26 Jul 2024 10:07:10 +0000)

Learn how a fintech leader solved merchant identification challenges using Snowpark and local testing. This case study showcases Tiger Analytics' approach to complex data transformations, automated testing, and efficient development in financial data processing. Discover how these solutions enhanced fraud detection and revenue potential.

In the high-stakes world of financial technology, data is king. But what happens when that data becomes a labyrinth of inconsistencies? This was the challenge faced by a senior data engineer at a leading fintech company.

“Our merchant identification system is failing us. We’re losing millions in potential revenue and our fraud detection is compromised. We need a solution, fast.”

The issue was clear but daunting. Every day, their system processed millions of transactions, each tied to a merchant. But the merchant names attached to those transactions were far from consistent. For instance, a well-known retail chain might be listed under its official name, a common abbreviation, a location-specific identifier, or simply a generic category. This inconsistency was wreaking havoc across the business.

Initial Approach: Snowflake SQL Procedures

Initially, the data engineer and his team developed Snowflake SQL procedures to handle this complex data transformation. While these procedures worked, the team wanted to add automated testing pipelines and quickly ran into limitations. “We need more robust regression and automated testing capabilities. And we need to implement these tests without constantly connecting to a Snowflake account.” This wasn't possible with traditional Snowflake SQL procedures, pushing them to seek external expertise.

Enter Tiger Analytics: A New Approach with Snowpark and Local Testing Framework

After understanding the challenges, the Tiger team proposed a solution: leveraging Snowpark for complex data transformations and introducing a local testing framework. This approach aimed to solve the merchant identity issue and improve the entire data pipeline process.

To meet these requirements, the team turned to Snowpark. Snowpark enabled them to perform complex data transformations and manipulations within Snowflake, leveraging the power of Snowflake’s computational engine. However, the most crucial part was the Snowpark Python Local Testing Framework. This framework allowed the team to develop and test their Snowpark DataFrames, stored procedures, and UDFs locally, fulfilling the need for regression testing and automated testing without connecting to a Snowflake account.

Key Benefits

  • Local Development: The team could develop and test their Snowpark Python code without a Snowflake account. This reduced the barrier to entry and sped up their iteration cycles.
  • Efficient Testing: By utilizing familiar testing frameworks like PyTest, the team integrated their tests seamlessly into existing development workflows.
  • Enhanced Productivity: The team quickly iterated on their code with local feedback, enabling rapid prototyping and troubleshooting before deploying to their Snowflake environment.

Overcoming Traditional Unit Testing Limitations

In the traditional sense of unit testing, Snowpark does not support a fully isolated environment independent of a Snowflake instance. Typically, unit tests would mock a database object, but Snowpark lacks a local context for such mocks. Even using the create_dataframe method requires Snowflake connectivity.

The Solution with Local Testing Framework

Despite these limitations, the Snowpark Python Local Testing Framework enabled the team to create and manipulate DataFrames, stored procedures, and UDFs locally, which was pivotal for this use case. Here's how the Tiger team did it:

Setting Up the Environment

First, set up a Python environment:

pip install "snowflake-snowpark-python[localtest]"
pip install pytest

Next, create a local testing session:

from snowflake.snowpark import Session
session = Session.builder.config('local_testing', True).create()

Creating Local DataFrames

The Tiger team created DataFrames from local data sources and operated on them:

table = 'example'
session.create_dataframe([[1, 2], [3, 4]], ['a', 'b']).write.save_as_table(table)

Operating on these DataFrames was straightforward:

from snowflake.snowpark.functions import col

df = session.create_dataframe([[1, 2], [3, 4]], ['a', 'b'])
res = df.select(col('a')).where(col('b') > 2).collect()
print(res)

Creating UDFs and Stored Procedures

The framework allowed the team to create and call UDFs and stored procedures locally:

from snowflake.snowpark.functions import udf, sproc, call_udf, col
from snowflake.snowpark.types import IntegerType, StringType

@udf(name='example_udf', return_type=IntegerType(), input_types=[IntegerType(), IntegerType()])
def example_udf(a, b):
    return a + b

@sproc(name='example_proc', return_type=IntegerType(), input_types=[StringType()])
def example_proc(session, table_name):
    return session.table(table_name)\
        .with_column('c', call_udf('example_udf', col('a'), col('b')))\
        .count()

# Call the stored procedure by name
output = session.call('example_proc', table)

Using PyTest for Efficient Testing

The team leveraged PyTest for efficient unit and integration testing:

PyTest Fixture

In the conftest.py file, the team created a PyTest fixture for the Session object:

import pytest
from snowflake.snowpark.session import Session

def pytest_addoption(parser):
    parser.addoption("--snowflake-session", action="store", default="live")

@pytest.fixture(scope='module')
def session(request) -> Session:
    if request.config.getoption('--snowflake-session') == 'local':
        return Session.builder.configs({'local_testing': True}).create()
    else:
        snowflake_credentials = {} # Specify Snowflake account credentials here
        return Session.builder.configs(snowflake_credentials).create()
Using the Fixture in Test Cases

from project.sproc import my_stored_proc

def test_create_fact_tables(session):
    expected_output = ...
    actual_output = my_stored_proc(session)
    assert expected_output == actual_output

Running Tests

To run the test suite locally:

pytest --snowflake-session local

To run the test suite against your Snowflake account:

pytest

Addressing Unsupported Functions

Some functions were not supported in the local testing framework. For these, the team used patch functions and MagicMock:

Patch Functions

For unsupported functions like upper(), the team used patch functions:

from unittest.mock import patch
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

session = Session.builder.config('local_testing', True).create()

@patch('snowflake.snowpark.functions.upper')
def test_to_uppercase(mock_upper):
    # Stand in for the unsupported function with a passthrough expression the local
    # engine can evaluate; resolving upper via F ensures the patched attribute is used
    mock_upper.side_effect = lambda c: c
    df = session.create_dataframe([('Alice',), ('Bob',)], ['name'])
    result = df.select(F.upper(df['name']))
    collected = [row[0] for row in result.collect()]
    assert collected == ['Alice', 'Bob']
    mock_upper.assert_called_once()
  
MagicMock

For more complex behaviors like explode(), the team used MagicMock:

from unittest.mock import MagicMock, patch
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F
from snowflake.snowpark.functions import col

session = Session.builder.config('local_testing', True).create()

def test_explode_df():
    # explode() is not available in the local testing framework, so stand it in with a mock
    mock_explode = MagicMock(return_value='MOCKED_EXPLODE')
    df = session.create_dataframe([([1, 2, 3],), ([4, 5, 6],)], ['data'])
    with patch('snowflake.snowpark.functions.explode', mock_explode):
        # Code under test would call F.explode(...); here the patched mock is invoked instead
        result = F.explode(col('data'))
        assert result == 'MOCKED_EXPLODE'
        mock_explode.assert_called_once()

test_explode_df()

Scheduling Procedures Limitations

While implementing these solutions, the Tiger team faced issues with scheduling procedures using serverless tasks, so they used tasks attached to a warehouse instead and created a Snowpark-optimized warehouse for the purpose. The team noted that serverless tasks cannot invoke certain object types and functions, specifically:

  • UDFs (user-defined functions) that contain Java or Python code.
  • Stored procedures written in Scala (using Snowpark) or those that call UDFs containing Java or Python code.

Turning Data Challenges into Business Insights

The journey from the initial challenge of extracting merchant identities from inconsistent transaction data to a streamlined, efficient process demonstrates the power of advanced data solutions. The Tiger team leveraged Snowpark and its Python Local Testing Framework, not only solving the immediate problem but also enhancing their overall approach to data pipeline development and testing. The combination of regex-based, narration-pattern-based, and ML-based methods enabled them to tackle the complexity of unstructured bank statement data effectively.

This project’s success extends beyond merchant identification, showcasing how the right tools and methodologies can transform raw data into meaningful insights. For data engineers facing similar challenges, this case study highlights how Snowpark and local testing frameworks can significantly improve data application development, leading to more efficient, accurate, and impactful solutions.

The post Solving Merchant Identity Extraction in Finance: Snowpark’s Data Engineering Solution appeared first on Tiger Analytics.

A Comprehensive Guide to Pricing and Licensing on Microsoft Fabric
https://www.tigeranalytics.com/perspectives/blog/a-comprehensive-guide-to-pricing-and-licensing-on-microsoft-fabric/ (Mon, 01 Jul 2024 12:13:32 +0000)

This comprehensive guide explores Microsoft Fabric's pricing strategies, including capacity tiers, SKUs, and tenant hierarchy, helping organizations optimize their data management costs. It breaks down the differences between reserved and pay-as-you-go models, explaining Capacity Units (CUs) and providing detailed pricing information. By understanding these pricing intricacies, businesses can make informed decisions to fully leverage their data across various functions, leading to more efficient operations and better customer experiences.

Organizations often face challenges in effectively leveraging data to streamline operations and enhance customer satisfaction. Siloed data, complexities associated with ingesting, processing, and storing data at scale, and limited collaboration across departments can hinder a company’s ability to make informed, data-driven decisions. This can result in missed opportunities, inefficiencies, and suboptimal customer experiences.

Here’s where Microsoft’s new SaaS platform “Microsoft Fabric” can give organizations a much-needed boost. By integrating data across various functions, including data science (DS), data engineering (DE), data analytics (DA), and business intelligence (BI), Microsoft Fabric enables companies to harness the full potential of their data. The goal is to enable seamless sharing of data across the organization while simplifying all the key functions of Data Engineering, Data Science, and Data Analytics to facilitate quicker and better-informed decision-making at scale.

For enterprises looking to utilize Microsoft Fabric’s full capabilities, understanding the platform’s pricing and licensing intricacies is crucial, impacting several key financial aspects of the organization:

1. Reserved vs Pay-as-you-go: Understanding pay-as-you-go versus reserved pricing helps in precise budgeting and can affect both initial and long-term operational costs.
2. Capacity Tiers: Clear knowledge of capacity tiers allows for predictable scaling of operations, facilitating smooth expansions without unexpected costs.
3. Fabric Tenant Hierarchy: It is important to understand the tenant hierarchy as this would have a bearing on the organization’s need to buy capacity based on their unique needs.
4. Existing Power BI Licenses: For customers with existing Power BI licenses, it is important to understand how to utilize them (Free/Pro/Premium) and how they tie in with Fabric SKUs.

At Tiger Analytics, our team of seasoned SMEs have helped clients navigate the intricacies of licensing and pricing models for robust platforms like Microsoft Fabric based on their specific needs.

In this blog, we will provide insights into Microsoft Fabric’s pricing strategies to help organizations make more informed decisions when considering this platform.

Overview of Microsoft Fabric:

Microsoft Fabric offers a unified and simplified cloud SaaS platform designed around the following ‘Experiences’:

  • Data Ingestion – Data Factory
  • Data Engineering – Synapse DE
  • Data Science – Synapse DS
  • Data Warehousing – Synapse DW
  • Real-Time Analytics – Synapse RTA
  • Business Intelligence – Power BI
  • Unified storage – OneLake

A Simplified Pricing Structure

Unlike Azure, where each tool has separate pricing, Microsoft Fabric simplifies this by focusing on two primary cost factors:

1. Compute Capacity: A single compute capacity can support all functionalities concurrently, which can be shared across multiple projects and users without any limitations on the number of workspaces utilizing it. You do not need to select capacities individually for Data Factory, Synapse Data Warehousing, and other Fabric experiences.

2. Storage: Storage costs are separate yet simplified, making choices easier for the end customer.


Understanding Fabric’s Capacity Structure

To effectively navigate the pricing and licensing of Microsoft Fabric, it is crucial to understand how a Fabric Capacity is associated with Tenant and Workspaces. These three together help organize the resources within an Organization and help manage costs and operational efficiency.

1. Tenant: This represents the highest organizational level within Microsoft Fabric, and is associated with a single Microsoft Entra ID. An organization could also have multiple tenants.

2. Capacity: Under each tenant, there are one or more capacities. These represent pools of compute and storage resources that power the various Microsoft Fabric services and provide the capabilities for workload execution, analogous to the horsepower of a car engine. The more capacity you provision, the more workloads you can run, or the faster they complete.

3. Workspace: Workspaces are environments where specific projects and workflows are executed. Workspaces are assigned a capacity, which represents the computing resources it can utilize. Multiple workspaces can share the resources of a single capacity, making it a flexible way to manage different projects or departmental needs without the necessity of allocating additional resources for each new project/ department.

[Image: Fabric tenant hierarchy]
The figure above portrays the Tenant hierarchy in Fabric and how different organizations can provision capacities based on their requirements.

Understanding Capacity Levels, SKUs, and Pricing in Microsoft Fabric

Microsoft Fabric capacities are defined by a Stock Keeping Unit (SKU) that corresponds to a specific amount of compute power, measured in Capacity Units (CUs). A CU is a unit that quantifies the amount of compute power available.

Capacity Units (CUs) = Compute Power

As shown in the table below, each SKU (Fn) corresponds to a number of CUs. For example, F4 has double the capacity of F2 but half that of F8.

The breakdown below lists the SKUs available for the West Europe region, with both Pay-As-You-Go and Reserved (1-year) pricing options:

[Table image: Fabric SKUs, CUs, associated Power BI SKUs, and Pay-As-You-Go vs. Reserved pricing for a region]

1 CU Pay-As-You-Go price in the West Europe region = $0.22/hour
1 CU Pay-As-You-Go monthly rate: $0.22 * 730 hours = $160.60, so F2 = $160.60 * 2 = $321.20
1 CU Reserved (1-year) monthly rate: $0.22 * (1 - 0.405) * 730 = ~$95.56, so F2 Reserved = ~$95.56 * 2 = ~$191.11
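
For readers who want to sanity-check or adapt these numbers, the short Python sketch below reproduces the monthly Pay-As-You-Go and Reserved estimates from the hourly CU rate. The $0.22/hour rate, 730 hours per month, and ~40.5% reservation discount are taken from the figures above; actual prices vary by region and change over time, so treat this as illustrative arithmetic rather than an official quote.

HOURS_PER_MONTH = 730
CU_RATE_PER_HOUR = 0.22       # West Europe Pay-As-You-Go rate used above
RESERVATION_DISCOUNT = 0.405  # ~40.5% discount for a 1-year reservation

def monthly_cost(sku_cus, reserved=False):
    """Approximate monthly cost for an F-SKU with the given number of CUs."""
    discount = RESERVATION_DISCOUNT if reserved else 0.0
    return round(sku_cus * CU_RATE_PER_HOUR * (1 - discount) * HOURS_PER_MONTH, 2)

for cus in (2, 4, 64):
    print(f"F{cus}: PAYG ${monthly_cost(cus)}/month, "
          f"Reserved ${monthly_cost(cus, reserved=True)}/month")
# F2: PAYG $321.2/month, Reserved $191.11/month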

Pricing Models Explained:

Pay As You Go: This flexible model allows you to pay monthly based on the SKU you select, making it ideal if your workload demands are uncertain. You can purchase more capacity, upgrade or downgrade your capacity at any time, and even pause your capacities to save costs.

Reserved (1 year): You pay a reserved price monthly for a one-year commitment. Reserved pricing can yield savings of around 40%, but there is no option to pause, and you are billed monthly regardless of capacity usage.

Storage Costs in Microsoft Fabric (OneLake)

In Microsoft Fabric, compute capacity does not include data storage costs. This means that businesses need to budget separately for storage expenses.

  • Storage costs need to be paid for separately.
  • Storage costs in Fabric (OneLake) are similar to ADLS (Azure Data Lake Storage).
  • BCDR (Business Continuity and Disaster Recovery) charges are also included. These come into play when workspaces are deleted but some data still needs to be retrieved from them.
  • Beyond this, there are costs for cache storage (for KQL DB).
  • There are also costs for the transfer of data between regions, known as bandwidth pricing. More details are available in Microsoft's bandwidth pricing documentation.

Optimizing Resource Use in Microsoft Fabric: Understanding Bursting and Smoothing Techniques

Even after purchasing a capacity, your workloads may occasionally demand more resources than it provides.

For this, Fabric offers two mechanisms that allow faster execution (bursting) while flattening usage over time (smoothing) to keep costs optimal.

  • Bursting: Bursting enables the use of additional compute resources beyond your existing capacity to accelerate workload execution. For instance, if a task normally takes 60 seconds using 64 CUs, bursting can allocate 256 CUs to complete the same task in just 15 seconds.
  • Smoothing: Smoothing is applied automatically in Fabric across all capacities to manage brief spikes in resource usage. This method distributes the compute demand more evenly over time, which helps in avoiding extra costs that could occur with sudden increases in resource use.

Understanding Consumption: Where do your Computation Units (CUs) go?

[Image: Fabric components that consume Capacity Units - image credit: Microsoft]

The following components in Fabric consume Capacity Units (CUs):

  • Data Factory Pipelines
  • Data Flow Gen2
  • Synapse Warehouse
  • Spark Compute
  • Event Stream
  • KQL Database
  • OneLake
  • Copilot
  • VNet Data Gateway
  • Data Activator (Reflex)
  • PowerBI

CU consumption depends on the solution implemented for a given piece of functionality. Here's an example for better understanding:

Business Requirement: Ingest data from an on-prem data source and use it for Power BI reporting.

Solution Implemented: Data Factory pipelines with Notebooks to perform DQ checks on the ingested data. PowerBI reports were created pointing to the data in One Lake.

How are CUs consumed?

CUs are consumed every time the Data Factory pipeline executes and invokes the Notebook (Spark compute) to perform data quality checks.

Further, CUs are consumed whenever the data refreshes on the dashboard.

Microsoft Fabric Pricing Calculator:

Microsoft has streamlined the pricing calculation with its online calculator. By selecting your region, currency, and billing frequency (hourly or monthly), you can quickly view the pay-as-you-go rates for all SKUs. This gives you an immediate estimate of the monthly compute and storage costs for your chosen region. Additionally, links for reserved pricing and bandwidth charges are also available.

For more detailed and specific pricing analysis, Microsoft offers an advanced Fabric SKU Calculator tool through partner organizations.

Understanding Fabric Licensing: Types and Strategic Considerations

Licensing in Microsoft Fabric is essential because it legally permits and enables the use of its services within your organizational framework, ensuring compliance and tailored access to various functionalities. Licensing is distinct from pricing, as licensing outlines the terms and types of access granted, whereas pricing involves the costs associated with these licenses.

There are two types of licensing in Fabric:

  • Capacity-Based Licensing: This licensing model is required for operating Fabric’s services, where Capacity Units (CUs) define the extent of compute resources available to your organization. Different Stock Keeping Units (SKUs) are designed to accommodate varying workload demands, ranging from F2 to F2048. This flexibility allows businesses to scale their operations up or down based on their specific needs.
  • Per-User Licensing: User-based licensing carries over from Power BI unchanged in Fabric (for compatibility). The user license types include:
    • Free
    • Pro
    • Premium Per User (PPU)

Each is tailored to a specific set of capabilities, as seen in the table below:

[Table image: capabilities by license type]
Image Credit: Microsoft (https://learn.microsoft.com/en-us/fabric/enterprise/licenses)

Understanding Licensing Scenarios

To optimally select the right Fabric licensing options and understand how they can be applied in real-world scenarios, it’s helpful to look at specific use cases within an organization. These scenarios highlight the practical benefits of choosing the right license type based on individual and organizational needs.

Scenario 1: When do you merely require a Power BI Pro License?

Consider the case of Sarah, a data analyst whose role involves creating and managing Power BI dashboards used organization-wide. These dashboards are critical for providing the leadership with the data needed to make informed decisions. In such a scenario, a Pro License is best because it allows Sarah to:

  • Create and manage Power BI dashboards within a dedicated workspace.
  • Set sharing permissions to control who can access the dashboards.
  • Enable colleagues to build their visualizations and reports from her Power BI datasets, fostering a collaborative work environment.

In the above scenario, a Pro license would suffice (based on the requirements listed above).

Scenario 2: What are the Licensing Options for Small Businesses?

Consider a small business with about 60 users that wants to leverage premium Power BI features (please refer to the comparison table above, which shows the capabilities for Free, Pro, and PPU (Premium Per User) licenses) to enhance its data analysis capabilities. The company has two primary licensing options within Microsoft Fabric to accommodate its needs, each with different cost implications and service access levels.

Option 1: Premium Per User (PPU) Licensing

  • This option involves purchasing a Premium Per User license for each of the 60 users.
  • Cost Calculation: 60 users x $20 per month = $1,200 per month.
  • Note: This option does not include any Fabric services or capacities; it only covers the Power BI Premium features.

Option 2: Combining F4 Capacity with Power BI Pro Licenses

  • Alternatively, the company can opt for a combination of an F4 Fabric capacity and 60 Power BI Pro licenses.
  • Cost Calculation: F4 capacity at $525 per month + (60 Power BI Pro licenses x $10 = $600) = $1,125 per month. Additional storage and other service costs may apply.
  • Benefits: This option is not only more cost-effective compared to Option 1, but it also provides access to broader Fabric services beyond just Power BI, enhancing the organization’s overall data management capabilities.

Option 2 offers a more economical and service-inclusive approach. Furthermore, it opens up opportunities to scale up using higher Fabric capacities with reserved (1-year) pricing for even greater efficiency and cost savings in the long run.
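
This comparison generalizes easily. The sketch below recomputes both options for a few different user counts, using the list prices quoted above ($20/month for PPU, $10/month for Power BI Pro, and $525/month for F4 Pay-As-You-Go); storage and other service costs are excluded, so treat it as a rough planning aid rather than a quote.

PPU_PER_USER = 20      # Premium Per User, $/month
PRO_PER_USER = 10      # Power BI Pro, $/month
F4_CAPACITY = 525      # F4 Pay-As-You-Go, $/month (as quoted above)

def option_1_ppu(users):
    return users * PPU_PER_USER

def option_2_f4_plus_pro(users):
    return F4_CAPACITY + users * PRO_PER_USER

for users in (30, 60, 100):
    o1, o2 = option_1_ppu(users), option_2_f4_plus_pro(users)
    cheaper = "Option 2" if o2 < o1 else "Option 1"
    print(f"{users} users: PPU ${o1}/month vs F4 + Pro ${o2}/month -> {cheaper}")
# 60 users: PPU $1200/month vs F4 + Pro $1125/month -> Option 2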

[Table image: Fabric SKUs and Power BI SKUs for reference calculations and comparisons]

Scenario 3: A medium-sized business is looking to implement analytics solutions using Fabric services, with reporting in Power BI. It also wants to share Power BI content for collaborative decision-making. What are the licensing options in Fabric?

Considerations:

1. Since the organization wants to share Power BI content, it will need Power BI Premium or equivalent Fabric capacities (F64 and above).
2. Microsoft is transitioning Power BI Premium capacities to automatically become Fabric capacities – which brings more flexibility for organizations while keeping costs the same (when compared with PPU licenses).
3. It would be wise to start with F64 Pay-As-You-Go, check performance and other factors such as bursting in the monthly bills, and then settle on a Fabric capacity with reserved pricing to gain savings of up to 40%.

Scenario 4: An organization is looking to use Copilot extensively across the platform. What Fabric capacity can they start with?

Considerations: A minimum of an F64 SKU is required to use Copilot.

The table above provides a reference for understanding how different SKUs align with specific user needs and organizational goals, helping to further clarify the most effective licensing strategies for various roles within a company.

Key Considerations for selecting the right Fabric SKU and License

Now that we have seen some practical scenarios related to making licensing decisions, let us list out the key considerations for selecting the optimal Fabric SKU and license:

  • Organization Size & Usage Patterns:
    • A large organization with diverse data needs will likely require a higher-capacity SKU and more user licenses. Consider a mix of per-user and capacity licenses – analyze which teams work heavily in Fabric vs. those who are light consumers.
    • If your organization already uses Power BI extensively, or it’s central to your use of Fabric, having at least one Pro or PPU license is essential.
  • Workload Types and Frequency:
    • Batch vs. real-time processing: One-time bulk data migrations might benefit from short-term bursts, while continuous streaming needs consistent capacity.
    • Complexity of transformations: Resource-intensive tasks like complex data modeling, machine learning, or large-scale Spark jobs will consume more CUs than simple data movement.
    • Frequency of Power BI Use: Frequent dataset refreshes and report queries in Power BI significantly increase compute resource consumption.
    • Content Sharing / Copilot usage: To share Power BI content freely across the organization, or to use Copilot, you must be on an F64 or higher SKU.
  • Operational Time:
    • Pay-as-you-go vs. Reserved (1-year) pricing: Reserved capacity locks in a price for consistent usage, while pay-as-you-go is better for sporadic workloads. Reserved licensing provides roughly 40% savings over Pay-As-You-Go.
    • Pausing: You can pause your existing pay-as-you-go license when the capacity is not in use, resulting in cost savings.
    • Development vs. production: Dev environments can often use lower tiers or be paused when idle to reduce costs.
  • Region:
    • Costs vary by Azure region. Align your Fabric deployment with your primary user location to minimize data egress charges.
  • Power BI Premium: While per-user Power BI licenses have not changed in Fabric, note that the Power BI Premium capacity license is being merged into Fabric (F) licenses. Free and Pro licenses are not impacted.
  • Mixed Use: You may need to consider purchasing both Fabric (capacity) and Power BI licenses for sharing content across the organization.

How to Bring These Factors into Your Planning

Before beginning the Fabric deployment, consider these steps to ensure you choose the right SKU and licensing options:

  • Start with Baselining: Before scaling up, run pilot workloads to understand your capacity unit (CU) consumption patterns. This helps in accurately predicting resource needs and avoiding unexpected costs.
  • Estimate Growth: Project future data volumes, user counts, and evolving analytics needs. This foresight ensures that your chosen capacity can handle future demands without frequent upgrades.
  • Right-size, Don’t Overprovision: Initially, select an SKU that slightly exceeds your current needs. Microsoft Fabric’s flexibility allows you to scale up as necessary, preventing unnecessary spending on excess capacity.
  • Regularly Monitor Usage: Utilize the Capacity Metrics App to track resource usage and identify trends. This ongoing monitoring allows for timely adjustments and optimization of your resource allocation, ensuring cost-effectiveness.

Power BI Capacity Metrics App: Your Cost Control Center in Fabric

The Power BI Capacity Metrics App is an essential tool for understanding how different Microsoft Fabric components consume resources. It provides:

  • Detailed reports and visualizations on the usage of computing and storage.
  • Empowers you to identify cost trends, potential overages, and optimization opportunities.
  • Helps you to stay within your budget.

[Image: Power BI Capacity Metrics App]

Microsoft Fabric has streamlined licensing and pricing options, offering significant benefits at both capacity and storage levels:

Capacity Benefits

[Image: capacity-level benefits - image credits: Microsoft]

Storage Benefits

[Image: storage-level benefits]

In this blog, we’ve explored the intricacies of Microsoft Fabric’s pricing and licensing, along with practical considerations for making informed purchase decisions. If you want to integrate Fabric into your business, you can purchase the capacities and licenses from the Azure Portal, or reach out to us if you need to discuss your use case.

The post A Comprehensive Guide to Pricing and Licensing on Microsoft Fabric appeared first on Tiger Analytics.

Advanced Data Strategies in Power BI: A Guide to Integrating Custom Partitions with Incremental Refresh
https://www.tigeranalytics.com/perspectives/blog/advanced-data-strategies-in-power-bi-a-guide-to-integrating-custom-partitions-with-incremental-refresh/ (Fri, 03 May 2024 05:35:55 +0000)

Explore advanced data management strategies in Power BI through a detailed examination of integrating Custom Partitions with Incremental Refresh to efficiently handle large datasets. Key benefits such as improved query performance, more efficient data refresh, and better data organization are outlined, along with a practical guide on implementing these strategies in Power BI environments.

D, a data engineer with a knack for solving complex problems, recently faced a challenging task. A client needed a smart way to manage their data in Power BI, especially after acquiring new companies. This meant separating newly acquired third-party data from their existing internal data, while also ensuring that historical data remained intact and accessible. The challenge? This process involved refreshing large data sets, sometimes as many as 25 million rows for a single year, just to incorporate a few thousand new entries. This task was not just time-consuming but would also put a strain on computational resources.

At first glance, Power BI’s Custom Partitions seemed like a promising solution. It would allow D to organize data neatly, separating third-party data from internal data as the client wanted. However, Power BI typically partitions data by date, not by the source or type of data, which made combining Custom Partitions with Incremental Refresh—a method that updates only recent changes rather than the entire dataset—a bit of a puzzle.

Limitations of Custom Partition and Incremental Refresh in Power BI

Custom Partitions offer the advantage of dividing the table into different parts based on the conditions defined, enabling selective loading of partitions during refreshes. However, Power BI’s built-in Incremental Refresh feature, while automated and convenient, has limitations in terms of customization. It primarily works on date columns, making it challenging to partition the table based on non-date columns like ‘business region’.

Partition

Incremental Refresh Pros:

  • Partition creation is automated, and date-based partitions are updated automatically; no manual intervention is needed.

Incremental Refresh Cons:

  • Cannot define two separate partitioning logics, such as splitting data based on a flag column.
  • Does not support the movement of data using the Power BI deployment pipeline feature.

Custom Partitions Pros:

  • Partitions can be created using our own custom logic.
  • Supports the movement of data using the Power BI deployment pipeline feature.

Custom Partitions Cons:

  • All partition creation and maintenance must be done manually.

To tackle these challenges, D came up with another solution. By using custom C# scripts and Azure Functions, D found a way to integrate Custom Partitions with an Incremental Refresh in the Power BI model. This solution not only allowed for efficient management of third-party and internal data but also streamlined the refresh process. Additionally, D utilized Azure Data Factory to automate the refresh process based on specific policies, ensuring that data remained up-to-date without unnecessary manual effort.

This is how we at Tiger Analytics solved our client’s problem and separated the third-party data. In this blog, we’ll explore the benefits of combining Custom Partitions with Incremental Refresh and, based on our experience, how this combination can enhance data management in Power BI and provide a more efficient, streamlined approach to data processing.

Benefits of combining Incremental Refresh with Custom Partitions in Power BI

Merging the capabilities of Incremental Refresh with Custom Partitions in Power BI offers a powerful solution to overcome the inherent limitations of each approach individually. This fusion enables businesses to fine-tune their data management processes, ensuring more efficient use of resources and a tailored fit to their specific data scenarios.

Leveraging tools like Azure Function Apps, the Table Object Model (TOM) library, and Power BI’s XMLA endpoints, automating the creation and management of Custom Partitions becomes feasible. This automation grants the flexibility to design data partitions that meet precise business needs while enjoying the streamlined management and automated updates provided by Power BI.

Fact Sale

Optimizing Query Performance:

  • Custom Partitions improve query performance by dividing data into logical segments based on specific criteria, such as a flag column.
  • When combined with an Incremental Refresh, only the partitioned data that has been modified or updated needs to be processed during queries.
  • This combined approach reduces the amount of data accessed, leading to faster query response times and improved overall performance.

Efficient Data Refresh:

  • Incremental Refresh allows Power BI to refresh only the recently modified or added data, reducing the time and resources required for data refreshes.
  • When paired with Custom Partitions, the refresh process can be targeted to specific partitions, rather than refreshing the entire dataset.
  • This targeted approach ensures that only the necessary partitions are refreshed, minimizing processing time and optimizing resource utilization.

Enhanced Data Organization and Analysis:

  • Custom Partitions provide a logical division of data, improving data organization and making it easier to navigate and analyze within the data model.
  • With Incremental Refresh, analysts can focus on the most recent data changes, allowing for more accurate and up-to-date analysis.
  • The combination of Custom Partitions and Incremental Refresh enables more efficient data exploration and enhances the overall data analysis process.

Scalability for Large Datasets:

  • Large datasets can benefit significantly from combining Custom Partitions and Incremental Refresh.
  • Custom Partitions allow for efficient querying of specific data segments, reducing the strain on system resources when dealing with large volumes of data.
  • Incremental Refresh enables faster and more manageable updates to large datasets by focusing on the incremental changes, rather than refreshing the entire dataset.

Implementation Considerations:

  • Combining Custom Partitions and Incremental Refresh may require a workaround, such as using calculated tables and parameters.
  • Careful planning is necessary to establish relationships between the partition table, data tables, and Incremental Refresh configuration.
  • Proper documentation and communication of the combined approach are essential to ensure understanding and maintainability of the solution.

How to implement Incremental Refresh and Custom Partitions: A step-by-step guide

Prerequisites:

Power BI Premium Capacity or PPU License: The use of XMLA endpoints, which are necessary for managing Custom Partitions, is limited to Power BI Premium capacities. Alternatively, you can utilize Power BI premium-per-user (PPU) licensing to access these capabilities.
PPU: https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-per-user-faq
XMLA reference: https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-connect-tools

Dataset Published to Premium Workspace: The dataset for implementing Custom Partitions and Incremental Refresh should be published to a Power BI Premium workspace.

Permissions for Azure Functions and Power BI Admin Portal: To automate the creation and management of Custom Partitions, you need the appropriate permissions. This includes the ability to create and manage Azure Functions and the necessary rights to modify settings in Power BI’s Admin portal.

  • In the Function App, navigate to Settings -> Identity and turn on the system-assigned managed identity.
  • Next, create a security group in Azure and add the Function App as a member.
  • In Power BI, go to the Admin portal and add the security group to the admin API setting that allows service principals to use Power BI APIs.
  • In the workspace, open Access and add the Function App as a member of the workspace.

Check Incremental Refresh Policy: The Incremental Refresh policy on the table needs to be disabled (set to false) so that partitions can be created on the table through code.

Refresh Policy

Fulfilling these prerequisites will enable effective utilization of Custom Partitions and Incremental Refresh in Power BI.

Implementation at a glance:

Create an Azure Function with .NET as the Runtime Stack: Begin by adding the necessary DLL files for Power BI model creation and modification to the Azure Function console.

Connect to the Power BI Server Using C# Code: Establish a connection by passing the required connection parameters, such as the connection string and the table name where partitions need to be implemented. (C# code and additional details are available in the GitHub link provided in the note section).

Develop Code for Creating Partitions: Use the built-in classes and methods from the imported DLLs to create partitions within the Power BI server.
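
To make this step more concrete, here is a minimal C# sketch of what a partition-creation routine could look like using the Tabular Object Model (TOM) classes from those DLLs. The connection string, names, and source expression below are placeholders, and the actual code shared in the GitHub repository is more complete; treat this as an outline rather than the exact implementation.

using System;
using Microsoft.AnalysisServices.Tabular; // TOM classes from the imported DLLs

public static class PartitionHelper
{
    // Connects to the Premium workspace through its XMLA endpoint and adds a partition
    // to the given table if it does not already exist. All names are placeholders.
    public static void EnsurePartition(string xmlaConnectionString, string datasetName,
                                       string tableName, string partitionName, string sourceExpression)
    {
        var server = new Server();
        try
        {
            // e.g. "Data Source=powerbi://api.powerbi.com/v1.0/myorg/<workspace>;User ID=app:<appId>@<tenantId>;Password=<secret>"
            server.Connect(xmlaConnectionString);

            Database database = server.Databases.GetByName(datasetName);
            Table table = database.Model.Tables.Find(tableName);

            if (table.Partitions.Find(partitionName) == null)
            {
                table.Partitions.Add(new Partition
                {
                    Name = partitionName,
                    // Assumes the table is backed by an M query; the expression would filter
                    // the source view on the flag column ('Y' for the ABC partition) or on
                    // the date range encoded in the partition name.
                    Source = new MPartitionSource { Expression = sourceExpression }
                });
                database.Model.SaveChanges(); // push the metadata change to the service
            }
        }
        finally
        {
            server.Disconnect();
        }
    }
}

The same pattern (connect, find the table, modify partitions, SaveChanges) underlies the looping logic described in the next steps.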

Implement Code for Multiple Partitions: Use a combination of for-loop and if-conditions to develop code capable of handling multiple partitions.

There are two types of data partitions to consider based on the Flag value:

  • Flag Value ‘Y’ will be stored in a single partition, referred to as the ABC Partition.
  • Flag Value ‘N’ will be partitioned based on the date column, adhering to the incremental logic implemented. (Examples of partition naming include 2020, 2021, 2022, 202301, 202302, 202303, etc., up to 202312, 202401, 202402).

Check and Create ABC Partition if it does not exist: The initial step in the logic involves verifying the existence of the ABC Partition. If it does not exist, the system should create it.

Implement Logic Within the Looping Block (a simplified sketch of this loop follows the list below):

  • The first action is to verify the existence of yearly partitions for the last three years. If any are missing, they should be created.
  • Next, combine all partitions from the previous year into a single-year partition.
  • Subsequently, create new partitions for the upcoming year until April.
  • Any partitions outside the required date range should be deleted.
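
Here is a simplified sketch of how that looping block could be written on top of the partition-creation helper. The partition names, the three-year history window, and the BuildSourceExpression helper are illustrative assumptions; the production code in the GitHub repository also handles merging last year’s monthly partitions and other edge cases more carefully.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.AnalysisServices.Tabular;

public static class PartitionMaintenance
{
    // Ensures the expected yearly/monthly partitions exist and removes the ones that
    // fall outside the required range. The "ABC" flag partition is never touched.
    public static void MaintainDatePartitions(Table table, Model model, int historyYears = 3)
    {
        var today = DateTime.UtcNow;
        var expected = new HashSet<string>();

        // Yearly partitions for the previous N full years, e.g. "2022", "2023", "2024".
        for (int i = historyYears; i >= 1; i--)
            expected.Add((today.Year - i).ToString());

        // Monthly partitions from January of the current year through April of next year,
        // e.g. "202501" ... "202604".
        for (var month = new DateTime(today.Year, 1, 1);
             month <= new DateTime(today.Year + 1, 4, 1);
             month = month.AddMonths(1))
            expected.Add(month.ToString("yyyyMM"));

        // Create any missing partitions.
        foreach (var name in expected.Where(n => table.Partitions.Find(n) == null))
            table.Partitions.Add(new Partition
            {
                Name = name,
                Source = new MPartitionSource { Expression = BuildSourceExpression(name) }
            });

        // Delete partitions outside the required range. Collapsing last year's monthly
        // partitions into a single yearly partition happens here as well: the yearly name
        // is in 'expected' while the old monthly names are not.
        foreach (var stale in table.Partitions
                                   .Where(p => p.Name != "ABC" && !expected.Contains(p.Name))
                                   .ToList())
            table.Partitions.Remove(stale);

        model.SaveChanges();
    }

    // Hypothetical helper: returns the M/SQL expression that filters the source view on
    // flag = 'N' and on the year or month encoded in the partition name.
    private static string BuildSourceExpression(string partitionName) =>
        $"/* source expression for partition {partitionName} */";
}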

Automate Partition Refresh with a Pipeline: Establish a pipeline designed to trigger the Azure function on the 1st of April of every year, aligning with the business logic.

Partition Logical flow:

Partition Logical flow

Step-by-Step implementation:

  • From the home dashboard, search for and select the Function App service. Enter all necessary details, review the configuration, and click ‘Create’.

    Create Function App

  • Configure the function’s runtime settings.

    Create Function App

  • Check the required DLLs

    Required dll's

  • Navigate to Development Tools -> Advanced Tools -> Go

    Advanced Tools
    Kudu Plus

  • Open the CMD console, navigate to site -> wwwroot -> <new function name> -> bin, and paste all the DLLs

    dll's

  • The primary coding work, including the creation of partitions, is done in the run.csx file. This is where you’ll write the C# code.

    pbi-dataset-parttition

  • The Partitions should be as below:

    Fact Scale
    pbi-dataset-parttition
    Input body to the function:

    {
        "connectionstring":"Connection details",
        "datasource": "datasource name",
        "workspace": "workspace name",
        "dataset": "dataset name",
        "table": "Table name",
        "partition": "Y",
        "sourceschema": "Schema name",
        "sourceobject": "Source object Name table or view name",
        "partitionstatement": "Year",
        "history": "2"
    }
    

Refresh the selected partition using Azure Pipeline:

Azure Pipeline

  • Create an Azure pipeline that uses a web activity to call the REST API refresh method on the Power BI model.
  • The first step in using the pipeline is to register an app (service principal) with access to the Power BI workspace and model.
  • Then, using the app, get the AAD token for authentication.
  • With the AAD token, call the built-in refresh POST method of the Power BI REST API to refresh the required table and partition.
  • To make the pipeline wait until the refresh is complete, use the built-in refreshes GET method of the REST API. Polling the refresh status with GET requests ensures the process completes successfully.
  • The pipeline is built in a modular way: the workspace ID, dataset ID, table name, and partition name are passed as parameters.
  • The pipeline can trigger a refresh of any model, as long as the app it uses has access to that model and workspace. (A simplified sketch of these REST calls is shown after the activity descriptions below.)
    Pipeline Flow

    What each activity in the pipeline means:
    • Get Secret from AKV: This block accepts the Key Vault URL and the name of the secret that holds the credential of the app used to access Power BI. Its output is the secret value.
    • Get AAD Token: This block accepts the tenant ID, app ID, and the output of Get Secret from AKV, and returns a token that enables access to the Power BI model.
    • Get Dataset Refresh: This block accepts the workspace ID, dataset ID, request body, and the token from the previous block, and triggers the refresh of the tables and partitions specified in the body. It uses the POST method.

    Until Refresh Complete:

    • Wait: Pauses for 10 seconds between checks to give the refresh time to progress.
    • Get Dataset: Accepts the workspace ID, dataset ID, and request body, and follows the GET method. The output is the list of refreshes on the model.
    • Set Dataset: Assigns the refresh status from the previous block to a variable. The Until loop keeps running while this variable equals 'Unknown' (i.e., the refresh is still in progress).
    • If Condition: This step checks if the refresh process has failed. If so, the pipeline’s execution is considered unsuccessful.
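
For reference, the REST calls behind these web activities look roughly like the following C# sketch (the ADF pipeline issues the same HTTP requests declaratively). The tenant, app, workspace, and dataset identifiers are placeholders, and the enhanced-refresh body mirrors the table/partition body used elsewhere in this post; treat this as an illustration rather than the exact pipeline definition.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class PowerBiRefreshClient
{
    private static readonly HttpClient Http = new HttpClient();

    // "Get AAD Token": client-credentials flow for the registered app (service principal).
    public static async Task<string> GetTokenAsync(string tenantId, string appId, string secret)
    {
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["grant_type"] = "client_credentials",
            ["client_id"] = appId,
            ["client_secret"] = secret,
            ["scope"] = "https://analysis.windows.net/powerbi/api/.default"
        });
        var response = await Http.PostAsync(
            $"https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token", form);
        response.EnsureSuccessStatusCode();
        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("access_token").GetString();
    }

    // "Get Dataset Refresh": POST an enhanced refresh for specific tables/partitions.
    public static async Task TriggerRefreshAsync(string token, string workspaceId, string datasetId)
    {
        var url = $"https://api.powerbi.com/v1.0/myorg/groups/{workspaceId}/datasets/{datasetId}/refreshes";
        var body = "{ \"type\": \"full\", \"objects\": [ { \"table\": \"Fact Sale\", \"partition\": \"202402\" } ] }";
        var request = new HttpRequestMessage(HttpMethod.Post, url)
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
        (await Http.SendAsync(request)).EnsureSuccessStatusCode();
    }

    // "Until Refresh Complete": poll the refresh history until the latest refresh leaves
    // the "Unknown" (in progress) state, waiting 10 seconds between checks.
    public static async Task<string> WaitForRefreshAsync(string token, string workspaceId, string datasetId)
    {
        var url = $"https://api.powerbi.com/v1.0/myorg/groups/{workspaceId}/datasets/{datasetId}/refreshes?$top=1";
        while (true)
        {
            var request = new HttpRequestMessage(HttpMethod.Get, url);
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
            var response = await Http.SendAsync(request);
            response.EnsureSuccessStatusCode();
            using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
            var status = doc.RootElement.GetProperty("value")[0].GetProperty("status").GetString();
            if (status != "Unknown") return status; // "Completed", "Failed", etc.
            await Task.Delay(TimeSpan.FromSeconds(10));
        }
    }
}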

Refresh the selected partition using Azure Function:

  • Follow the same steps as above (steps 1-6) to create the Azure Function for the refresh.
  • In the Code + Test pane, add the C# code shared in the GitHub repository.

Model Table Refresh

Input body to the function:

{
    "workspace": "Workspace Name",
    "dataset": "Semantic Model Name",
    "tables": [
        {
            "name": "Table1RefreshallPartitions",
            "refreshPartitions": false
        },
        {
            "name": "Table2Refreshselectedpartitions",
            "refreshPartitions": true,
            "partitions": [
                "202402",
                "202403",
                "202404"
            ]
        }
    ]
}
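
For illustration, a stripped-down version of what the function could do with this body is sketched below in C#. It assumes the body has already been deserialized into simple request classes and that a TOM Model object has been obtained the same way as in the partition function; the class and property names here are placeholders, and the actual code in the GitHub repository may differ.

using System.Collections.Generic;
using Microsoft.AnalysisServices.Tabular;

// Simple shapes mirroring the JSON body above (names are illustrative).
public class TableRefreshRequest
{
    public string name { get; set; }
    public bool refreshPartitions { get; set; }
    public List<string> partitions { get; set; }
}

public static class RefreshFunction
{
    public static void RefreshTables(Model model, IEnumerable<TableRefreshRequest> requests)
    {
        foreach (var request in requests)
        {
            Table table = model.Tables.Find(request.name);
            if (table == null) continue;

            if (!request.refreshPartitions)
            {
                // No partition list supplied: queue a full refresh of the whole table.
                table.RequestRefresh(RefreshType.Full);
            }
            else
            {
                // Queue a refresh only for the listed partitions, e.g. "202402".
                foreach (var partitionName in request.partitions)
                    table.Partitions.Find(partitionName)?.RequestRefresh(RefreshType.Full);
            }
        }

        // SaveChanges executes all queued refresh requests against the service.
        model.SaveChanges();
    }
}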

Both Incremental Refresh and Custom Partitions in Power BI are essential for efficiently managing data susceptible to change within a large fact table. They allow you to optimize resource utilization, reduce unnecessary processing, and maintain control over partition design to align with your business needs. By combining these features, you can overcome the limitations of each approach and ensure a streamlined and effective data management solution.

References:

https://www.tackytech.blog/how-to-automate-the-management-of-custom-partitions-for-power-bi-datasets/

Note: The following GitHub link contains the Azure Function code, the request bodies passed to the functions, and the pipeline JSON files. To recreate the pipeline, copy the JSON file inside the pipeline folder into an ADF pipeline and rename the pipeline as mentioned in the file.

https://github.com/TirumalaBabu2000/Incremental_Refresh_and_Custom_Partition_Pipeline.git

The post Advanced Data Strategies in Power BI: A Guide to Integrating Custom Partitions with Incremental Refresh appeared first on Tiger Analytics.

]]>
Empowering BI through GenAI: How to address data-to-insights’ biggest bottlenecks https://www.tigeranalytics.com/perspectives/blog/empowering-bi-through-genai-how-to-address-data-to-insights-biggest-bottlenecks/ Tue, 09 Apr 2024 07:11:05 +0000 https://www.tigeranalytics.com/?post_type=blog&p=21174 Explore how integrating generative AI (GenAI) and natural language processing (NLP) into business intelligence empowers organizations to unlock insights from data. GenAI addresses key bottlenecks: enabling personalized insights tailored to user roles, streamlining dashboard development, and facilitating seamless data updates. Solutions like Tiger Analytics' Insights Pro leverage AI to democratize data accessibility, automate pattern discovery, and drive data-driven decision-making across industries.

The post Empowering BI through GenAI: How to address data-to-insights’ biggest bottlenecks appeared first on Tiger Analytics.

]]>
The Achilles’ heel of modern business intelligence (BI) lies in the arduous journey from data to insights. Despite the fact that 94% of business and enterprise analytics professionals affirm the critical role of data and analytics in driving digital transformation, organizations often struggle to extract the full value from their data assets.

Three Roadblocks on the Journey from Data-to-Insights

In our work with several Fortune 500 clients across domains, we’ve observed the path to actionable insights extracted from data is hindered by a trifecta of formidable bottlenecks that often prolong time to value for businesses.

  • The pressing need for personalized insights tailored to each user’s role
  • The escalating complexities of dashboard development, and
  • The constant stream of updates and modifications required to keep pace with evolving business needs

As companies navigate this challenging landscape, the integration of Generative AI (GenAI) into BI processes presents a promising solution, empowering businesses to unlock the true potential of their data and stay ahead in an increasingly competitive market.

Challenge 1: Lack of persona-based insights

Every user persona within an organization has different insight requirements based on their roles and responsibilities. Let’s look at real-world examples of such personas for a CPG firm:

  • CEOs seek insights into operational efficiency and revenue, focusing on potential risks and losses
  • Supply Chain Managers prioritize information about missed Service Level Agreements (SLAs) or high-priority orders that might face delays
  • Plant Managers are interested in understanding unplanned downtime and its impact on production

Hence, the ability to slice and dice data for ad-hoc queries is crucial. However, the challenge lies in catering to these diverse needs while ensuring each user gets relevant insights tailored to their role. Manual data analysis and reporting rarely passes this test, as it is time-consuming and often cannot provide the granularity key stakeholders need.

Challenge 2: Growing complexities of dashboard development

Creating multiple dashboards to meet the diverse needs of users requires a lot of time and effort. It typically involves extensive stakeholder discussions to understand their requirements, leading to extended development cycles. The process becomes more intricate as organizations strive to strike the right balance between customization and scalability. With each additional dashboard, the complexity grows, potentially leading to data silos and inconsistencies. Dependency on analysts for ad-hoc analysis also causes more delays in generating actionable insights. The backlog of ad-hoc requests can overwhelm the BI team, diverting their focus from strategic analytics.

Managing various dashboard versions, data sources, and user access permissions adds another layer of complexity, making it difficult to ensure consistency and accuracy.

Challenge 3: Too many updates and modifications

The relentless need to update and modify the dashboard landscape puts immense pressure on the BI teams, stretching their resources and capabilities. Rapidly shifting priorities and data demands can lead to a struggle to align with the latest strategic objectives. Also, constant disruptions to existing dashboards can create user reluctance and hinder the adoption of data-driven decision-making across the organization.

Plus, as businesses grow and evolve, their data requirements change. It leads to constant updates/modifications– triggering delays in delivering insights, especially when relying on traditional development approaches. As a result, the BI team is often overwhelmed with frequent requests.

Empowering BI through GenAI

What if anyone within the organization could effortlessly derive ad-hoc insights through simple natural language queries, eliminating the need for running complex queries or dependence on IT for assistance? This is where the integration of GenAI and NLP proves invaluable, streamlining information access for all key users with unparalleled ease and speed.

At Tiger Analytics we developed Insights Pro, a proprietary GenAI platform to overcome these challenges and deliver faster and more efficient data-to-insights conversions.

In a nutshell, Insights Pro takes a new approach to generating insights and streamlining BI workflows. Rather than contextualizing data using a data dictionary alone, it leverages the power of LLMs for data dictionary analysis and prompt engineering, thus offering:

  • Versatility – Ensures superior data and domain-agnostic performance
  • Contextuality – Comes with an advanced data dictionary that understands column definitions and contexts based on session conversations
  • Scalability – Spans different users and verticals

This democratizes access to data-driven insights, reducing the dependency on dedicated analysts. Whether it’s the CEO, Supply Chain Manager, or Plant Manager, they can directly interact with the platform to get the relevant insights on time and as needed.

Empowering Data-Driven Decision-Making | Applications across various industries

Logistics and Warehousing: AI powered BI solutions can assist in optimizing warehouse operations by analyzing shipment punctuality, fill rates, and comparing warehouse locations. It identifies areas for improvement, determines average rates, and pinpoints critical influencing factors to enhance efficiency and streamline processes.

Transportation: Transportation companies can evaluate carrier performance, identify reasons for performance disparities, and assess overall carrier efficiency. It provides insights into performance gaps, uncovers the causes of delays, and supports informed decision-making to optimize transportation networks.

Supply Chain Management: AI powered BI solution empowers supply chain leaders to identify bottlenecks, such as plants with the longest loading times, compare location efficiency, and uncover factors impacting efficiency. It guides leaders towards clarity and success in navigating the complexities of supply chain operations, facilitating data-driven optimization strategies.

Business Intelligence and Analytics: Analysts are equipped with a comprehensive view of key metrics across various domains, such as shipments across carriers, order-to-delivery times, and modeling to understand influencing factors. It bridges data gaps, simplifies complexities, and offers clarity in data analysis, enabling analysts to derive actionable insights and drive data-informed decision-making.

Undeniably, empowering BI through AI can only be achieved by knocking off time-consuming bottlenecks that hinder data-to-insights conversion.

Tiger Analytics’ Insights Pro also goes a long way toward addressing other challenges associated with Generative AI at the enterprise level. For instance, it addresses data security concerns because only data dictionaries, not the underlying data, are uploaded to the GPT server. It also keeps the data dictionary up to date, so new business terms don’t have to be manually defined in the current session.

Looking ahead, NLP and GenAI-powered solutions will break down barriers to data accessibility and automate the discovery of hidden patterns, empowering users across organizations to leverage data insights through natural language interactions. By embracing solutions like Insights Pro, businesses can unlock the value of their data, drive innovation, and shape a future where data-driven insights are accessible to all.

The post Empowering BI through GenAI: How to address data-to-insights’ biggest bottlenecks appeared first on Tiger Analytics.

]]>
Why Self-Service BI is Important in an Agile Business Scenario https://www.tigeranalytics.com/perspectives/blog/why-self-service-bi-is-important-in-an-agile-business-scenario-3/ Thu, 02 Feb 2023 18:33:25 +0000 https://www.tigeranalytics.com/?p=10826 Self-service BI empowers business users to independently query, visualize, and analyze data, fostering a data-driven culture through accessible learning resources. Read how organizations can harness it towards informed decision-making and collaboration among multiple stakeholders, ultimately enhancing communication channels and data utilization.

The post Why Self-Service BI is Important in an Agile Business Scenario appeared first on Tiger Analytics.

]]>
In organizations where data-driven solutions are valued, Data democratization is becoming a necessity.

Data virtualization and Data federation software act as enablers of Data democratization by eliminating an organization’s data silos and making the data accessible through virtual storage mediums.

Access to data at the right time and in the right manner is crucial for making data-driven decisions. However, classifying the data and granting relevant access to varied users has always been a challenge.

With traditional BI implementations, the responsibility for report development rests primarily on IT teams. As the number of stakeholders and the demand for accessible data increase, Self-Service BI can be regarded as a form of Data Democratization, equipping more business users to work on their own through monitored access profiles and data accessibility, easing the burden off single teams.

Self-Service BI enables business users to access and explore data by running their own queries, creating their own data visualizations, dashboards and reports, filtering, sorting, and analyzing data. The availability of online learning materials, self-paced learning, and access to resources has made it possible for users to feel comfortable using different data touchpoints and dashboards to make insightful decisions. Data democratization has ushered in a data-driven culture which means that access to data can now be shared by multiple stakeholders. All in all, we’re seeing improved communication channels and better collaboration.

Setting up a Self-Service BI

At Tiger Analytics, we partnered with a large-scale silicon component manufacturing company based out of the US, to implement a Self-Service BI solution.

The Higher-Order curated datasets that were built for self-service enabled tech-savvy business users to conduct ad-hoc analyses. This saved a lot of the time and effort that it would have taken them to build a traditional BI report/Dashboard. The users did not have to wait in a queue for their respective requirements, which also meant that they had access to the information they needed much earlier, enabling them to deliver quicker analysis and helping them generate reports faster.

The key advantage of this self-service analysis was that the business user could conduct ad-hoc analyses from time to time, focusing on their high-value priorities, and was now able to get faster results, on the go, rather than going through an entire report development life cycle.

self service enablement

Building a modern data platform that can handle data sources of variety and volume and that can support scalability and concurrency to manage business dynamics is of utmost importance.

While implementing a Self-Service BI within an organization, here are a few of our best practices that you can replicate:

  • Provide proper business names for tables, columns, and measures
  • Create a report catalog page for the users to find the reports and datasets
  • Add a data dictionary page to include definitions of data elements in the report
  • Build templates to create consistent report layouts by users
  • Display only the relevant tables – Hide other config or supporting tables
  • Build proper relationships between the tables since users can simply drag and drop fields across different tables.
  • Add the description to tables and fields for better understanding
  • Add required synonyms for the fields; if users use Q&A, then it will help them
  • Establish a governance committee to enable self-service for required end users
  • Create end-user training modules for effective use of self-service
  • Self-service should be limited to only a specific set of users
  • IT needs to monitor the self-service usage to avoid concurrency and performance issues
  • Make tools available to the end users for self-service enablement
  • Restrict publishing of ad-hoc reports in common workspaces
  • IT to ensure infrastructure is scalable and efficient for Self-service and Reporting needs

We’ve extracted and created shared datasets, created semantic layers on top of the data layer, and defined key data quality metrics, data management, access, and usage policies. The diagram below depicts various stages of the Self-Service BI life cycle.

self service bi lifecycle

The requirement gathering is done at the ‘line of business’ level, unlike traditional BI, where it’s at the ‘report’ level. This enables multiple user personas to use the same shared datasets.

Self-Service User hierarchy

Once we have the baseline requirements, it is imperative to group users into multiple user personas based on their skills and requirements. This will help in creating different roles and defining access for each group.

While working with our client, here’s how we segregated the users into four user personas.

1. Power User

The Power user is a technical user. They have access to the base tables in the database. Power users can create their own views by applying the filters on the tables/views in the database, and they know how to combine data from external files with the tables and create reports.

2. Ad-hoc query user

The Ad-hoc query user knows how to use Power BI. They can connect to curated datasets and create custom visualizations. They are also capable of creating custom calculations in Power BI and have the provision for sharing the report within their line of business.

3. Excel analyst

The Excel Analyst can connect to shared datasets in Excel, create their own custom calculations in Excel, and create pivot tables and charts in Excel.

4. End User

The End User has access to the Reports and Dashboards, can slice and dice data, filter data, and share bookmarks within the team.

self service user hierarchy

Self-Service User journey and process flow

Here’s how we’ve mapped the available features and the users, as shown below:

self service bi table

Once the users are defined, and the datasets are ready, a process flow needs to be defined to document the data access flow for the user. A report catalog page is created to organize all the available reports in various headers.

  • A data dictionary page is created in each report for the users to comprehend the data elements in the report.
  • User training sessions are set up for the Business users to train on report usability.
  • User is assigned a Developer License on the access request.
  • Users are allowed to review the catalog of shared reports and datasets.
  • If the required dataset is available, the user can create a new report on top of the dataset and publish to the department portal and share it with their respective department.
  • If the required dataset is not available, the user requests the development and deployment of new datasets/reports for their use.
  • The user can either share a report at the department level or can create a report to be used at the organizational level and share it in the Enterprise Report portal.

There can be follow-ups to review the progress and share tips and tricks, BI best practices, and templates.

self service bi best practices

One of the most common challenges we’ve seen with Self-Service BI is the lack of adoption by business users. Users might have difficulty understanding how to use a report or a shared dataset available to them, or might create reports with inaccurate analytical results. To ease these issues, it’s good practice to institute Self-Service Office Hours as a forum where the BI team can help users understand what data is available to them and how they can access it.

The members of the BI team make themselves available to help and support business users on an ongoing basis and provide centralized monitoring.

With this, users get their questions answered, and that helps bridge the data literacy gap. This effort also enables collaboration among different teams within the organization. The users can then hop in and hop out of the session as required.

The Road ahead…

Regardless of the size of the organization, data availability is not enough. That data needs to be accessible so that the leadership can use that information to derive useful insights and craft meaningful strategic interventions.

Self-Service BI implementation empowers employees by giving them access to data. And even with our clients, this has considerably reduced the cost of report development, fast-tracked data-driven decisions, and improved collaboration within the organization.

As organizations and their needs continue to evolve, so does their self-service journey, making data-driven insights the new normal.

Read our other articles on BI.

The post Why Self-Service BI is Important in an Agile Business Scenario appeared first on Tiger Analytics.

]]>
Achieving IPL Fantasy Glory with Data-Backed Insights and Strategies https://www.tigeranalytics.com/perspectives/blog/ipl-fantasy-leaderboard-data-analysis/ Tue, 21 Sep 2021 16:22:35 +0000 https://www.tigeranalytics.com/?p=5742 A cricket enthusiast shares insights on building a winning IPL fantasy team. From data analysis tools such as Kaggle and Howstat to tips on player selection, venue analysis, and strategic gameplay, this guide emphasizes the role of statistics in making informed choices, ultimately highlighting the unpredictability of the sport.

The post Achieving IPL Fantasy Glory with Data-Backed Insights and Strategies appeared first on Tiger Analytics.

]]>
When you’re a die-hard cricket fan who watches almost every game, choosing the right players in IPL fantasy may seem as easy as knowing exactly what dish to order in a restaurant that has a massive menu. However, this wasn’t true in my case. In fact, this blog isn’t about me giving gyaan on the subject, but rather a reflection of my trials and errors over the last few years in putting together a team that could make a mark on the leaderboard – even if the slightest.

Of late though, I have been making conscious efforts to better my game, and seem to be doing fairly well now. This, however, was no easy task. My path became clearer and my efforts more fruitful when I was able to take data analytics along with me into this fantasy world.

So, from one enthusiast to the other, here are my two cents on what can help you create the right team based on numbers, the power of observation, and a little bit of luck.

Consistency is Key

The first and foremost point to keep in mind is finding consistent performers, and there are some tools that can help you determine who is on top of their game that season. Here are some of my obvious picks:

Suryakumar Yadav: My top pick will always be Yadav of the Mumbai Indians. In 2020, he collected a total of 400 runs in just 16 matches. He was consistent in 2018 and 2019 as well by amassing 400+ runs. Yadav made a name for himself from the very beginning of his cricket career, which can be further proved by his consistent performance for Team India as well.

Bhuvaneshwar Kumar: Kumar is a sure-shot player, and, as Michael Vaughn pointed out, he is also possibly the “smartest player” his generation has to offer. Even in ODI and T20 matches in the pre-IPL era, he was always able to out-smart his opponents. From the 151 matches he has played in the IPL, he has maintained an average of 24, and one can always expect wickets from him. Economically too, Kumar is a man to watch out for, as he is in the top five.

David Warner: No matter how the Sunrisers Hyderabad perform, Warner remains consistent, and has managed to remain the highest run-scorer for his team.

KL Rahul: Rahul, with an average of 44, is a consistent player who is also great at stitching together partnerships.

Playing by Numbers

After some years of bungling up, I realized that variables such as economy, average, past performance, etc. can be best understood using data analysis. I have found the platforms Kaggle and Howstat to be useful resources.

Kaggle, which serves as a home for a lot of challenging data science competitions, has an IPL dataset that has ball-by-ball details of all the IPL matches till 2020. That’s 12 years of granular data you can aggregate to get the metrics of your choice. You can get the dataset at this link: https://www.kaggle.com/patrickb1912/ipl-complete-dataset-20082020.
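
If you’re comfortable writing a little code, aggregating that ball-by-ball data into your own metrics is straightforward. The sketch below computes one batsman’s runs per venue; the file name and column layout (batsman, venue, batsman_runs) are assumptions about a flattened export of the Kaggle files, so check them against the actual dataset before running anything.

using System;
using System.IO;
using System.Linq;

class IplStats
{
    static void Main()
    {
        // Assumes a flattened ball-by-ball CSV with header: batsman,venue,batsman_runs
        // (adjust the column names/positions to match the actual Kaggle files).
        var balls = File.ReadLines("ipl_deliveries.csv")
            .Skip(1)                          // skip the header row
            .Select(line => line.Split(','))
            .Select(cols => new
            {
                Batsman = cols[0],
                Venue = cols[1],
                Runs = int.Parse(cols[2])
            });

        // Total runs (and balls faced) per venue for one player, highest first.
        var byVenue = balls
            .Where(b => b.Batsman == "KL Rahul")
            .GroupBy(b => b.Venue)
            .Select(g => new { Venue = g.Key, Runs = g.Sum(b => b.Runs), Balls = g.Count() })
            .OrderByDescending(v => v.Runs);

        foreach (var v in byVenue)
            Console.WriteLine($"{v.Venue}: {v.Runs} runs off {v.Balls} balls");
    }
}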

Howstat, on the other hand, has the most frequently used metrics laid out beautifully. Thanks to the folks with whom I play cricket, I came to know about this wonderful website.

Let’s talk about venues specifically, which, as you may know by now, play a critical role in the IPL fantasy world too, and can directly impact the kind of players you pick. In the early days (pre-2010), when Dada would walk onto the pitch at Eden Gardens, the crowd would roar because it was given that he would put up a good show on his home turf. But why take the risk and rely on gut and cheers where numbers can lead you to assured results? Especially when the stakes are so high and the competition so fierce.

Here is where I would look to data analysis to help me make a more informed decision. For example, if you refer to KL Rahul’s scores on Howstat, you can see that despite having played the most matches at Chinnaswamy Stadium (15 in total), Rahul’s average at the IS Bindra Stadium is much better. His average is 49.78 at IS Bindra, while at Chinnaswamy Stadium, it is 38.22.

(Pro-tip: One player who comes to mind not just for his batting skills but also for his ability to perform well across pitches is Shikhar Dhawan. I would definitely include him in my team. He also secures tons of catches which adds to the points tally).

Now some of you may not have the time to sort through the many IPL datasets available on a platform such as Kaggle, which is understandable as even the best of us can be intimidated by numbers. One tip I have for you folk is to merely look at what the numbers point to on your app of choice. By looking at the percentage of people choosing a particular player on the Dream11 app, for example, you can understand which players are on top of their game that season.

This is best determined somewhere around the middle of the season, after around five-six matches, as that is when you will know who is at his peak and whom you can skip from your team.

The Non-Data Way

If you are struggling to make your way through all the numbers, I have some non-statistical tips too, which I learned to include in my game only after my many trials and tribulations.

1. It’s human nature to compare ourselves to others – you know how it goes, the grass is always greener and all that jazz. This leads to mimicry, and while at times it helps to follow in the footsteps of those you aspire to be (on the leaderboard in this case), unfortunately, in 20-20 fantasy, this doesn’t work. The best route is to work out your own strategy and make your own mistakes.

2. Make sure to use boosts and player-transfer options wisely in the initial stages. It’s only normal to want a good leaderboard score while starting out, but this could lead you to exhaust your transfer list very early on, leaving you with the same players through the season. This can also significantly bring down your score. Using sufficient transfers and boosts towards the business end of things (post 20 matches or so) can go a long way.

3. Using the uncapped player-transfer option is also worth exploring. This can reveal a whole range of players and talent from different countries, who haven’t played for Team India, but who are extremely skilled.

4. Coming to all-rounders – my tip would be to have three in your team. This is especially important while selecting your captain and vice-captain. For example, Chris Woakes is one all-rounder who has worked well for me this season before he left.

Use your gut, use your mind

What I can say for certain through this blog, is that nothing is certain in IPL fantasy cricket. Yet, while this may seem like the most unsatisfactory take-away, I can vouch for one thing – data analysis can definitely change your game for the better.

Of course, certain factors are out of our control. Injuries, fouls, poor weather, etc. are an inevitable part of any sport and could significantly change the outcome of a game. But if one dataset or one number-crunch can change how you view a match and give you better insight, wouldn’t that be something worth exploring? In Dhoni’s own words, ‘Bas dimaag laga ke khel’!

The post Achieving IPL Fantasy Glory with Data-Backed Insights and Strategies appeared first on Tiger Analytics.

]]>
Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins https://www.tigeranalytics.com/perspectives/blog/ml-powered-digital-twin-predictive-maintenance/ Thu, 24 Dec 2020 18:19:09 +0000 https://www.tigeranalytics.com/?p=4867 Tiger Analytics leverages ML-powered digital twins for predictive maintenance in manufacturing. By integrating sensor data and other inputs, we enable anomaly detection, forecasting, and operational insights. Our modular approach ensures scalability and self-sustainability, yielding cost-effective and efficient solutions.

The post Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins appeared first on Tiger Analytics.

]]>
Historically, manufacturing equipment maintenance has been done during scheduled service downtime. This involves periodically stopping production for carrying out routine inspections, maintenance, and repairs. Unexpected equipment breakdowns disrupt the production schedule; require expensive part replacements, and delay the resumption of operations due to long procurement lead times.

Sensors that measure and record operational parameters (temperature, pressure, vibration, RPM, etc.) have been affixed on machinery at manufacturing plants for several years. Traditionally, the data generated by these sensors was compiled, cleaned, and analyzed manually to determine failure rates and create maintenance schedules. But every equipment downtime for maintenance, whether planned or unplanned, is a source of lost revenue and increased cost. The manual process was time-consuming, tedious, and hard to handle as the volume of data rose.

The ability to predict the likelihood of a breakdown can help manufacturers take pre-emptive action to minimize downtime, keep production on track, and control maintenance spending. Recognizing this, companies are increasingly building both reactive and predictive computer-based models based on sensor data. The challenge these models face is the lack of a standard framework for creating and selecting the right one. Model effectiveness largely depends on the skill of the data scientist. Each model must be built separately; model selection is constrained by time and resources, and models must be updated regularly with fresh data to sustain their predictive value.

As more equipment types come under the analytical ambit, this approach becomes prohibitively expensive. Further, the sensor data is not always leveraged to its full potential to detect anomalies or provide early warnings about impending breakdowns.

In the last decade, the Industrial Internet of Things (IIoT) has revolutionized predictive maintenance. Sensors record operational data in real-time and transmit it to a cloud database. This dataset feeds a digital twin, a computer-generated model that mirrors the physical operation of each machine. The concept of the digital twin has enabled manufacturing companies not only to plan maintenance but to get early warnings of the likelihood of a breakdown, pinpoint the cause, and run scenario analyses in which operational parameters can be varied at will to understand their impact on equipment performance.

Several eminent ‘brand’ products exist to create these digital twins, but the software is often challenging to customize, cannot always accommodate the specific needs of each and every manufacturing environment, and significantly increases the total cost of ownership.

ML-powered digital twins can address these issues when they are purpose-built to suit each company’s specific situation. They are affordable, scalable, self-sustaining, and, with the right user interface, are extremely useful in telling machine operators the exact condition of the equipment under their care. Before embarking on the journey of leveraging ML-powered digital twins, certain critical steps must be taken:

1. Creation of an inventory of the available equipment, associated sensors and data.

2. Analysis of the inventory in consultation with plant operations teams to identify the gaps. Typical issues may include missing or insufficient data from the sensors; machinery that lacks sensors; and sensors that do not correctly or regularly send data to the database.

3. Coordination between the manufacturing operations and analytics/technology teams to address some gaps: installing sensors if lacking (‘sensorization’); ensuring that sensor readings can be and are being sent to the cloud database; and developing contingency approaches for situations in which no data is generated (e.g., equipment idle time).

4. A second readiness assessment, followed by a data quality assessment, must be performed to ensure that a strong foundation of data exists for solution development.

This creates the basis for a cloud-based, ML-powered digital twin solution for predictive maintenance. To deliver the most value, such a solution should:

  • Use sensor data in combination with other data as necessary
  • Perform root cause analyses of past breakdowns to inform predictions and risk assessments
  • Alert operators of operational anomalies
  • Provide early warnings of impending failures
  • Generate forecasts of the likely operational situation
  • Be demonstrably effective to encourage its adoption and extensive utilization
  • Be simple for operators to use, navigate and understand
  • Be flexible to fit the specific needs of the machines being managed

predictive maintenance cycle

When model-building begins, the first step is to account for the input data frequency. As sensors take readings at short intervals, timestamps must be regularized and resampling performed for all connected parameters where required. At this stage, data with very low variance or too few observations may be excluded. Model datasets containing sensor readings (the predictors) and event data such as failures and stoppages (the outcomes) are then created for each machine using both dependent and independent variable formats.

To select the right model for anomaly detection, multiple models are tested and scored on the full data set and validated against history. To generate a short-term forecast, gaps related to machine testing or idle time must be accounted for, and a range of models evaluated to determine which one performs best.

Tiger Analytics used a similar approach when building these predictive maintenance systems for an Indian multinational steel manufacturer. Here, we found that regression was the best approach to flag anomalies. For forecasting, the accuracy of Random Forest models was higher compared to ARIMA, ARIMAX, and exponential smoothing.
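
To make the anomaly-flagging idea concrete, here is a deliberately simplified sketch that uses a rolling-baseline threshold rather than the regression and Random Forest models described above. The window size and threshold are arbitrary placeholders; in practice they would be tuned per sensor and validated against historical breakdowns.

using System;
using System.Collections.Generic;
using System.Linq;

public static class AnomalyDetector
{
    // Flags readings that deviate more than `threshold` standard deviations from a
    // rolling baseline built over the previous `window` readings. This is a stand-in
    // for the model-based scoring described in the article, not the actual approach.
    public static IEnumerable<(int Index, double Value)> FlagAnomalies(
        IReadOnlyList<double> readings, int window = 60, double threshold = 3.0)
    {
        for (int i = window; i < readings.Count; i++)
        {
            var baseline = readings.Skip(i - window).Take(window).ToList();
            double mean = baseline.Average();
            double std = Math.Sqrt(baseline.Sum(x => (x - mean) * (x - mean)) / window);

            if (std > 0 && Math.Abs(readings[i] - mean) > threshold * std)
                yield return (i, readings[i]); // candidate anomaly to surface to the operator
        }
    }
}

Calling FlagAnomalies on a vibration or temperature channel would return the indices of readings that break the recent pattern, which the digital twin can then raise as alerts alongside its model-based forecasts.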

predictive maintenance analysis flow

Using a modular paradigm to build an ML-powered digital twin makes it straightforward to implement and deploy. It does not require frequent manual recalibration to be self-sustaining, and it is scalable, so it can be implemented across a wide range of equipment with minimal additional effort and time.

Careful execution of the preparatory actions is as important as strong model-building to the success of this approach and its long-term viability. To address the challenge of low-cost, high-efficiency predictive maintenance in the manufacturing sector, employ this sustainable solution: a combination of technology, business intelligence, data science, user-centric design, and the operational expertise of the manufacturing employees.

This article was first published in Analytics India Magazine.

The post Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins appeared first on Tiger Analytics.

]]>