Snowpark Archives - Tiger Analytics

Navigating the Digital Seas: How Snowflake’s External Access Integration Streamlines Maritime Data Management https://www.tigeranalytics.com/perspectives/blog/navigating-the-digital-seas-how-snowflakes-external-access-integration-streamlines-maritime-data-management/ Fri, 24 Jan 2025 13:10:32 +0000 https://www.tigeranalytics.com/?post_type=blog&p=24209 The maritime industry is increasingly adopting digital transformation to manage vast amounts of data from ships, sensors, weather, and third-party APIs. Snowflake’s External Access Integration simplifies this process by allowing seamless integration of real-time data without duplication. Read on to learn how this feature works in practice and how it supports better, data-driven outcomes in the maritime sector.

As the maritime industry navigates tremendous volumes of data, the call for accelerated digitalization is stronger than ever. The maritime sector is a vast and intricate ecosystem where data flows continuously across interconnected sectors—from vessel management and maintenance to fuel optimization and emissions control. As the United Nations Conference on Trade and Development highlighted in its 2024 report, digital transformation through technologies like blockchain, artificial intelligence, and automation is crucial for improving port operations. Ports that have embraced these innovations report reduced waiting times, enhanced cargo tracking, and greater efficiency in transshipment processes.

In this data-intensive environment, operational data from ship-installed software is just the beginning. Third-party sources such as AIS data, weather information, and other cloud applications play a vital role in many maritime use cases. Traditionally, integrating this diverse data—often accessed via REST APIs—required external platforms like AWS Lambda or Databricks.

With Snowflake’s introduction of the External Access Integration feature, maritime organizations can now consolidate API data integration and data engineering workflows within a single, powerful platform. This breakthrough not only simplifies operations but also improves flexibility and efficiency.

Let’s discuss a use case

Suppose we need to retrieve crew rest and work hours data from a third-party regulatory service to generate near real-time, period-specific compliance reports for all vessels managed by a ship manager. These details are made available to the business through REST APIs.

Landscape Dissection and Data Enablement

Let’s assume Snowflake is the chosen cloud data warehouse platform, with Azure serving as the primary solution for data lake requirements. Operational data for vessels from various legacy systems and other sources is integrated into Snowflake. Data pipelines and models are then built on this integrated data to meet business needs. The operational data is ingested into Snowflake through a combination of Snowflake’s native data loading options and the replication tool Fivetran.

Challenges Explained

Outbound REST API calls must be made to retrieve crew rest and work hours data. The semi-structured data from the API response will need to undergo several transformations before it can be integrated with the existing vessel operational data in Snowflake. Additionally, the solution must support the near real-time requirements of the compliance report. The new pipeline should seamlessly align with the current data pipelines for ingestion and transformation, ensuring no disruptions to existing processes.

We now explore Snowflake’s external access integration to address these challenges.

What is Snowflake’s External Access Integration?

Snowflake’s External Access Integration empowers businesses to seamlessly integrate data from diverse external sources and networks, helping them bridge data gaps and providing a holistic view for better decisions. The feature gives users the flexibility to read external data and integrate only what is necessary for the use case, while the majority of the data remains at the source. Key benefits of this feature include:

  • Enabling real-time access to complex third-party data providers
  • Eliminating data duplication
  • Enriching data through selective integration of only what benefits your use case
  • Enhancing data-driven decision-making

Leveraging Snowflake’s External Access Integration: A Step-by-Step Guide

Here is a complete walkthrough of the procedures to solve our use case:

Step 1: Creating Network Rule

  • Snowflake lets accounts selectively and securely reach external databases or services via network rules. For egress rules, this enhances security by restricting the external hosts the account is allowed to connect to.
  • The CREATE NETWORK RULE command lets us list the API hosts that the Snowflake account should be able to connect to.
CREATE [OR REPLACE] NETWORK RULE <nw_rule_name>
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('<api_host>')

Step 2: Creating Secret

  • Securely store the credentials used to authenticate to the API as secrets in Snowflake.
  • The CREATE SECRET command stores credentials such as a username and password, which are used to authenticate to the API added to the network rule in Step 1.
Basic Authentication
CREATE [ OR REPLACE ] SECRET <secret_name>
TYPE = PASSWORD
USERNAME = '<username>'
PASSWORD = '<password>'

Step 3: Creating External Access Integration

  • Specify the network rule and secrets used to connect to the APIs via an external access integration.
  • The CREATE EXTERNAL ACCESS INTEGRATION command bundles the allowed network rules and secrets so they can be used securely in UDFs or stored procedures.
CREATE [ OR REPLACE ] EXTERNAL ACCESS INTEGRATION <ext_integration_name>
ALLOWED_NETWORK_RULES = (<nw_rule_name>)
ALLOWED_AUTHENTICATION_SECRETS = (<secret_name>)
ENABLED = TRUE

Step 4: External Call

There are multiple ways to call external APIs – UDFs, stored procedures, or direct calls from Snowflake Notebooks (a preview feature at the time of writing). Let’s explore Snowflake Notebooks to make external calls via Python; a UDF-based alternative is sketched after the list below. Snowflake Notebooks offer an interactive environment for writing your logic in SQL or Python.

  • To make API calls from a particular notebook, enable the external access integration created in Step 3 for that notebook. This can be done from the ‘Notebook settings’ option available for Snowflake Notebooks.
  • After importing the required libraries, call the required APIs and save the response object.
  • Leverage the Snowflake Snowpark framework to operate on the data frames and save your results to Snowflake tables.
  • Use Snowflake’s native functions to flatten and format the semi-structured data that is typically received as a response from API calls.
  • The data retrieved via the API can then be combined with the operational or modeled data in Snowflake.
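
For reference, here is a minimal sketch of the UDF-based alternative mentioned above. It binds the integration and secret from Steps 1–3 to a Python UDF; the function name, placeholders, and the crew-hours endpoint are illustrative assumptions, not part of the original solution.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Illustrative only: a Python UDF that uses the external access integration and
# secret created in Steps 1-3 to call the third-party crew rest/work hours API.
session.sql("""
CREATE OR REPLACE FUNCTION GET_CREW_HOURS(vessel_id STRING)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'fetch'
EXTERNAL_ACCESS_INTEGRATIONS = (<ext_integration_name>)
SECRETS = ('cred' = <secret_name>)
PACKAGES = ('requests')
AS
$$
import _snowflake
import requests

def fetch(vessel_id):
    # Retrieve the username/password stored in the Snowflake secret bound as 'cred'.
    creds = _snowflake.get_username_password('cred')
    resp = requests.get(
        'https://<api_host>/crew-hours',   # hypothetical endpoint
        params={'vessel': vessel_id},
        auth=(creds.username, creds.password),
    )
    return resp.json()
$$
""").collect()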

Configuration: Creating a network rule and external access integration.

create OR replace network RULE NW_RULE_PUBLIC_API
mode = egress
type = host_port
value_list = ('geocoding-api.open-meteo.com')

create or replace external access integration EAI_PUBLIC_API
allowed_network_rules = (NW_RULE_PUBLIC_API)
enabled = true

GET API Request: making a GET request to a public REST API

import requests
import pandas as pd

def get_data_from_marine_api():
    # Call the public REST API allowed by the network rule above.
    url = 'https://geocoding-api.open-meteo.com/v1/search?name=Singapore&count=10&language=en&format=json'
    headers = {"content-type": "application/json"}
    response = requests.get(url, headers=headers)
    return response

response = get_data_from_marine_api()
data = response.json()
# Normalize the JSON payload into a flat pandas DataFrame.
data_frame = pd.json_normalize(data)

Using Snowpark: To save the RAW response to the Landing Zone table.

from snowflake.snowpark.context import get_active_session

session = get_active_session()
# Convert the pandas DataFrame into a Snowpark DataFrame and persist the raw response.
df1 = session.create_dataframe(data_frame)
df1.write.mode("overwrite").save_as_table("RAW_GEO_LOCATIONS")

Using Snowpark: To flatten the JSON for further transformations and combine it with operational data for downstream business rules and logic. This notebook can be orchestrated in Snowflake to synchronize with the existing data pipelines; a task-based sketch follows the code below.

from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, table_function

session = get_active_session()
flatten_function = table_function("flatten")

# Flatten the RESULTS array returned by the API and drop the flatten bookkeeping columns.
geo_locations_raw = session.table("RAW_GEO_LOCATIONS")
geo_locations_tr = geo_locations_raw.join_table_function(
    flatten_function(geo_locations_raw["RESULTS"])
).drop(["SEQ", "PATH", "RESULTS", "THIS", "GENERATIONTIME_MS"])

# Project the attributes needed downstream from the flattened VALUE column.
geo_locations_trf = geo_locations_tr.select(
    col("index").alias("index"),
    col("VALUE")["country"].alias("country"),
    col("VALUE")["country_code"].alias("country_code"),
    col("VALUE")["longitude"].alias("long"),
    col("VALUE")["latitude"].alias("lat"),
    col("VALUE")["name"].alias("name"),
    col("VALUE")["population"].alias("population"),
    col("VALUE")["timezone"].alias("timezone"),
    col("VALUE")["elevation"].alias("elevation"),
)

geo_locations_trf.write.mode("overwrite").save_as_table("TR_GEO_LOCATIONS")
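
As mentioned above, the notebook can be orchestrated alongside existing pipelines. Below is a rough sketch of scheduling it with a Snowflake Task; the task name, warehouse, notebook name, and cadence are assumptions for illustration.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Illustrative: schedule the notebook that performs the API call and transformations.
session.sql("""
CREATE OR REPLACE TASK TSK_REFRESH_GEO_LOCATIONS
  WAREHOUSE = TRANSFORM_WH                      -- hypothetical warehouse
  SCHEDULE = 'USING CRON 0 * * * * UTC'         -- hourly refresh (example cadence)
AS
  EXECUTE NOTEBOOK MY_DB.MY_SCHEMA."GEO_LOCATIONS_NOTEBOOK"()
""").collect()

# Tasks are created suspended; resume the task to start the schedule.
session.sql("ALTER TASK TSK_REFRESH_GEO_LOCATIONS RESUME").collect()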

The Snowflake External Access Integration advantage

  • A native Snowflake feature, so there is no need to move data from one environment to another.
  • Can be integrated into existing Snowflake data pipelines quickly, which keeps maintenance simple.
  • Snowflake’s Snowpark features and native functions can be used for any data transformations.
  • Snowflake’s unified compute environment lowers cost and improves the efficiency of data pipelines by reducing latency.
  • Users can call not only REST APIs via Snowflake external access integration but also web services defined by SOAP protocols.

Below is sample code for calling SOAP-based services:

import requests

def get_data_from_web_service():
    url = 'https://www.w3schools.com/xml/tempconvert.asmx'
    headers = {"content-type": "application/soap+xml"}
    xml = """
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
    <soap12:Body>
    <CelsiusToFahrenheit xmlns="https://www.w3schools.com/xml/">
        <Celsius>20</Celsius>
    </CelsiusToFahrenheit>
    </soap12:Body>
</soap12:Envelope>"""
    response = requests.post(url, headers=headers, data=xml)
    return response

response = get_data_from_web_service()
print(response.content)

Summary

The maritime industry, like many others, is embracing digital transformation, driven by the increasing volume and variety of data from complex systems, sensors, agencies, and regulatory bodies. This shift opens new opportunities for leveraging data from diverse sources to drive advanced analytics and machine learning. Snowflake provides a robust platform to support these efforts, offering efficient integration capabilities and external access features that make it easy to handle data from REST APIs. Its flexibility and scalability make Snowflake a valuable tool in helping the maritime industry harness the power of data for improved decision-making and operational efficiency.

Building Trusted Data: A Comprehensive Guide to Tiger Analytics’ Snowflake Native Data Quality Framework https://www.tigeranalytics.com/perspectives/blog/building-trusted-data-a-comprehensive-guide-to-tiger-analytics-snowflake-native-data-quality-framework/ Fri, 24 Jan 2025 13:06:13 +0000 https://www.tigeranalytics.com/?post_type=blog&p=24215 Challenges in data quality are increasingly hindering organizations, with issues like poor integration, operational inefficiencies, and lost revenue opportunities. A 2024 report reveals that 67% of professionals don’t fully trust their data for decision-making. To tackle these problems, Tiger Analytics developed a Snowflake native Data Quality Framework, combining Snowpark, Great Expectations, and Streamlit. Explore how the framework ensures scalable, high-quality data for informed decision-making.

A 2024 report on data integrity trends and insights found that 50% of the 550 leading data and analytics professionals surveyed believed data quality is the number one issue impacting their organization’s data integration projects. And that’s not all. Poor data quality was also negatively affecting other initiatives meant to improve data integrity, with 67% saying they don’t completely trust the data used for decision-making. As expected, data quality is projected to be a top priority investment for 2025.

Trusted, high-quality data is essential to make informed decisions, deliver exceptional customer experiences, and stay competitive. However, maintaining quality is not quite so simple, especially as data volume grows. Data arrives from diverse sources, is processed through multiple systems, and serves a wide range of stakeholders, increasing the risk of errors and inconsistencies. Poor data quality can lead to significant challenges, including:

  • Operational Inefficiencies: Incorrect or incomplete data can disrupt workflows and increase costs.
  • Lost Revenue Opportunities: Decisions based on inaccurate data can result in missed business opportunities.
  • Compliance Risks: Regulatory requirements demand accurate and reliable data; failure to comply can result in penalties.
  • Eroded Trust: Poor data quality undermines confidence in data-driven insights, impacting decision-making and stakeholder trust.

Manual approaches to data quality are no longer sustainable in modern data environments. Organizations need a solution that operates at scale without compromising performance, integrates seamlessly into existing workflows and platforms, and provides actionable insights for continuous improvement.

This is where Tiger Analytics’ Snowflake Native Data Quality Framework comes into play, leveraging Snowflake’s unique capabilities to address these challenges effectively.

Tiger Analytics’ Snowflake Native Data Quality Framework – An Automated and Scalable Solution

At Tiger Analytics, we created a custom solution leveraging Snowpark, Great Expectations (GE), Snowflake Data Metric Functions, and Streamlit to redefine data quality processes. By designing this framework as Snowflake-native, we capitalize on the platform’s capabilities for seamless integration, scalability, and performance.

Figure: Snowflake Native Data Quality Framework

Snowflake’s native features offer significant advantages when building a Data Quality (DQ) framework, addressing the evolving needs of data management and governance. These built-in tools streamline processes, ensuring efficient monitoring, validation, and enhancement of data quality throughout the entire data lifecycle:

  • Efficient Processing with Snowpark:
    Snowpark lets users run complex validations and transformations directly within Snowflake. Its ability to execute Python, Java, or Scala workloads ensures that data remains in place, eliminating unnecessary movement and boosting performance.
  • Flexible and Predefined DQ Checks:
    The inclusion of Great Expectations and Snowflake Data Metric Functions enables a hybrid approach, combining open-source flexibility with Snowflake-native precision. This ensures that our framework can cater to both standard and custom business requirements.
  • Streamlined Front-End with Streamlit:
    Streamlit provides an interactive interface for configuring rules, schedules, and monitoring results, making it accessible to users of all skill levels.
  • Cost and Latency Benefits:
    By eliminating the need for external tools, containers, or additional compute resources, our framework minimizes latency and reduces costs. Every process is optimized to leverage Snowflake’s compute clusters for maximum efficiency.
  • Integration and Automation:
    Snowflake’s task scheduling, streams, and pipelines ensure seamless integration into existing workflows. This makes monitoring and rule execution effortless and highly automated.

Tiger Analytics’ Snowflake Native Data Quality Framework leverages Snowflake’s ecosystem to provide a scalable and reliable data quality solution that can adapt to the changing needs of modern businesses.

Breaking Down the Tiger Analytics’ Snowflake Native Data Quality Framework

  1. Streamlit App: A Unified Interface for Data Quality

    Serves as a centralized front-end, integrating multiple components of the data quality framework. It allows users to configure rules and provides access to the profiler, recommendation engine, scheduling, and monitoring functionalities – all within one cohesive interface.

    This unified approach simplifies the management and execution of data quality processes, ensuring seamless operation and improved user experience

  2. Data Profiler

    Data profiler automatically inspects and analyzes datasets to identify anomalies, missing values, duplicates, and other data quality issues directly within Snowflake. It helps generate insights into the structure and health of the data, without requiring external tools.

    It also provides metrics on data distribution, uniqueness, and other characteristics to help identify potential data quality problems

  3. DQ Rules Recommendation Engine

    The DQ Rules Recommendation Engine analyzes data patterns and profiles to suggest potential data quality rules based on profiling results, metadata, or historical data behavior. These recommendations can be automatically generated and adjusted for more accurate rule creation.

  4. DQ Engine

    The DQ Engine is the core of Tiger Analytics’ Snowflake Native Data Quality Framework. Built using Snowpark, Great Expectations, and Snowflake Data Metric Functions, it ensures efficient and scalable data quality checks directly within the Snowflake ecosystem. Key functionalities include:

    • Automated Expectation Suites:
      The engine automatically generates Great Expectations expectation suites based on the configured rules, minimizing manual effort in setting up data quality checks.
    • Snowpark Compute Execution:
      These expectation suites are executed using Snowpark’s compute capabilities, ensuring performance and scalability for even the largest datasets.
    • Results Storage and Accessibility:
      All validation results are stored in Snowflake tables, making them readily available for monitoring, dashboards, and further processing.
    • On-Demand Metric Execution:
      In addition to GE rules, the engine can execute Snowflake Data Metric Functions on demand, providing flexibility for ad hoc or predefined data quality assessments (a minimal sketch follows this list). This combination of automation, scalability, and seamless integration ensures that the DQ Engine is adaptable to diverse data quality needs.
  5. Scheduling Engine

    The Scheduling Engine automates the execution of DQ rules at specified intervals, such as on-demand, daily, or in sync with other data pipelines. By leveraging Snowflake tasks & streams, it ensures real-time or scheduled rule execution within the Snowflake ecosystem, enabling continuous data quality monitoring.

  6. Alerts and Notifications

    The framework integrates with Slack and Outlook to send real-time alerts and notifications about DQ issues. When a threshold is breached or an issue is detected, stakeholders are notified immediately, enabling swift resolution.

  7. NLP-Based DQ Insights

    Leveraging Snowflake Cortex, the NLP-powered app enables users to query DQ results using natural language, giving non-technical users straightforward access to valuable data quality insights. Users can ask questions such as the ones below and receive clear, actionable answers directly from the data.

    • What are the current data quality issues?
    • Which rules are failing the most?
    • How has data quality improved over time?
  8. DQ Dashboards

    These dashboards offer a comprehensive view of DQ metrics, trends, and rule performance. Users can track data quality across datasets and monitor improvements over time. It also provides interactive visualizations to track data health. Drill-down capabilities provide in-depth insight into specific issues, allowing for more detailed analysis and understanding.

  9. Data Pipeline Integration

    The framework can be integrated with existing data pipelines, ensuring that DQ checks are part of the ETL/ELT process. These checks are automatically triggered as part of the data pipeline workflow, verifying data quality before downstream usage.
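
To make the on-demand metric execution mentioned in the DQ Engine section concrete, here is a minimal sketch that calls a Snowflake system Data Metric Function from Snowpark and records the result. The table, column, and DQ_RESULTS names are illustrative assumptions, not part of the actual framework.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Illustrative: run a system Data Metric Function on demand and store the result
# so it can feed the monitoring dashboards described above.
null_count = session.sql("""
    SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT CUSTOMER_EMAIL FROM SALES_DB.RAW.CUSTOMERS) AS NULL_EMAILS
""").collect()[0]['NULL_EMAILS']

session.sql(f"""
    INSERT INTO DQ_RESULTS (CHECK_NAME, CHECK_VALUE, CHECKED_AT)
    SELECT 'customers_null_emails', {null_count}, CURRENT_TIMESTAMP()
""").collect()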

How the Framework Adds Value

As organizations rely more on data to guide strategies, ensuring the accuracy, consistency, and integrity of that data becomes a top priority. Tiger Analytics’ Snowflake Native Data Quality Framework addresses this need by providing a comprehensive, end-to-end solution that integrates seamlessly into your existing Snowflake environment. With customizable features and actionable insights, it empowers teams to act quickly and efficiently. Here are the key benefits explained:

  • End-to-End Solution: Everything from profiling to monitoring is integrated in one place.
  • Customizable: Flexibly configure rules, thresholds, and schedules to meet your specific business requirements.
  • Real-Time DQ Enforcement: Maintain data quality throughout the entire data lifecycle with real-time checks.
  • Seamless Integration: Fully native to Snowflake, integrates easily with existing data pipelines and workflows.
  • Actionable Insights: Provide clear, actionable insights to help users take corrective actions quickly.
  • Scalability: Leverages Snowflake’s compute power, allowing for easy scaling as data volume grows.
  • Minimal Latency: Ensures efficient processing and reduced delays by executing DQ checks directly within Snowflake.
  • User-Friendly: Intuitive interface for both technical and non-technical users, enabling broad organizational adoption.
  • Proactive Monitoring: Identify data quality issues before they affect downstream processes.
  • Cost-Efficiency: Reduces the need for external tools, minimizing costs and eliminating data movement overhead.

Next Steps

While the framework offers a wide range of features to address data quality needs, we are continuously looking for opportunities to enhance its functionality. We at Tiger Analytics are exploring additional improvements that will further streamline processes and increase flexibility. Some of the enhancements we are currently working on include:

  • AI-Driven Recommendations: Use machine learning to improve and refine DQ rule suggestions.
  • Anomaly Detection: Leverage AI to detect unusual patterns and data quality issues that may not be captured by traditional rules.
  • Advanced Visualizations: Enhance dashboards with predictive analytics and deeper trend insights.
  • Expanded Integration: Explore broader support for hybrid cloud and multi-database environments.

A streamlined data quality framework redefines how organizations ensure and monitor data quality. By leveraging Snowflake’s capabilities and tools like Snowpark, our Snowflake Native Data Quality Framework simplifies complex processes and delivers measurable value.

Building Dynamic Data Pipelines with Snowpark: Our Framework to Drive Modern Data Transformation https://www.tigeranalytics.com/perspectives/blog/building-dynamic-data-pipelines-with-snowpark-our-framework-to-drive-modern-data-transformation/ Mon, 21 Oct 2024 14:30:41 +0000 https://www.tigeranalytics.com/?post_type=blog&p=23878 Learn about the challenges of traditional data transformation methods and how a dynamic approach using metadata configuration can help address these issues. By defining transformation rules and specifications, enterprises can create flexible pipelines that adapt to their evolving data processing needs, ultimately accelerating the process of extracting insights from data.

Why is Data Transformation important?

E owns a retail business, making daily sales through online and brick-and-mortar stores. Every day, they receive a huge wave of data in their systems: products viewed, orders placed, returns received, new customers, repeat customers, and more. Data ingestion pipelines help them ingest data from disparate sources. The ingested raw data pool is large and can generate valuable insights, but how do they harmonize it and turn it into actionable insights that boost the business?

That’s where Data transformation comes in.

Customers may have diverse data formats and reporting requirements, and that’s why data transformation forms a crucial aspect of any analytics product. If we examine traditional methods of manually building data transformation pipelines, we can see that these can be time-consuming and often delay the life cycle of deriving insights from data. One solution to this challenge is implementing dynamic transformation using metadata configuration. A flexible pipeline capable of adapting to data processing needs can be created by defining transformation rules and specifications.

Modernizing Data Platforms with Snowflake: Our Snowpark & Streamlit Solution

At Tiger Analytics, we created a custom solution with Snowpark and Streamlit to build dynamic data transformation pipelines based on a metadata-driven framework to accelerate data platform modernization into Snowflake. Once data is available in the raw layer, the current framework can be leveraged to build a data pipeline to transform and harmonize data in subsequent layers. The high-level architecture of this Snowpark framework is depicted below in Figure 1.


Figure 1: High-level architecture

Tiger’s Snowpark Data Transformation Framework Provides Five Key Functionalities:

  • Metadata Setup
    Once data ingestion is complete, users leverage this Snowpark-based framework to feed in the Source to Target Mapping (STTM) and additional driver (technical) metadata (incremental/full load, schedule timing, SCD types, etc.), which drives the entire process. The input details are parsed, validated against the Snowflake raw layer, and merged into metadata tables.

  • Dynamic Transformation Pipeline Generation
    After the metadata setup is complete, the Snowpark code reads the STTM metadata table and creates the data transformation SQL dynamically. Based on the additional driver metadata, it implements change data capture (CDC) using watermark/audit columns on top of the raw-layer data sets, truncate-and-load, and dynamic merge/insert SQL for SCD types. The pipeline is driven purely by user-provided metadata, and any changes are reflected in the pipeline when it is regenerated (a simplified sketch of this dynamic SQL generation follows this list). With this Snowpark framework, it takes just a few clicks to accommodate changes such as:

    • A new data source or logic to be included
    • Logic changes to an existing pipeline

  • Schedule and Execution
    Snowflake Tasks are used to orchestrate the data transformation pipelines and are created dynamically by the Snowpark framework. Users provide their preferred cadence as input, which is used to create tasks dynamically with a cron schedule. Once the tasks are created and resumed, they trigger automatically at the scheduled cadence. Because the framework is metadata-driven, any changes are accommodated in the pipeline quickly.

  • Alert & Notification
    A Snowpark-based email notification framework triggers Outlook emails and Slack notifications if any failures are observed during data transformation pipeline execution.

  • Detailed Logging
    This Snowpark framework captures detailed logging of each step for ease of debugging. It also reports pipeline execution status, which can be a key metric for downstream modules like Observability.
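
To illustrate the idea behind the dynamic pipeline generation described above, here is a simplified sketch that reads metadata with Snowpark and builds a MERGE statement at runtime. The META_STTM table, its columns, and the target and staging table names are assumptions for illustration, not the framework’s actual code.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Illustrative metadata: one row per target column with its source expression.
mapping = session.table("META_STTM").filter("TARGET_TABLE = 'DIM_CUSTOMER'").collect()

set_clause = ", ".join(f"t.{m['TARGET_COLUMN']} = s.{m['SOURCE_EXPRESSION']}" for m in mapping)
insert_cols = ", ".join(m['TARGET_COLUMN'] for m in mapping)
insert_vals = ", ".join(f"s.{m['SOURCE_EXPRESSION']}" for m in mapping)

# Build the MERGE (SCD type 1 style) from the metadata instead of hand-writing it.
merge_sql = f"""
MERGE INTO DIM_CUSTOMER t
USING STG_CUSTOMER s
  ON t.CUSTOMER_ID = s.CUSTOMER_ID
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})
"""

# The generated SQL runs on Snowflake compute; in the framework, a dynamically
# created Task would execute this step on the user-supplied cadence.
session.sql(merge_sql).collect()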

This metadata-driven Snowpark framework can be leveraged in any data transformation program in Snowflake to set up dynamic transformation pipelines. It helps accelerate the overall journey and implements business logic efficiently, with fewer code quality issues of the kind that often arise from human error.

Comparing our Snowpark Framework to Other Data Transformation Pipelines

What makes our Snowpark framework unique compared to other traditional data transformation pipelines?

  • Agility: Post finalization of the data model, this framework helps in faster implementation of transformation pipelines dynamically in Snowflake.
  • Flexibility: Allows users to make easy modifications in the logic to adapt to the specific needs of each customer without manually changing the underlying code.
  • Efficiency: Less effort is needed to implement changes; when the input metadata is changed, the pipeline is updated accordingly.
  • Auto alerts: An email is triggered based on Snowflake’s native email integration feature to alert users of any pipeline execution failures.
  • Multi-feature support: Provides flexibility to enable SCD-type implementations, change data capture, and e2e detailed orchestration leveraging Snowflake Tasks.
  • Pipeline Monitoring: Proper logs are maintained at each level so issues can be traced back easily; this has proven highly impactful in reducing the support team’s overall debugging time.
  • Scalability: Data volumes keep increasing. Dynamic transformation pipelines in cloud data warehouses like Snowflake can scale with the increasing demand.
  • Ease of Maintenance: Simple to maintain and update as metadata changes are made directly from the Streamlit app without altering the pipeline manually.

Final thoughts

At Tiger Analytics, we created a flexible metadata-driven solution to generate dynamic data transformation pipelines. With the help of the Snowpark framework, business teams can now rapidly convert their Source-to-Target Mapping (STTM) logic into executable pipelines, significantly reducing the time to market.

The ability to quickly adapt transformation rules and handle changes through a simple, metadata-driven process ensures that businesses can harmonize raw data from multiple silos without delays, delivering faster insights and value.

With its scalability, adaptability, and ease of maintenance, this framework allows organizations to manage complex transformations effectively. In short, dynamic data transformation ensures that the data is always optimized for decision-making, empowering businesses to stay agile and competitive.

How to Simplify Data Profiling and Management with Snowpark and Streamlit https://www.tigeranalytics.com/perspectives/blog/how-to-simplify-data-profiling-and-management-with-snowpark-and-streamlit/ Thu, 10 Oct 2024 12:41:48 +0000 https://www.tigeranalytics.com/?post_type=blog&p=23845 Learn why data quality is one of the most overlooked aspects of data management. While all models need good quality data to generate useful insights and patterns, data quality is especially important. In this blog, we explore how data profiling can help you understand your data quality. Discover how Tiger Analytics leverages Snowpark and Streamlit to simplify data profiling and management.

The accuracy of the data-to-insights journey is underpinned by one of the most foundational yet often overlooked aspects of data management – Data Quality. While all models need good-quality data to generate useful insights and patterns, data quality is especially important across industries like retail, healthcare, and finance. Inconsistent, missing, or duplicate data can impact critical operations, from customer segmentation to regulatory compliance, resulting in potential financial or reputational losses.

Let’s look at an example:

A large retail company relies on customer data from various sources, such as online orders, in-store purchases, and loyalty program interactions. Over time, inconsistencies and errors in the customer database, such as duplicate records, incorrect addresses, and missing contact details, impacted the company’s ability to deliver personalized marketing campaigns, segment customers accurately, and forecast demand.

Data Profiling Matters – Third-party or Native app? Understanding the options

Data profiling helps the organization understand the nature of the data to build the data models, and ensures data quality and consistency, enabling faster decision-making and more accurate insights.

  • Improves Data Accuracy: Identifies inconsistencies, errors, and missing values.
  • Supports Better Decision-Making: Ensures reliable data for predictive analytics.
  • Enhances Efficiency: Helps detect and remove redundant data, optimizing resources and storage.

For clients using Snowflake for data management purposes, traditional data profiling tools often require moving data outside of Snowflake, creating complexity, higher costs, and security risks.

  • Data Transfer Overhead: External tools may require data to be moved out of Snowflake, increasing latency and security risks.
  • Scalability Limitations: Third-party tools may struggle with large Snowflake datasets.
  • Cost and Performance: Increased egress costs and underutilization of Snowflake’s native capabilities.
  • Integration Complexity: Complex setup and potential incompatibility with Snowflake’s governance and security features.

At Tiger Analytics, our clients faced a similar problem statement. To address these issues, we developed a Snowflake Native App utilizing Snowpark and Streamlit to perform advanced data profiling and analysis within the Snowflake ecosystem. This solution leverages Snowflake’s virtual warehouses for scalable, serverless computational power, enabling efficient profiling without external infrastructure.

How Snowpark Makes Data Profiling Simple and Effective

Snowpark efficiently manages large datasets by chunking data into smaller pieces, ensuring smooth profiling tasks. We execute YData Profiler and custom Python functions directly within Snowflake, storing results like outlier detection and statistical analysis for historical tracking.

We also created stored procedures and UDFs with Snowpark to automate daily or incremental profiling jobs. The app tracks newly ingested data, using Snowflake’s Task Scheduler to run operations automatically. Additionally, profiling outputs integrate seamlessly into data pipelines, with alerts triggered when anomalies are detected, ensuring continuous data quality monitoring.
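
As a rough illustration of this pattern, the sketch below registers a Snowpark stored procedure that profiles a table with the YData Profiler and writes the HTML report to a stage. The procedure name, the stage, and the assumption that the ydata-profiling package is available in the account are illustrative; the actual app adds chunking, scheduling, and result tables on top of this.

import io
from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

@sproc(name='profile_table', packages=['snowflake-snowpark-python', 'ydata-profiling'], replace=True)
def profile_table(session: Session, table_name: str) -> str:
    from ydata_profiling import ProfileReport

    # Pull the table into pandas for profiling (the app chunks large tables).
    pdf = session.table(table_name).to_pandas()
    report = ProfileReport(pdf, title=f"Profile of {table_name}", minimal=True)
    html = report.to_html()

    # Write the HTML report to an internal stage so it can be shared later.
    session.file.put_stream(
        io.BytesIO(html.encode('utf-8')),
        f"@PROFILING_STAGE/{table_name}_profile.html",   # hypothetical stage
        auto_compress=False,
        overwrite=True,
    )
    return f"Report written for {table_name}"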

By keeping operations within Snowflake, Snowpark reduces data transfer, lowering latency and enhancing performance. Its native integration ensures efficient, secure, and scalable data profiling.

Let’s look at the key features of the app, built leveraging Snowpark’s capabilities.

Building a Native Data Profiling App in Snowflake – Lessons learnt:

1. Comprehensive Data Profiling

At the core of the app’s profiling capabilities are the YData Profiler or custom-built profilers – Python libraries, integrated using Snowpark. These libraries allow users to profile data directly within Snowflake by leveraging its built-in compute resources.

Key features include:

  • Column Summary Statistics: Quickly review important statistics for columns of all data types, such as string, number, and date, to understand the data at a glance.
  • Data Completeness Checks: Identify missing values and assess the completeness of your datasets to ensure no critical information is overlooked.
  • Data Consistency Checks: Detect anomalies or inconsistent data points to ensure that your data is uniform and accurate across the board.
  • Pattern Recognition and Value Distribution: Analyze data patterns and value distributions to identify trends or detect unusual values that might indicate data quality issues.
  • Overall Data Quality Checks: Review the health of your dataset by identifying potential outliers, duplicates, or incomplete data points.

2. Snowflake Compute Efficiency

The app runs entirely within Snowflake’s virtual warehouse environment. No external servers or machines are needed, as the app fully utilizes Snowflake’s built-in computing power. This reduces infrastructure complexity while ensuring top-tier performance, allowing users to profile and manage even large datasets efficiently.

3. Flexible Profiling Options

The app allows users to conduct profiling in two distinct ways—either by examining entire tables or by focusing on specific columns. This flexibility ensures that users can tailor the profiling process to their exact needs, from broad overviews to highly targeted analyses.

4. Full Data Management Capabilities

In addition to profiling, the app supports essential data management tasks. Users can insert, update, and delete records within Snowflake directly from the app, providing an all-in-one tool for both profiling and managing data.

5. Streamlit-Powered UI for Seamless Interaction

The app is built using Streamlit, which provides a clean, easy-to-use user interface. The UI allows users to interact with the app’s profiling and data management features without needing deep technical expertise. HTML-based reports generated by the app can be easily shared with stakeholders, offering clear and comprehensive data insights.

6. Ease in Generating and Sharing Profiling Reports

    Once the data profiling is complete, the app generates a pre-signed URL that allows users to save and share the profiling reports. Here’s how it works (a minimal sketch follows the list below):

  • Generating Pre-Signed URLs: The app creates a pre-signed URL to a file on a Snowflake stage using the stage name and relative file path. This URL provides access to the generated reports without requiring direct interaction with Snowflake’s internal storage.
  • Accessing Files: Users can access the files in the stage through several methods:
    • Navigate directly to the pre-signed URL in a web browser.
    • Retrieve the pre-signed URL within Snowsight by clicking on it in the results table.
    • Send the pre-signed URL in a request to the REST API for file support.
  • Handling External Stages: For files in external stages that reference Microsoft Azure cloud storage, GET_PRESIGNED_URL requires Azure Active Directory authentication; querying the function fails if the container is accessed using a shared access signature (SAS) token. Azure Active Directory authentication is needed to create a user delegation SAS token, utilizing a storage integration object that stores a generated service principal.
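
For reference, here is a minimal sketch of the pre-signed URL step; the stage and file names are illustrative assumptions.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Generate a pre-signed URL for a profiling report stored on an internal stage.
# The third argument is the link's expiration time in seconds (here, one hour).
row = session.sql("""
    SELECT GET_PRESIGNED_URL(@PROFILING_STAGE, 'CUSTOMERS_profile.html', 3600) AS REPORT_URL
""").collect()[0]

print(row['REPORT_URL'])  # share this link with stakeholders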

7. Different roles within an organization can utilize this app in various scenarios:

  • Data Analysts: Data analysts can use the app to profile datasets, identify inconsistencies, and understand data quality issues. They will analyze the patterns and relationships in the data and point out the necessary fixes to resolve any errors, such as missing values or outliers.
  • Data Stewards/Data Owners: After receiving insights from data analysts, data stewards or data owners can apply the suggested fixes to cleanse the data, ensuring it meets quality standards. They can make adjustments directly through the app by inserting, updating, or deleting records, ensuring the data is clean and accurate for downstream processes.

This collaborative approach between analysts and data stewards ensures that the data is high quality and reliable, supporting effective decision-making across the organization.

Final notes

Snowpark offers a novel approach to data profiling by bringing it into Snowflake’s native environment. This approach reduces complexity, enhances performance, and ensures security. Whether improving customer segmentation in retail, ensuring compliance in healthcare, or detecting fraud in finance, Snowflake Native Apps with Snowpark provides a timely solution for maintaining high data quality across industries.

For data engineers looking to address client pain points, this translates to:

  • Seamless Deployment: Easily deployable across teams or accounts, streamlining collaboration.
  • Dynamic UI: The Streamlit-powered UI provides an interactive dashboard, allowing users to profile data without extensive technical knowledge.
  • Flexibility: Supports profiling of both Snowflake tables and external files (e.g., CSV, JSON) in external stages like S3 or Azure Blob.

With upcoming features like AI-driven insights, anomaly detection, and hierarchical data modeling, Snowpark provides a powerful and flexible platform for maintaining data quality across industries, helping businesses make smarter decisions and drive better outcomes.

Solving Merchant Identity Extraction in Finance: Snowpark’s Data Engineering Solution https://www.tigeranalytics.com/perspectives/blog/solving-merchant-identity-extraction-in-finance-snowparks-data-engineering-solution/ Fri, 26 Jul 2024 10:07:10 +0000 https://www.tigeranalytics.com/?post_type=blog&p=22999 Learn how a fintech leader solved merchant identification challenges using Snowpark and local testing. This case study showcases Tiger Analytics' approach to complex data transformations, automated testing, and efficient development in financial data processing. Discover how these solutions enhanced fraud detection and revenue potential.

In the high-stakes world of financial technology, data is king. But what happens when that data becomes a labyrinth of inconsistencies? This was the challenge faced by a senior data engineer at a leading fintech company.

“Our merchant identification system is failing us. We’re losing millions in potential revenue and our fraud detection is compromised. We need a solution, fast.”

The issue was clear but daunting. Every day, their system processed millions of transactions, each tied to a merchant. But the merchant names attached to those transactions were far from consistent. For instance, a well-known retail chain might be listed under its official name, a common abbreviation, a location-specific identifier, or simply a generic category. This inconsistency was wreaking havoc across the business.

Initial Approach: Snowflake SQL Procedures

Initially, the data engineer and his team developed Snowflake SQL procedures to handle this complex data transformation. While these procedures worked, the team wanted to add automated testing pipelines and quickly ran into limitations. "We need more robust regression and automated testing capabilities. And we need to implement these tests without constantly connecting to a Snowflake account." This wasn't possible with traditional Snowflake SQL procedures, pushing them to seek external expertise.

Enter Tiger Analytics: A New Approach with Snowpark and Local Testing Framework

After understanding the challenges, the Tiger team proposed a solution: leveraging Snowpark for complex data transformations and introducing a local testing framework. This approach aimed to solve the merchant identity issue and improve the entire data pipeline process.

To meet these requirements, the team turned to Snowpark. Snowpark enabled them to perform complex data transformations and manipulations within Snowflake, leveraging the power of Snowflake’s computational engine. However, the most crucial part was the Snowpark Python Local Testing Framework. This framework allowed the team to develop and test their Snowpark DataFrames, stored procedures, and UDFs locally, fulfilling the need for regression testing and automated testing without connecting to a Snowflake account.

Key Benefits

  • Local Development: The team could develop and test their Snowpark Python code without a Snowflake account. This reduced the barrier to entry and sped up their iteration cycles.
  • Efficient Testing: By utilizing familiar testing frameworks like PyTest, the team integrated their tests seamlessly into existing development workflows.
  • Enhanced Productivity: The team quickly iterated on their code with local feedback, enabling rapid prototyping and troubleshooting before deploying to their Snowflake environment.

Overcoming Traditional Unit Testing Limitations

In the traditional sense of unit testing, Snowpark does not support a fully isolated environment independent of a Snowflake instance. Typically, unit tests would mock a database object, but Snowpark lacks a local context for such mocks. Even using the create_dataframe method requires Snowflake connectivity.

The Solution with Local Testing Framework

Despite these limitations, the Snowpark Python Local Testing Framework enabled the team to create and manipulate DataFrames, stored procedures, and UDFs locally, which was pivotal for the use case. Here’s how the Tiger team did it:

Setting Up the Environment

First, set up a Python environment:

pip install "snowflake-snowpark-python[localtest]"
pip install pytest

Next, create a local testing session:

from snowflake.snowpark import Session
session = Session.builder.config('local_testing', True).create()

Creating Local DataFrames

The Tiger team created DataFrames from local data sources and operated on them:

table = 'example'
session.create_dataframe([[1, 2], [3, 4]], ['a', 'b']).write.save_as_table(table)

Operating on these DataFrames was straightforward:

from snowflake.snowpark.functions import col

df = session.create_dataframe([[1, 2], [3, 4]], ['a', 'b'])
# Filter on 'b', then project 'a'.
res = df.where(col('b') > 2).select(col('a')).collect()
print(res)

Creating UDFs and Stored Procedures

The framework allowed the team to create and call UDFs and stored procedures locally:

from snowflake.snowpark.functions import udf, sproc, call_udf, col
from snowflake.snowpark.types import IntegerType, StringType

@udf(name='example_udf', return_type=IntegerType(), input_types=[IntegerType(), IntegerType()])
def example_udf(a, b):
    return a + b

@sproc(name='example_proc', return_type=IntegerType(), input_types=[StringType()])
def example_proc(session, table_name):
    return session.table(table_name)\
        .with_column('c', call_udf('example_udf', col('a'), col('b')))\
        .count()

# Call the stored procedure by name
output = session.call('example_proc', table)

Using PyTest for Efficient Testing

The team leveraged PyTest for efficient unit and integration testing:

PyTest Fixture

In the conftest.py file, the team created a PyTest fixture for the Session object:

import pytest
from snowflake.snowpark.session import Session

def pytest_addoption(parser):
    parser.addoption("--snowflake-session", action="store", default="live")

@pytest.fixture(scope='module')
def session(request) -> Session:
    if request.config.getoption('--snowflake-session') == 'local':
        return Session.builder.configs({'local_testing': True}).create()
    else:
        snowflake_credentials = {} # Specify Snowflake account credentials here
        return Session.builder.configs(snowflake_credentials).create()

Using the Fixture in Test Cases

from project.sproc import my_stored_proc

def test_create_fact_tables(session):
    expected_output = ...
    actual_output = my_stored_proc(session)
    assert expected_output == actual_output

Running Tests

To run the test suite locally:

pytest --snowflake-session local

To run the test suite against your Snowflake account:

pytest

Addressing Unsupported Functions

Some functions were not supported in the local testing framework. For these, the team used patch functions and MagicMock:

Patch Functions

For unsupported functions like upper(), the team used patch functions:

from unittest.mock import patch
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

session = Session.builder.config('local_testing', True).create()

@patch('snowflake.snowpark.functions.upper')
def test_to_uppercase(mock_upper):
    # Stand-in for upper(): append a marker via concat so each row still gets a value.
    mock_upper.side_effect = lambda c: F.concat(c, F.lit('_MOCKED'))
    df = session.create_dataframe([('Alice',), ('Bob',)], ['name'])
    # Reference upper() through the module so the patch takes effect.
    result = df.select(F.upper(df['name']))
    collected = [row[0] for row in result.collect()]
    assert collected == ['Alice_MOCKED', 'Bob_MOCKED']
  
MagicMock

For more complex behaviors like explode(), the team used MagicMock:

from unittest.mock import MagicMock, patch
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

session = Session.builder.config('local_testing', True).create()

def test_explode_df():
    # Stand-in for explode(), treated here as unavailable in local testing:
    # return a constant literal column instead of exploding the array.
    mock_explode = MagicMock(side_effect=lambda c: F.lit('MOCKED_EXPLODE'))
    df = session.create_dataframe([([1, 2, 3],), ([4, 5, 6],)], ['data'])
    with patch('snowflake.snowpark.functions.explode', mock_explode):
        result = df.select(F.explode(df['data']))
        collected = [row[0] for row in result.collect()]
    assert collected == ['MOCKED_EXPLODE', 'MOCKED_EXPLODE']

test_explode_df()

Scheduling Procedures Limitations

While implementing these solutions, the Tiger team found that the procedures could not be scheduled with serverless tasks, so they attached the tasks to a warehouse instead, creating a Snowpark-optimized warehouse for the workload (a sketch follows the list below). The team noted that serverless tasks cannot invoke certain object types and functions, specifically:

  • UDFs (user-defined functions) that contain Java or Python code.
  • Stored procedures written in Scala (using Snowpark), or stored procedures that call UDFs containing Java or Python code.
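
As a rough sketch of this workaround (the warehouse, task, schedule, and procedure names are assumptions for illustration), a warehouse-attached task can invoke the stored procedure on a cadence:

from snowflake.snowpark import Session

# Assumes an existing connection configuration for the Snowflake account.
session = Session.builder.getOrCreate()

# Snowpark-optimized warehouse for memory-heavy Python workloads.
session.sql("""
CREATE WAREHOUSE IF NOT EXISTS SNOWPARK_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
""").collect()

# A warehouse-attached task (not serverless), so it can call a Python stored procedure.
session.sql("""
CREATE OR REPLACE TASK TSK_MERCHANT_EXTRACTION
  WAREHOUSE = SNOWPARK_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'      -- daily at 02:00 UTC (example cadence)
AS
  CALL my_stored_proc()
""").collect()

# Tasks are created suspended; resume to start the schedule.
session.sql("ALTER TASK TSK_MERCHANT_EXTRACTION RESUME").collect()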

Turning Data Challenges into Business Insights

The journey from the initial challenge of extracting merchant identities from inconsistent transaction data to a streamlined, efficient process demonstrates the power of advanced data solutions. The Tiger team leveraged Snowpark and its Python Local Testing Framework, not only solving the immediate problem but also enhancing their overall approach to data pipeline development and testing. The combination of regex-based, narration pattern-based, and ML-based methods enabled them to tackle the complexity of unstructured bank statement data effectively.

This project’s success extends beyond merchant identification, showcasing how the right tools and methodologies can transform raw data into meaningful insights. For data engineers facing similar challenges, this case study highlights how Snowpark and local testing frameworks can significantly improve data application development, leading to more efficient, accurate, and impactful solutions.
