Data Integration Archives - Tiger Analytics

Navigating the Digital Seas: How Snowflake’s External Access Integration Streamlines Maritime Data Management
https://www.tigeranalytics.com/perspectives/blog/navigating-the-digital-seas-how-snowflakes-external-access-integration-streamlines-maritime-data-management/
Fri, 24 Jan 2025 13:10:32 +0000

The maritime industry is increasingly adopting digital transformation to manage vast amounts of data from ships, sensors, weather feeds, and third-party APIs. Snowflake’s External Access Integration simplifies this process by allowing seamless integration of real-time data without duplication. Read on to learn how this feature works in practice and how it supports better, data-driven outcomes in the maritime sector.

As the maritime industry navigates tremendous volumes of data, the call for accelerated digitalization is stronger than ever. The maritime sector is a vast and intricate ecosystem where data flows continuously across interconnected sectors—from vessel management and maintenance to fuel optimization and emissions control. As the United Nations Conference on Trade and Development highlighted in its 2024 report, digital transformation through technologies like blockchain, artificial intelligence, and automation is crucial for improving port operations. Ports that have embraced these innovations report reduced waiting times, enhanced cargo tracking, and greater efficiency in transshipment processes.

In this data-intensive environment, operational data from ship-installed software is just the beginning. Third-party sources such as AIS data, weather information, and other cloud applications play a vital role in many maritime use cases. Traditionally, integrating this diverse data—often accessed via REST APIs—required external platforms like AWS Lambda or Databricks.

With Snowflake’s introduction of the External Access Integration feature, maritime organizations can now consolidate API data integration and data engineering workflows within a single, powerful platform. This breakthrough not only simplifies operations but also improves flexibility and efficiency.

Let’s discuss a use case

Suppose we need to retrieve crew rest and work hours data from a third-party regulatory service to generate near real-time, period-specific compliance reports for all vessels managed by a ship manager. These details are made available to the business through REST APIs.

Landscape Dissection and Data Enablement

Let’s assume Snowflake is the chosen cloud data warehouse platform, with Azure serving as the primary solution for data lake requirements. Operational data for vessels from various legacy systems and other sources is integrated into Snowflake. Data pipelines and models are then built on this integrated data to meet business needs. The operational data is ingested into Snowflake through a combination of Snowflake’s native data loading options and the replication tool Fivetran.

Challenges Explained

Outbound REST API calls must be made to retrieve crew rest and work hours data. The semi-structured data from the API response will need to undergo several transformations before it can be integrated with the existing vessel operational data in Snowflake. Additionally, the solution must support the near real-time requirements of the compliance report. The new pipeline should seamlessly align with the current data pipelines for ingestion and transformation, ensuring no disruptions to existing processes.

We now explore Snowflake’s external access integration to address these challenges.

What is Snowflake’s External Access Integration?

Snowflake’s External Access Integration empowers businesses to seamlessly integrate data from diverse external sources and networks, helping them bridge data gaps and gain a holistic view for better decisions. The feature gives users the flexibility to read external data and integrate only what is necessary for the use case, while the majority of the data remains at the source. Key benefits of this feature include:

  • Enabling real-time access to complex third-party data providers
  • Eliminating data duplication
  • Enriching data through selective integration of only what the use case needs
  • Enhancing data-driven decision-making

Leveraging Snowflake’s External Access Integration: A Step-by-Step Guide

Here is a complete walkthrough of the procedures to solve our use case:

Step 1: Creating Network Rule

  • Snowflake enables its accounts to selectively and securely access external databases or services via network rules. This enhances security by restricting the set of endpoints Snowflake can connect to.
  • The CREATE NETWORK RULE command adds the list of API hosts that the Snowflake account is allowed to reach.
CREATE [ OR REPLACE ] NETWORK RULE <nw_rule_name>
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('<api_host>')

Step 2: Creating Secret

  • Store the credentials used to authenticate against the API securely as secrets in Snowflake.
  • The CREATE SECRET command stores credentials, such as a username and password, used to authenticate against the API host added to the network rule in Step 1.
Basic Authentication
CREATE [ OR REPLACE ] SECRET <secret_name>
TYPE = PASSWORD
USERNAME = '<username>'
PASSWORD = '<password>'
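
Once the secret is attached to a UDF or stored procedure (via the SECRETS clause) together with the external access integration from Step 3, the handler can read it at runtime through the _snowflake module. Below is a minimal sketch, assuming a secret alias of 'cred' and a placeholder endpoint:

# Sketch: reading the secret inside a Python UDF/stored procedure handler.
# Assumes the object was created with SECRETS = ('cred' = <secret_name>) and the
# external access integration from Step 3; the endpoint below is a placeholder.
import _snowflake  # available only inside Snowflake Python UDFs/procedures
import requests

def call_protected_api():
    creds = _snowflake.get_username_password('cred')  # secret created in Step 2
    response = requests.get(
        'https://<api_host>/endpoint',
        auth=(creds.username, creds.password),  # basic authentication
        timeout=30,
    )
    return response.json()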

Step 3: Creating External Access Integration

  • Specify the network rule and secrets used to connect to the APIs via an external access integration.
  • The CREATE EXTERNAL ACCESS INTEGRATION command bundles the allowed network rules and secrets so that they can be used securely in UDFs or stored procedures.
CREATE [ OR REPLACE ] EXTERNAL ACCESS INTEGRATION <ext_integration_name>
ALLOWED_NETWORK_RULES = (<nw_rule_name>)
[ ALLOWED_AUTHENTICATION_SECRETS = (<secret_name>) ]
ENABLED = TRUE

Step 4: External Call


There are multiple methods to call external APIs – UDFs, stored procedures, or direct calls from Snowflake Notebooks (a preview feature at the time of writing). Let’s explore Snowflake Notebooks to make external calls via Python; a sketch of the UDF/stored-procedure route follows the list below. Snowflake Notebooks offer an interactive environment to write your logic in SQL or Python.

  • To make API calls from a particular notebook, enable the external access integration created in Step 3 for that notebook. This can be done from the ‘Notebook settings’ option available for Snowflake Notebooks.
  • After importing the required libraries, call the required APIs and save the response object.
  • Leverage the Snowflake Snowpark framework to operate on the data frames and save the results to Snowflake tables.
  • Use Snowflake’s native functions to flatten and format the semi-structured data that is typically received as a response from the API calls.
  • The transformed API data can then be combined with the operational or modeled data in Snowflake.
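
As an alternative to the notebook route mentioned above, the same integration can back a Snowpark-registered stored procedure. The sketch below is illustrative only: it assumes a recent snowflake-snowpark-python version that supports the external_access_integrations argument, and the procedure name, endpoint, and integration name are placeholders.

# Sketch: registering a stored procedure that calls an external API through the
# integration created in Step 3 (names and the endpoint are placeholders).
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StringType

session = get_active_session()

def fetch_crew_hours(session) -> str:
    import json
    import requests
    # Placeholder endpoint; replace with the regulatory service URL
    resp = requests.get('https://<api_host>/crew-hours', timeout=30)
    return json.dumps(resp.json())

session.sproc.register(
    func=fetch_crew_hours,
    name='SP_FETCH_CREW_HOURS',
    return_type=StringType(),
    packages=['snowflake-snowpark-python', 'requests'],
    external_access_integrations=['<ext_integration_name>'],  # integration from Step 3
    replace=True,
)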

Configuration: Creating a network rule and external access integration.

CREATE OR REPLACE NETWORK RULE NW_RULE_PUBLIC_API
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('geocoding-api.open-meteo.com');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION EAI_PUBLIC_API
ALLOWED_NETWORK_RULES = (NW_RULE_PUBLIC_API)
ENABLED = TRUE;

GET API Request: Making a GET request to a public geocoding REST API

import pandas as pd
import requests

def get_data_from_marine_api():
    # Public geocoding endpoint, used here as an example source
    url = 'https://geocoding-api.open-meteo.com/v1/search?name=Singapore&count=10&language=en&format=json'
    headers = {"content-type": "application/json"}
    response = requests.get(url, headers=headers)
    return response

response = get_data_from_marine_api()
data = response.json()
data_frame = pd.json_normalize(data)  # flatten the top-level JSON into a pandas DataFrame

Using Snowpark: To save the RAW response to the Landing Zone table.

from snowflake.snowpark.context import get_active_session

session = get_active_session()
df1 = session.create_dataframe(data_frame)
df1.write.mode("overwrite").save_as_table("RAW_GEO_LOCATIONS")

Using Snowpark: To flatten the JSON for further transformations and combine it with operational data for downstream business rules and logic. This notebook can be orchestrated in Snowflake to synchronize with the existing data pipelines.

from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, table_function

session = get_active_session()
flatten_function = table_function("flatten")

geo_locations_raw = session.table("RAW_GEO_LOCATIONS")

# Explode the nested RESULTS array and drop the bookkeeping columns added by FLATTEN
geo_locations_tr = geo_locations_raw.join_table_function(
    flatten_function(geo_locations_raw["RESULTS"])
).drop(["SEQ", "PATH", "RESULTS", "THIS", "GENERATIONTIME_MS"])

geo_locations_trf = geo_locations_tr.select(
    col("index").alias("index"),
    col("VALUE")["country"].alias("country"),
    col("VALUE")["country_code"].alias("country_code"),
    col("VALUE")["longitude"].alias("long"),
    col("VALUE")["latitude"].alias("lat"),
    col("VALUE")["name"].alias("name"),
    col("VALUE")["population"].alias("population"),
    col("VALUE")["timezone"].alias("timezone"),
    col("VALUE")["elevation"].alias("elevation"),
)

geo_locations_trf.write.mode("overwrite").save_as_table("TR_GEO_LOCATIONS")

The Snowflake External Access Integration advantage

  • A native Snowflake feature, which eliminates the need to move data between environments.
  • Integrates quickly with existing data pipelines in Snowflake, which keeps maintenance simple.
  • Snowflake’s Snowpark features and native functions can be used for any data transformations.
  • Snowflake’s unified compute environment lowers cost and improves pipeline efficiency by reducing latency.
  • External access integration can be used to call not only REST APIs but also SOAP-based web services.

Below is sample code for calling SOAP-based services:

import requests

def get_data_from_web_service():
    url = 'https://www.w3schools.com/xml/tempconvert.asmx'
    headers = {"content-type": "application/soap+xml"}
    # SOAP 1.2 request envelope (note the standard http:// W3C namespace URIs)
    xml = """
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
    <soap12:Body>
    <CelsiusToFahrenheit xmlns="https://www.w3schools.com/xml/">
        <Celsius>20</Celsius>
    </CelsiusToFahrenheit>
    </soap12:Body>
</soap12:Envelope>"""
    response = requests.post(url, headers=headers, data=xml)
    return response

response = get_data_from_web_service()
print(response.content)
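
The SOAP response comes back as XML, so it typically needs a small parsing step before it can be combined with other data. Here is a minimal sketch using the Python standard library, matching on the element’s local name so the exact namespace does not matter:

import xml.etree.ElementTree as ET

root = ET.fromstring(response.content)

# Pick out the conversion result element regardless of its namespace prefix
result = next(
    elem.text for elem in root.iter()
    if elem.tag.endswith('CelsiusToFahrenheitResult')
)
print(result)  # expected to be '68' for an input of 20 Celsius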

Summary

The maritime industry, like many others, is embracing digital transformation, driven by the increasing volume and variety of data from complex systems, sensors, agencies, and regulatory bodies. This shift opens new opportunities for leveraging data from diverse sources to drive advanced analytics and machine learning. Snowflake provides a robust platform to support these efforts, offering efficient integration capabilities and external access features that make it easy to handle data from REST APIs. Its flexibility and scalability make Snowflake a valuable tool in helping the maritime industry harness the power of data for improved decision-making and operational efficiency.

Tiger’s Snowpark-Based Framework for Snowflake: Illuminating the Path to Efficient Data Ingestion
https://www.tigeranalytics.com/perspectives/blog/tigers-snowpark-based-framework-for-snowflake-illuminating-the-path-to-efficient-data-ingestion/
Thu, 25 Apr 2024 07:05:45 +0000

In the era of AI and machine learning, efficient data ingestion is crucial for organizations to harness the full potential of their data assets. Tiger’s Snowpark-based framework addresses the limitations of Snowflake’s native data ingestion methods, offering a highly customizable and metadata-driven approach that ensures data quality, observability, and seamless transformation.

In the fast-paced world of E-commerce, inventory data is a goldmine of insights waiting to be unearthed. Imagine an online retailer with thousands of products, each with their own unique attributes, stock levels, and sales history. By efficiently ingesting and analyzing this inventory data, the retailer can optimize stock levels, predict demand, and make informed decisions to drive growth and profitability. As data volumes continue to grow and the complexity of data sources increases, the importance of efficient data ingestion becomes even more critical.

With advancements in artificial intelligence (AI) and machine learning (ML), the demand for real-time and accurate data ingestion has reached new heights. AI and ML models require a constant feed of high-quality data to train, adapt, and deliver accurate insights and predictions. Consequently, organizations must prioritize robust data ingestion strategies to harness the full potential of their data assets and stay competitive in the AI-driven era.

Challenges with Existing Data Ingestion Mechanisms

While platforms like Snowflake offer powerful data warehousing capabilities, the native data ingestion methods provided by Snowflake, such as Snowpipe and the COPY command, often face limitations that hinder scalability, flexibility, and efficiency.

Limitations of the COPY Method

  • Data Transformation Overhead: Extensive transformation during the COPY process introduces overhead; such transformations are better performed after loading.
  • Limited Horizontal Scalability: COPY struggles to scale efficiently with large data volumes, underutilizing warehouse resources.
  • File Format Compatibility: Complex formats like Excel require preprocessing for compatibility with Snowflake’s COPY INTO operation.
  • Data Validation and Error Handling: Snowflake’s validation during COPY is limited; additional checks can burden performance.
  • Manual Optimization: Achieving optimal performance with COPY demands meticulous file size and concurrency management, adding complexity.

Limitations of Snowpipe

  • Lack of Upsert Support: Snowpipe lacks direct upsert functionality, necessitating complex workarounds.
  • Limited Real-Time Capabilities: While near-real-time, Snowpipe may not meet the needs for instant data availability or complex streaming transformations.
  • Scheduling Flexibility: Continuous operation limits precise control over data loading times.
  • Data Quality and Consistency: Snowpipe offers limited support for data validation and transformation, requiring additional checks.
  • Limited Flexibility: Snowpipe is optimized for streaming data into Snowflake, limiting custom processing and external integrations.
  • Support for Specific Data Formats: Snowpipe supports delimited text, JSON, Avro, Parquet, ORC, and XML (using Snowflake XML format), necessitating conversion for unsupported formats.

Tiger’s Snowpark-Based Framework – Transforming Data Ingestion

To address these challenges and unlock the full potential of data ingestion, organizations are turning to innovative solutions that leverage advanced technologies and frameworks. One such solution we’ve built is Tiger’s Snowpark-based framework for Snowflake.

Our solution transforms data ingestion by offering a highly customizable framework driven by metadata tables. Users can efficiently tailor ingestion processes to various data sources and business rules. Advanced auditing and reconciliation ensure thorough tracking and resolution of data integrity issues. Additionally, built-in data quality checks and observability features enable real-time monitoring and proactive alerting. Overall, the Tiger framework provides a robust, adaptable, and efficient solution for managing data ingestion challenges within the Snowflake ecosystem.

Snowpark based framework

Key features of Tiger’s Snowpark-based framework include:

Configurability and Metadata-Driven Approach:

  • Flexible Configuration: Users can tailor the framework to their needs, accommodating diverse data sources, formats, and business rules.
  • Metadata-Driven Processes: The framework utilizes metadata tables and configuration files to drive every aspect of the ingestion process, promoting consistency and ease of management (a simplified sketch of this idea follows the feature list below).

Advanced Auditing and Reconciliation:

  • Detailed Logging: The framework provides comprehensive auditing and logging capabilities, ensuring traceability, compliance, and data lineage visibility.
  • Automated Reconciliation: Built-in reconciliation mechanisms identify and resolve discrepancies, minimizing errors and ensuring data integrity.

Enhanced Data Quality and Observability:

  • Real-Time Monitoring: The framework offers real-time data quality checks and observability features, enabling users to detect anomalies and deviations promptly.
  • Custom Alerts and Notifications: Users can set up custom thresholds and receive alerts for data quality issues, facilitating proactive monitoring and intervention.

Seamless Transformation and Schema Evolution:

  • Sophisticated Transformations: Leveraging Snowpark’s capabilities, users can perform complex data transformations and manage schema evolution seamlessly.
  • Adaptability to Changes: The framework automatically adapts to schema changes, ensuring compatibility with downstream systems and minimizing disruption.
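
To make the metadata-driven idea concrete, here is a deliberately simplified Snowpark sketch, not the actual Tiger framework code: a hypothetical INGESTION_CONFIG table (with STAGE_PATH, FILE_FORMAT, TARGET_TABLE, and ENABLED columns) drives which staged files are loaded into which target tables.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

def run_ingestion(session: Session) -> None:
    # Read the (hypothetical) metadata table that drives this ingestion run
    config_rows = (
        session.table("INGESTION_CONFIG")
        .filter(col("ENABLED") == True)
        .collect()
    )
    for row in config_rows:
        # The loading step is simplified here; the real framework layers auditing,
        # reconciliation, and data quality checks around it
        session.sql(
            f"COPY INTO {row['TARGET_TABLE']} "
            f"FROM {row['STAGE_PATH']} "
            f"FILE_FORMAT = (FORMAT_NAME = {row['FILE_FORMAT']})"
        ).collect()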

Data continues to be the seminal building block that determines the accuracy of the output. As businesses race through this data-driven era, investing in robust and future-proof data ingestion frameworks will be key to translating data into real-world insights.

Migrating from Legacy Systems to Snowflake: Simplifying Excel Data Migration with Snowpark Python
https://www.tigeranalytics.com/perspectives/blog/migrating-from-legacy-systems-to-snowflake-simplifying-excel-data-migration-with-snowpark-python/
Thu, 18 Apr 2024 05:29:21 +0000

Discover how Snowpark Python streamlines the process of migrating complex Excel data to Snowflake, eliminating the need for external ETL tools and ensuring data accuracy.

A global manufacturing company is embarking on a digital transformation journey, migrating from legacy systems, including Oracle databases and QlikView for visualization, to Snowflake Data Platform and Power BI for advanced analytics and reporting. What does a day in the life of their data analyst look like?

Their workday is consumed by the arduous task of migrating complex Excel data from legacy systems to Snowflake. They spend hours grappling with detailed Excel files, trying to navigate multiple headers, footers, subtotals, formulas, macros, and custom formatting. The manual process is time-consuming and error-prone, and it hinders their ability to focus on deriving valuable insights from the data.

To simplify their workday, the data analyst can leverage Snowpark Python to streamline the process. They can effortlessly access and process Excel files directly within Snowflake, eliminating the need for external ETL tools or complex migration scripts. With just a few lines of code, they can automate the extraction of data from Excel files, regardless of their complexity. Formulas, conditional formatting, and macros are handled seamlessly, ensuring data accuracy and consistency.

Many businesses today grapple with the complexities of Excel data migration. Traditional ETL scripts may suffice for straightforward data migration, but heavily customized processes pose significant challenges. That’s where Snowpark Python comes into the picture.

Snowpark Python: Simplifying Excel Data Migration

Snowpark Python presents itself as a versatile tool that simplifies the process of migrating Excel data to Snowflake. By leveraging Snowpark’s file access capabilities, users can directly access and process Excel files within Snowflake, eliminating the need for external ETL tools or complex migration scripts. This approach not only streamlines the migration process but also ensures data accuracy and consistency.

With Snowpark Python, businesses can efficiently extract data from Excel files, regardless of their complexity. Python’s rich ecosystem of libraries enables users to handle formulas, conditional formatting, and macros in Excel files. By integrating Python scripts seamlessly into Snowflake pipelines, the migration process can be automated, maintaining data quality throughout. This approach not only simplifies the migration process but also enhances scalability and performance.


Tiger Analytics’ Approach to Excel Data Migration using Snowpark Python

At Tiger Analytics, we’ve worked with several Fortune 500 clients on data migration projects. In doing so, we’ve found a robust solution: using Snowpark Python to tackle this problem head-on. Here’s how it works.

We crafted Snowpark code that seamlessly integrates Excel libraries to facilitate data loading into Snowflake. Our approach involves configuring a metadata table within Snowflake to store essential details such as Excel file names, sheet names, and cell information. By utilizing Snowpark Python and standard stored procedures, we have implemented a streamlined process that extracts configurations from the metadata table and dynamically loads Excel files into Snowflake based on these parameters. This approach ensures data integrity and accuracy throughout the migration process, empowering businesses to unlock the full potential of their data analytics workflows within Snowflake. As a result, we’re able to not only accelerate the migration process but also future-proof data operations, enabling organizations to focus on deriving valuable insights from their data.
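
As a rough illustration of this pattern (not the actual Tiger implementation), the sketch below assumes a hypothetical metadata table EXCEL_FILE_CONFIG with FILE_PATH, SHEET_NAME, SKIP_ROWS, and TARGET_TABLE columns, Excel files staged uncompressed in a Snowflake stage, and the pandas and openpyxl packages available to the session.

import os
import tempfile

import pandas as pd
from snowflake.snowpark.context import get_active_session

session = get_active_session()

for row in session.table("EXCEL_FILE_CONFIG").collect():
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Download the staged Excel file to a local working directory
        # (assumes the file was staged with AUTO_COMPRESS = FALSE)
        session.file.get(row["FILE_PATH"], tmp_dir)
        local_file = os.path.join(tmp_dir, os.path.basename(row["FILE_PATH"]))

        # Read the configured sheet, skipping the custom header rows noted in the metadata
        pdf = pd.read_excel(
            local_file,
            sheet_name=row["SHEET_NAME"],
            skiprows=row["SKIP_ROWS"],
            engine="openpyxl",
        )

        # Land the extracted data in the configured Snowflake table
        session.create_dataframe(pdf).write.mode("append").save_as_table(row["TARGET_TABLE"])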

The advantage of using Snowpark Python is that it enables new use cases for Snowflake customers, allowing them to ingest data from specialized file formats without the need to build and maintain external file ingestion processes. This results in faster development lifecycles, reduced time spent managing various cloud provider services, lower costs, and more time spent adding business value.

For organizations looking to modernize data operations and migrate Excel data from legacy systems into Snowflake, Snowpark Python offers a useful solution. With the right partners and supporting tech, a seamless data migration will pave the way for enhanced data-driven decision-making.

Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security
https://www.tigeranalytics.com/perspectives/blog/invisible-threats-visible-solutions-integrating-aws-macie-and-tiger-data-fabric-for-ultimate-security/
Thu, 07 Mar 2024 07:03:07 +0000

Data defenses are now fortified against potential breaches with the Tiger Data Fabric-AWS Macie integration, automating sensitive data discovery, evaluation, and protection in the data pipeline for enhanced security. Explore how to integrate AWS Macie into a data fabric.

Discovering and handling sensitive data in a data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and dealing with the associated costs of resources and computing. Identifying sensitive information at the entry point of the data pipeline, typically during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. At Tiger Analytics, we’ve integrated these capabilities into the pipelines of our proprietary Data Fabric solution, Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has been enhanced recently to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information and results are captured and stored at another path in the S3 bucket.

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate “sensitive data management” into their data pipelines, here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data in various stages of processing: a raw data bucket for uploading objects into the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket that holds objects in which sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage business logic and execute the steps mentioned below:
    • triggerMacieJob: This Lambda function creates a Macie sensitive data discovery job on the designated S3 bucket during the scan stage (a minimal boto3 sketch is shown after this component list).
    • pollWait: This Step Functions wait state pauses until a specific state is reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step Functions Choice state determines whether the job has finished. If it has, it triggers the subsequent steps.
    • waitForJobToComplete: This Step Functions Choice state waits for the job to complete and prevents the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database and ensures that all job statuses are accurately reflected in the database.
  • A Macie scan job scans the specified S3 bucket for sensitive data. The process of creating the Macie job involves multiple steps, allowing us to choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 bucket path for the sensitive data buckets to scope the scan and avoid unnecessary scanning of other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the AWS Data Fabric, we have automated the selection of custom identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be seen on the AWS console and filtered by S3 bucket. We employed Glue jobs to parse the results and route the data to the manual review bucket and raw buckets. The Macie job execution time is around 4-5 minutes. After scanning, if there are fewer than 1000 sensitive records, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks for severity and the report in the specified path. If sensitive data is detected, it is moved to the quarantine bucket. We format this data into parquet and process it using Spark data frames.

    • If we peruse the Parquet file, the sensitive data can be clearly seen in the SSN and phone number columns.
    • After the sensitive data is found, the same file is moved to the quarantine bucket.

      If there are no sensitive records, the data is moved to the raw zone, from where it is sent onward to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or implement custom Airflow on EC2 or EKS (a short DAG wiring sketch follows this list).
    • GlueJobOperator: Executes all the Glue jobs pre and post-Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to be completed.
    • StepFunctionGetExecutionOutputOperator: Gets the output from the Step Function.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.
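
To make the triggerMacieJob step concrete, here is a hedged boto3 sketch of creating a one-time Macie classification job. Bucket names, the account ID, prefixes, and identifier IDs are placeholders passed in by the pipeline; this is not the actual Tiger Data Fabric code.

import uuid
import boto3

macie = boto3.client("macie2")

def lambda_handler(event, context):
    # Create a one-time classification job scoped to the scan bucket and prefix
    response = macie.create_classification_job(
        jobType="ONE_TIME",
        name=f"df-sensitive-scan-{uuid.uuid4()}",
        s3JobDefinition={
            "bucketDefinitions": [
                {
                    "accountId": event["account_id"],
                    "buckets": [event["scan_bucket"]],
                }
            ],
            "scoping": {
                "includes": {
                    "and": [
                        {
                            "simpleScopeTerm": {
                                "comparator": "STARTS_WITH",
                                "key": "OBJECT_KEY",
                                "values": [event["scan_prefix"]],  # e.g. data/src_sys_id=100/...
                            }
                        }
                    ]
                }
            },
        },
        managedDataIdentifierSelector="INCLUDE",
        managedDataIdentifierIds=event.get("identifier_ids", []),  # identifiers chosen for the scan
    )
    return {"jobId": response["jobId"]}

Similarly, the Airflow operators listed above can be wired into a simple DAG. This is only an illustrative skeleton: the DAG name, state machine ARN, and schedule are placeholders, and the syntax assumes a recent Airflow 2.x release with the Amazon provider package installed.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.step_function import StepFunctionStartExecutionOperator
from airflow.providers.amazon.aws.sensors.step_function import StepFunctionExecutionSensor

with DAG(dag_id="macie_scan_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    start_scan = StepFunctionStartExecutionOperator(
        task_id="start_macie_scan",
        state_machine_arn="arn:aws:states:us-east-2:111111111111:stateMachine:macie-scan",  # placeholder
    )
    wait_for_scan = StepFunctionExecutionSensor(
        task_id="wait_for_macie_scan",
        execution_arn=start_scan.output,
    )
    start_scan >> wait_for_scan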

Things to Keep in Mind for Effective Implementation

Based on our experience integrating AWS Macie with the Tiger Data Fabric, here are some points to keep in mind for an effective integration:

  • Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning the S3 buckets/objects and generates reports that can be consumed by various users, who can then act on them. But if the requirement is to string it into a pipeline and automate the action based on the reports, then a custom process must be created.
  • Macie stops reporting the location of sensitive data after 1000 occurrences of the same detection type. However, this quota can be increased by requesting AWS. It is important to keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset. If the sensitive data occurrences per detection type exceed 1000, we move the entire file to the quarantine zone.
  • For certain data elements that Macie doesn’t consider sensitive, custom data identifiers help a lot. They can be defined via regular expressions, and their sensitivity can also be customized. Organizations whose data governance teams deem certain data sensitive internally can use this feature.
  • Macie also provides an allow list—this helps in ignoring some of the data elements that Macie tags as sensitive data by default.

The AWS Macie – Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as regular expressions for data sensitivity and suppression rules within the data fabrics they work on, data engineers gain greater control over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms
https://www.tigeranalytics.com/perspectives/blog/spark-snowflake-connector-in-depth-analysis-of-internal-mechanisms/
Thu, 25 Jun 2020 10:50:56 +0000

Examine the internal workings of the Spark-Snowflake Connector with a clear breakdown of how the connector integrates Apache Spark with Snowflake for enhanced data processing capabilities. Gain insights into its architecture, key components, and techniques for seamlessly optimizing performance during large-scale data operations.

Introduction

The Snowflake Connector for Spark enables the use of Snowflake as a Spark data source, similar to other sources like PostgreSQL, HDFS, and S3. Though most data engineers use Snowflake, what happens internally is a mystery to many. Only by understanding the underlying architecture and how it functions can one fine-tune performance and troubleshoot issues that might arise. This blog therefore aims to explain in detail the internal architecture and functioning of the Snowflake Connector.

Before getting into the details, let us understand what happens when one does not use the Spark-Snowflake Connector.

Data Loading to Snowflake without the Spark-Snowflake Connector

Create Staging Area -> Load local files -> Staging area in cloud -> Create file format -> Load to Snowflake from staging area using the respective file format

Loading Data from Local to Snowflake
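
For reference, the manual flow above might look roughly like the following when scripted with the Snowflake Python connector (stage, file format, table, and connection details are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

cur.execute("CREATE STAGE IF NOT EXISTS MY_STAGE")                                  # create staging area
cur.execute("PUT file:///tmp/sales.csv @MY_STAGE")                                  # upload local file (auto-compressed to .gz)
cur.execute("CREATE FILE FORMAT IF NOT EXISTS MY_CSV TYPE = CSV SKIP_HEADER = 1")   # create file format
cur.execute(
    "COPY INTO SALES FROM @MY_STAGE/sales.csv.gz "
    "FILE_FORMAT = (FORMAT_NAME = MY_CSV)"
)                                                                                   # load into the target table
conn.close()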

Data Loading using Spark-Snowflake Connector

When we use the Spark-Snowflake Connector to load data into Snowflake, it does a lot of work that is abstracted away from us. The connector takes care of all the heavy lifting.


Spark Snowflake Connector (Source: https://docs.snowflake.net/manuals/user-guide/spark-connector-overview.html#interaction-between-snowflake-and-spark)

This blog illustrates one such example where the Spark-Snowflake Connector is used to read and write data in Databricks. Databricks has integrated the Snowflake Connector for Spark into the Databricks Unified Analytics Platform to provide native connectivity between Spark and Snowflake.

The Snowflake Spark Connector supports Internal (temporary location managed automatically by Snowflake) and External (temporary location for data transfer managed by the user) transfer modes. Here is a brief description of the two modes of transfer:

Internal Data Transfer

This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

External Data Transfer

External data transfer, on the other hand, is facilitated through a storage location that the user specifies. The storage location must be created and configured as part of the Spark connector installation/configuration.

Further, the files created by the connector during external transfer are intended to be temporary, but the connector does not automatically delete these files from the storage location.

Use Cases

Below are the use cases we are going to run on Spark to see how the Spark-Snowflake Connector works internally:

1. Initial Loading from Spark to Snowflake

2. Loading the same Snowflake table in Overwrite mode

3. Loading the same Snowflake table in Append mode

4. Read the Snowflake table

Snowflake Connection Parameters


1. Initial Loading from Spark to Snowflake

When a new table is loaded for the very first time from Spark to Snowflake, the following command will be running on Spark. This command, in turn, starts to execute a set of SQL queries in Snowflake using the connector.

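The original post showed this command as a screenshot. A representative version of such a write, with df as an existing Spark DataFrame and placeholder connection options, might look like this:

# Connection options are placeholders; an options dictionary like this is typical
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

(df.write
   .format("snowflake")            # short-form source name available in Databricks
   .options(**sf_options)
   .option("dbtable", "EMPLOYEE")  # placeholder target table
   .mode("overwrite")
   .save())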

The single Spark command above triggers the following 9 SQL queries in Snowflake

Snowflake Initial Load Query History

i) Spark, by default, uses the local time zone. This can be changed by using the sfTimezone option in the connector

ii) The query below creates a temporary internal stage in Snowflake. Other cloud storage locations can also be used by configuring them in Spark.

iii) The GET command downloads data files from a Snowflake stage to a local directory/folder on a client machine. Metadata checks happen at this stage.

iv) The PUT command uploads (i.e., stages) data files from a local directory/folder on a client machine to one of the Snowflake stages.

v) The DESC command failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It won’t throw any error in the Spark Job

vi) The IDENTIFIER keyword is used to identify objects by name, using string literals, session variables, or bind variables.

vii) The command below loads data into a temporary table in Snowflake to maintain consistency. By doing so, Spark follows a write-all-or-nothing approach.

viii) The DESC command below failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It didn’t stop the process. The metadata check at this point is to see whether to use RENAME TO or SWAP WITH

ix) The RENAME TO command renames the specified table with a new identifier that is not currently used by any other table in the schema. If the table is already present, we have to use SWAP WITH and then drop the identifier

2. Loading the same Snowflake table in Overwrite mode

The above Spark command triggers the following 10 SQL queries in Snowflake. This time there is no failure when the overwrite command is run a second time, because the table already exists.

i) We have metadata checks at the internal stages

ii) The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. It also swaps all access control privilege grants. The two tables are essentially renamed in a single transaction.

The RENAME TO command is used when the table is not present, because it is faster than swapping and dropping the intermediate table. But this can only be used when the table does not exist in Snowflake, which means that RENAME TO is only performed during the initial load.

iii) The DROP command drops the intermediate staging table

3. Loading the same Snowflake table in Append mode
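
The append run (shown as a screenshot in the original post) is essentially the same write with the mode changed; a representative version, reusing the options defined earlier:

(df.write
   .format("snowflake")
   .options(**sf_options)
   .option("dbtable", "EMPLOYEE")  # same placeholder table as before
   .mode("append")
   .save())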

The above Spark command triggers the following 7 SQL queries in Snowflake.

Note: When we use OVERWRITE mode, the data is copied into the intermediate staged table, but during APPEND, the data is loaded into the actual table in Snowflake.

i) In order to maintain ACID compliance, this mode wraps all statements inside a BEGIN and COMMIT block. If anything goes wrong, it issues a ROLLBACK so that the previous state of the table is untouched.

4. Reading the Snowflake Table
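
A representative read (again with placeholder options; the original post showed this as a screenshot) defines the DataFrame lazily:

snow_df = (spark.read
           .format("snowflake")
           .options(**sf_options)
           .option("dbtable", "EMPLOYEE")
           .load())   # lazy: only the schema is resolved at this point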

The above Spark command triggers the following SQL query in Snowflake. The reason for this is that Spark follows the Lazy Execution pattern. So until an action is performed, it will not read the actual data. Spark internally maintains lineage to process it effectively. The following query is to check whether the table is present or not and to retrieve only the schema of the table.

The Spark action below triggers 5 SQL queries in Snowflake
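
For example, an action as simple as the following forces the connector to actually unload and transfer the data:

snow_df.show(10)   # any action (show, count, collect, write) triggers the transfer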

i) First, it creates a temporary internal stage to load the read data from Snowflake.

ii) Next, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

iii) The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ.

iv) The data is then copied from Snowflake to the internal stage.

v) Finally, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Wrapping Up

The Spark-Snowflake Connector comes with many benefits, such as query pushdown and column mapping. It acts as an abstraction layer and does a lot of the groundwork behind the scenes.

Happy Learning!!
