Azure Archives - Tiger Analytics

Navigating the Digital Seas: How Snowflake’s External Access Integration Streamlines Maritime Data Management
https://www.tigeranalytics.com/perspectives/blog/navigating-the-digital-seas-how-snowflakes-external-access-integration-streamlines-maritime-data-management/
Fri, 24 Jan 2025 13:10:32 +0000

The maritime industry is increasingly adopting digital transformation to manage vast amounts of data from ships, sensors, weather, and third-party APIs. Snowflake’s External Access Integration simplifies this process by allowing seamless integration of real-time data without duplication. Read on to know how this feature works in practice and how it supports better, data-driven outcomes in the maritime sector.

As the maritime industry navigates tremendous volumes of data, the call for accelerated digitalization is stronger than ever. The maritime sector is a vast and intricate ecosystem where data flows continuously across interconnected sectors—from vessel management and maintenance to fuel optimization and emissions control. As the United Nations Conference on Trade and Development highlighted in its 2024 report, digital transformation through technologies like blockchain, artificial intelligence, and automation is crucial for improving port operations. Ports that have embraced these innovations report reduced waiting times, enhanced cargo tracking, and greater efficiency in transshipment processes.

In this data-intensive environment, operational data from ship-installed software is just the beginning. Third-party sources such as AIS data, weather information, and other cloud applications play a vital role in many maritime use cases. Traditionally, integrating this diverse data—often accessed via REST APIs—required external platforms like AWS Lambda or Databricks.

With Snowflake’s introduction of the External Access Integration feature, maritime organizations can now consolidate API data integration and data engineering workflows within a single, powerful platform. This breakthrough not only simplifies operations but also improves flexibility and efficiency.

Let’s discuss a use case

Suppose we need to retrieve crew rest and work hours data from a third-party regulatory service to generate near real-time, period-specific compliance reports for all vessels managed by a ship manager. These details are made available to the business through REST APIs.

Landscape Dissection and Data Enablement

Let’s assume Snowflake is the chosen cloud data warehouse platform, with Azure serving as the primary solution for data lake requirements. Operational data for vessels from various legacy systems and other sources is integrated into Snowflake. Data pipelines and models are then built on this integrated data to meet business needs. The operational data is ingested into Snowflake through a combination of Snowflake’s native data loading options and the replication tool Fivetran.

Challenges Explained

Outbound REST API calls must be made to retrieve crew rest and work hours data. The semi-structured data from the API response will need to undergo several transformations before it can be integrated with the existing vessel operational data in Snowflake. Additionally, the solution must support the near real-time requirements of the compliance report. The new pipeline should seamlessly align with the current data pipelines for ingestion and transformation, ensuring no disruptions to existing processes.

We now explore Snowflake’s external access integration to address these challenges.

What is Snowflake’s External Access Integration?

Snowflake’s External Access Integration empowers businesses to seamlessly integrate data from diverse external sources and networks, helping them bridge data gaps and providing a holistic view for better decisions. The feature gives users the flexibility to read external data and integrate only what is necessary for the use case, while the majority of the data stays at the source. Key benefits of this feature include:

  • Enabling real-time access to complex third-party data providers
  • Eliminating data duplication
  • Enriching data through selective integration of only the data that benefits your use case
  • Enhancing data-driven decision-making

Leveraging Snowflake’s External Access Integration: A Step-by-Step Guide

Here is a complete walkthrough of the procedures to solve our use case:

Step 1: Creating Network Rule

  • Snowflake enables its accounts to selectively and securely access external databases or services via network rules. This enhances security by limiting the external hosts that Snowflake can connect to.
  • The CREATE NETWORK RULE command lets us list the API hosts that the Snowflake account is allowed to connect to.
CREATE [ OR REPLACE ] NETWORK RULE <nw_rule_name>
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('<api_host>');

Step 2: Creating Secret

  • Securely store the credentials used to authenticate to the API as a secret in Snowflake.
  • The CREATE SECRET command stores credentials such as a username and password, which are used to authenticate against the API host added to the network rule in Step 1.
Basic Authentication
CREATE [ OR REPLACE ] SECRET <secret_name>
  TYPE = PASSWORD
  USERNAME = '<username>'
  PASSWORD = '<password>';

Step 3: Creating External Access Integration

  • Specify the network rule and secrets used to connect to the APIs via an external access integration.
  • The CREATE EXTERNAL ACCESS INTEGRATION command bundles the allowed network rules and secrets so they can be used securely in UDFs, stored procedures, or notebooks.
CREATE [ OR REPLACE ] EXTERNAL ACCESS INTEGRATION <ext_integration_name>
  ALLOWED_NETWORK_RULES = (<nw_rule_name>)
  ALLOWED_AUTHENTICATION_SECRETS = (<secret_name>)
  ENABLED = TRUE;

Step 4: External Call

There are multiple ways to call external APIs: UDFs, stored procedures, or direct calls from Snowflake Notebooks (a preview feature as of this writing). Let’s explore Snowflake Notebooks to make external calls via Python; a sketch of the UDF route follows the list below. Snowflake Notebooks offer an interactive environment to write your logic in SQL or Python.

  • To make API calls from a particular notebook, enable the external access integration created in Step 3 for that notebook. This can be done from the ‘Notebook settings’ options available for Snowflake Notebooks.
  • After importing the required libraries, call the required APIs and save the response object.
  • Leverage the Snowflake Snowpark framework to operate on the data frames and save the results to Snowflake tables.
  • Use Snowflake’s native functions to flatten and format the semi-structured data that is typically received as an API response.
  • The transformed API data can then be combined with the operational or modeled data in Snowflake.
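For the UDF route mentioned above, here is a minimal sketch of a Python UDF that uses the integration and secret created in the earlier steps. The function name, endpoint, and parameter are hypothetical placeholders chosen for illustration, not part of this walkthrough; adapt them to your actual regulatory API.

CREATE OR REPLACE FUNCTION get_crew_hours(vessel_imo VARCHAR)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'fetch'
EXTERNAL_ACCESS_INTEGRATIONS = (<ext_integration_name>)
PACKAGES = ('requests')
SECRETS = ('cred' = <secret_name>)
AS
$$
import _snowflake
import requests

def fetch(vessel_imo):
    # Retrieve the username/password stored in the Snowflake secret referenced as 'cred'
    creds = _snowflake.get_username_password('cred')
    # Hypothetical endpoint for crew rest/work hours; replace with the real provider URL
    url = f'https://<regulatory-api-host>/v1/vessels/{vessel_imo}/crew-hours'
    response = requests.get(url, auth=(creds.username, creds.password))
    return response.json()
$$;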

Configuration: Creating a network rule and external access integration.

CREATE OR REPLACE NETWORK RULE NW_RULE_PUBLIC_API
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('geocoding-api.open-meteo.com');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION EAI_PUBLIC_API
  ALLOWED_NETWORK_RULES = (NW_RULE_PUBLIC_API)
  ENABLED = TRUE;

Get API Request: A GET request to a public REST API (the Open-Meteo geocoding service is used here as an example)

import pandas as pd
import requests

def get_data_from_marine_api():
    # Geocoding lookup for Singapore, returned as JSON
    url = 'https://geocoding-api.open-meteo.com/v1/search?name=Singapore&count=10&language=en&format=json'
    headers = {"content-type": "application/json"}
    response = requests.get(url, headers=headers)
    return response

response = get_data_from_marine_api()
data = response.json()
data_frame = pd.json_normalize(data)

Using Snowpark: To save the RAW response to the Landing Zone table.

from snowflake.snowpark.context import get_active_session

session = get_active_session()
df1 = session.create_dataframe(data_frame)
df1.write.mode("overwrite").save_as_table("RAW_GEO_LOCATIONS")

Using Snowpark: To flatten the JSON for further transformations and combine it with operational data for downstream business rules and logic. This notebook can be orchestrated in Snowflake to synchronize with the existing data pipelines.

from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, table_function

session = get_active_session()
flatten_function = table_function("flatten")

geo_locations_raw = session.table("RAW_GEO_LOCATIONS")

# Explode the RESULTS array and drop the flatten metadata columns
geo_locations_tr = geo_locations_raw.join_table_function(
    flatten_function(geo_locations_raw["RESULTS"])
).drop(["SEQ", "PATH", "RESULTS", "THIS", "GENERATIONTIME_MS"])

# Project the attributes of interest out of the VALUE variant column
geo_locations_trf = geo_locations_tr.select(
    col("index").alias("index"),
    col("VALUE")["country"].alias("country"),
    col("VALUE")["country_code"].alias("country_code"),
    col("VALUE")["longitude"].alias("long"),
    col("VALUE")["latitude"].alias("lat"),
    col("VALUE")["name"].alias("name"),
    col("VALUE")["population"].alias("population"),
    col("VALUE")["timezone"].alias("timezone"),
    col("VALUE")["elevation"].alias("elevation"),
)

geo_locations_trf.write.mode("overwrite").save_as_table("TR_GEO_LOCATIONS")

The Snowflake External Access Integration advantage

  • A native Snowflake feature that eliminates the need to move data from one environment to another.
  • Can be integrated into existing Snowflake data pipelines quickly, which keeps maintenance simple.
  • Snowflake’s Snowpark features and native functions can be used for any data transformations.
  • Snowflake’s unified compute environment reduces cost and improves the efficiency of data pipelines by reducing latency.
  • Via external access integration, users can call not only REST APIs but also web services defined by SOAP protocols.

Below is sample code for calling SOAP-based services:

import requests
def get_data_from_web_service():
    url = f'https://www.w3schools.com/xml/tempconvert.asmx'
    headers = {"content-type": "application/soap+xml"}
    xml ="""
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
    <soap12:Body>
    <CelsiusToFahrenheit xmlns="https://www.w3schools.com/xml/">
        <Celsius>20</Celsius>
    </CelsiusToFahrenheit>
    </soap12:Body>
</soap12:Envelope>"""
    response = requests.post(url,headers = headers,data=xml)
    return response

response = get_data_from_web_service()
print(response.content)
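To extract the converted value from the SOAP response, the returned XML can be parsed with Python’s standard library. This is a minimal sketch assuming the tempconvert response contains a CelsiusToFahrenheitResult element, as in the service’s published example; the element name is an assumption and may need adjusting for other services.

import xml.etree.ElementTree as ET

# Parse the SOAP envelope returned by the service
root = ET.fromstring(response.content)

# Search for the result element, matching by local name to avoid hard-coding namespaces
fahrenheit = None
for element in root.iter():
    if element.tag.endswith("CelsiusToFahrenheitResult"):
        fahrenheit = element.text

print(fahrenheit)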

Summary

The maritime industry, like many others, is embracing digital transformation, driven by the increasing volume and variety of data from complex systems, sensors, agencies, and regulatory bodies. This shift opens new opportunities for leveraging data from diverse sources to drive advanced analytics and machine learning. Snowflake provides a robust platform to support these efforts, offering efficient integration capabilities and external access features that make it easy to handle data from REST APIs. Its flexibility and scalability make Snowflake a valuable tool in helping the maritime industry harness the power of data for improved decision-making and operational efficiency.

A Complete Guide to Enabling SAP Data Analytics on Azure Databricks
https://www.tigeranalytics.com/perspectives/blog/a-complete-guide-to-enabling-sap-data-analytics-on-azure-databricks/
Fri, 22 Sep 2023 14:22:34 +0000

Uncover how SAP data analytics on Azure Databricks empowers organizations by optimizing data processing and analysis and offering a scalable solution for efficient decision-making.


Effectively enabling SAP data analytics on Azure Databricks empowers organizations with a powerful and scalable platform that seamlessly integrates with their existing SAP systems. They can efficiently process and analyze vast amounts of data, enabling faster insights and data-driven decision-making.

So, if you’re a senior technical decision-maker in your organization, choosing the proper strategy to consume and process SAP data with Azure Databricks is critical.

First, let’s explore the types of SAP data and objects from the source system. Then, let’s see how the SAP BW can be accessed and made available to Azure Databricks for further processing, analytics, and storing in a Databricks Lakehouse.

Why Enable SAP Data Analytics on Azure Databricks

While SAP provides its own data warehouse (SAP BW), there is still value in extracting operational SAP data from the SAP sources (S/4HANA or ECC) or from SAP BW into an Azure Databricks Lakehouse and integrating it with other ingested data that may not originate from SAP.

As a practical example, let’s consider a large and globally operating manufacturer that utilizes SAP for its production planning and supply chain management. It must integrate IoT machinery telemetry data stored outside the SAP system with supply chain data in SAP. This is a common use case, and we will show how to design a data integration solution for these challenges.

We will start with a common pattern for accessing and ingesting SAP BW data into the Lakehouse using Azure Databricks and Azure Data Factory/Azure Synapse Pipelines.

This first part also lays important foundations for the material presented later. For example, imagine a company like the manufacturer mentioned earlier, currently operating an SAP system on-premises with an SAP BW system. This company needs to copy some of its SAP BW objects to a Databricks Lakehouse in an Azure ADLS Gen2 account and use Databricks to process, model, and analyze the data.

The scenario of importing from SAP BW raises many questions, but the main ones we would like to focus on are:

  • How should we copy this data from the SAP BW to the Lakehouse?
  • What services and connectivity are going to be required?
  • What connector options are available to us?
  • Can we use a reusable pattern that is easy to implement for similar needs?

The reality is that not every company has the same goals, needs, and skillsets, so there isn’t one solution that fits all needs and customers.

SAP Data Connection and Extraction: What to Consider

There are many ways, tools, and approaches to connect to and extract SAP data, pulling or pushing it from its different layers (database tables, the ABAP layer, CDS views, IDocs, etc.) into the Databricks Lakehouse. For instance, we can use SAP tooling such as SAP SLT, other SAP tooling, or third-party tooling.

It makes sense for customers already in Azure to use the different sets of connectors provided in Azure Data Factory/ Azure Synapse to connect from Azure to the respective SAP BW instance running on-premises or in some other data center.

Depending on the different ways and tools of connecting, SAP licensing also plays an important role and needs to be considered. Also, we highly recommend an incremental stage and load approach, especially for large tables or tables with many updates.

It may be obvious, but in our experience, each organization has a unique set of needs and requirements when it comes to accessing SAP data, and an approach that works for one organization may not be a good fit for another, given its needs, requirements, and available skills. Important factors for smooth operations include licensing, table size, acceptable ingestion latency, expected operational cost, ExpressRoute throughput, and the size and number of nodes of the integration runtime.

Let us look at the available connectors in ADF/ Synapse and when to use each.

How to Use the Right SAP Connector for Databricks and ADF/ Synapse

Right SAP Connector for Databricks & ADF/ Synapse

As of May 2023, ADF and Synapse provide seven different types of connectors for SAP-type sources. The image below shows all of them, filtered by the term “SAP” in the linked-services dialog in the pipeline designer.

Different types of connectors for SAP-type sources

Now, we would like to extract from SAP BW. We can use options 1, 2, 3, and 7 for that task. As mentioned earlier, for larger tables or tables with many updates, we recommend options 1, 7, or 3. The SAP CDC connector (option 3) is still in public preview, and we recommend not using it in production until it is labeled “generally available.” However, the SAP CDC connector can be utilized in lower environments (Dev, QA, etc.). Option 2 will not be fast enough and will likely take too much time to move the data.

There is certainly more to understand about these ADF connectors, about directly connecting Azure Databricks to relevant SAP sources, and about data push approaches initiated by SAP tooling such as SAP SLT. For now, though, the pattern introduced in Azure using connectors in ADF and Synapse is very common and reliable.

SAP ODP: What You Should Know

So far, only the SAP CDC connector from above is fully integrated with SAP ODP (Operational Data Provisioning). As of May 2023, we may still use the other connectors, given that they are stable and have been in production for years. However, it is recommended to plan for greater use of ODP connectors over time, especially for green-field developments about to start within the next months. So let us look closely at SAP ODP and its place in the process.

SAP ODP
Image 2: SAP ODP Role and Architecture

As shown in Image 2, an SAP ODP-based connector acts as a broker between the different SAP data providers and the SAP data consumers. An SAP data consumer like ADF, for instance, does not connect directly to an SAP data provider when connected via the SAP ODP connector. Instead, it connects to the SAP ODP service and sends a request for data to the service. The service then connects to the respective data provider to serve the requested data to the data consumer.

Depending on the configuration of the connection and the respective data provider, data will be served as an incremental or full load and stored in ODP internal queues until a consumer like ADF can fetch it from the queues and write it back out to the Databricks Lakehouse.

SAP ODP-based connectors are the newer approach to accessing SAP data, and as of May 2023, the SAP ODP-based CDC connector is in public preview. SAP ODP isn’t new within the SAP world and has been used internally by SAP for data migration for many years.

One major advantage is that the SAP ODP CDC connector provides a delta mechanism built into ODP, so developers no longer have to maintain their own watermarking logic when using this connector. But given that it is not yet generally available, it should be applied with care and, at this stage, possibly just planned for.

Obviously, we also recommend testing the watermarking mechanism to ensure it fits your specific scenarios.

How to Put It All Together with Azure Databricks

Now, we are ready to connect all the parts and initially land the data into a landing zone within the Lakehouse. This process is shown in Image 3 below. But, before that step, we ingested the SAP operational data from its applications and backend databases (HANA DB, SQL Server, or Oracle Server) to SAP BW. 

Later, we will also look at these SAP BW ingestion jobs and how to migrate their logic to Azure Databricks to be able to refactor them and apply these transformations on top of the source tables imported from SAP ECC or S4/ Hana to the Databricks Lakehouse.

We need to set up a self-hosted integration runtime for ADF to connect to SAP BW via SAP BW Open Hub; ADF requires this service to connect to SAP BW. We also need to install the proper SAP drivers on the machines hosting the integration runtime. Typically, in a production environment, the integration runtime is installed on separate machines, hosted in Azure or on-premises, and uses at least two nodes for higher availability.

Up to four nodes are possible in a cluster, an important consideration when setting up large table loads to meet performance SLAs.

Multiple integrated runtimes will be required for organizations with very large SAP deployments and SAP data ingestion needs. We highly recommend careful capacity and workload planning to ensure the clusters are properly sized and the type of connectivity between Azure and on-premises is sufficient, assuming SAP BW runs on-premises.

From the landing zone, which is just a copy of the original data (typically with a batch ID added), we can use the power of Databricks Delta Lake and Databricks Notebooks, with many language options available (Spark SQL, Python, and Scala are the most common choices), to further process and model the data from the landing zone into the next zones: bronze, silver, and finally the gold layer.
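As an illustration of this landing-to-bronze step, here is a minimal PySpark sketch as it might appear in a Databricks notebook. The storage path, table names, and lineage columns are hypothetical placeholders rather than details from this article, and the spark session is the one Databricks provides in a notebook.

from pyspark.sql import functions as F

# Read the raw SAP BW extract copied by ADF/Synapse into the landing zone (hypothetical path)
landing_df = (
    spark.read.format("parquet")
    .load("abfss://landing@<storage_account>.dfs.core.windows.net/sap_bw/open_hub/<table_name>/")
)

# Add simple lineage columns before persisting to the bronze layer
bronze_df = (
    landing_df
    .withColumn("_batch_id", F.lit("<batch_id>"))
    .withColumn("_ingest_ts", F.current_timestamp())
)

# Append into a Delta table in the bronze layer, organized by source system (hypothetical name)
bronze_df.write.format("delta").mode("append").saveAsTable("bronze_sap.<table_name>")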

Please note this is a slightly simplified view. Typically, the Data Ingestion is part of an overall Data Processing Framework that provides metadata information on tables and sources to be ingested and the type of ingestion desired, such as full or incremental, just as an example. These implementation details aren’t in scope here, but please remember them.

For a secure deployment in Azure, every Azure service that is part of the solution should be configured to use private endpoints so that no traffic ever traverses the public Internet. All the services must be configured explicitly, since private endpoints are not the default setup for PaaS services in Azure.

Complete Dataflow from SAP BW to the Azure Lakehouse
Image 3: Complete Dataflow from SAP BW to the Azure Lakehouse

We recommend integrating data test steps typically executed before the data lands in the gold layer. Tiger’s Data Processing Framework (Tiger Data Fabric) provides a semi-automated process to utilize AI capabilities from Databricks to mature data quality through interaction with a data subject matter expert.

Typically, the Databricks Lakehouse has a clearly defined folder structure within the different layers, and plenty of documentation is available on how to set this up. At a high level, the bronze and silver layers are typically organized by source system or area, in contrast to the gold layer, which is organized around consumption and models the data by expected downstream consumption needs.

Why Use Azure Databricks

1. Smooth integration of SAP data with data originating from other sources.

2. Databricks’ complete arsenal of advanced ML algorithms and strong data processing capabilities compared to the limited availability of advanced analytics in SAP BW.

3. Highly customizable data processing pipelines – easier to enrich the data from bronze and silver to the gold layer in the Lakehouse.

4. Significantly lower total cost of ownership.

5. Easy-to-scale for large data volume processing.

6. Industry-standard and mature for large-scale batch and low-latency event processing.

Undoubtedly, effectively leveraging Azure Databricks ensures that organizations can harness the power of SAP data analytics. We hope this article provided some insights on how you can enable SAP data analytics on Azure Databricks and integrate it with your SAP systems. 

After all, this integration can empower your organization to work with a robust and scalable platform that allows for efficient processing and analysis of large data volumes. 

Harmonizing Azure Databricks and Azure Synapse for Enhanced Analytics
https://www.tigeranalytics.com/perspectives/blog/harmonizing-azure-databricks-and-azure-synapse-for-enhanced-analytics/
Thu, 22 Jun 2023 14:32:44 +0000

Explore integrating Azure Databricks and Azure Synapse for advanced analytics. This guide covers selecting Azure services, unifying databases into a Lakehouse, large-scale data processing, and orchestrating ML training. Discover orchestrating pipelines and securely sharing business data for flexible, maintainable solutions.

Azure Databricks and Azure Synapse are powerful analytical services that complement each other. But choosing the best-fit analytical Azure services can be a make-or-break moment. When done right, it ensures fulfilling end-user experiences while balancing maintainability, cost-effectiveness, security, etc.

So, let’s find out the considerations to pick the right one so that it helps deliver a complete solution to bridge some serious, prevalent gaps in the world of data analytics.

Delivering Tomorrow’s Analytical Needs Today

Nowadays, organizations’ analytical needs are vast and quite demanding in many aspects. For organizations of any size, it is vital to invest in platforms that deliver to these needs and are open enough, secure, cost-effective, and extendible. Some of these needs may include complex, time-consuming tasks such as:

  • Integrating up to 500 individual databases, representing different source systems located in different domains and clouds, into a single location such as a Lakehouse.
  • Processing terabytes or petabytes of data in batches and, increasingly, in near real-time.
  • Training machine learning models on large datasets.
  • Quickly performing exploratory data analysis with environments provisioned on the fly.
  • Quickly visualizing business-related data and easily sharing it with a consumer community.
  • Executing and monitoring thousands of pipelines daily in a scalable and cost-effective manner.
  • Providing self-service capabilities around data governance.
  • Self-optimizing queries over time.
  • Integrating a comprehensive data quality processing layer into the Lakehouse processing.
  • Easily and securely sharing critical business data with peers or partner employees.

Why Azure Databricks Is Great for Enterprise-Grade Data Solutions

According to Microsoft Docs, Azure Databricks is defined as a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. The Azure Databricks Lakehouse Platform integrates cloud storage and security in your cloud account and manages and deploys cloud infrastructure on your behalf.

From a developer’s point of view, Azure Databricks makes it easy to write Python, Scala, and SQL code and execute this code on a cluster to process the data – with many different features. We recommend reviewing the “used for” section of the above link for further details.

Azure Databricks originates from Apache Spark but has many specific optimizations that the open-source version of Spark doesn’t provide. For example, the Photon engine can speed up processing by up to 30% without code optimization or refactoring.

Initially, Azure Databricks was geared more toward data scientists and ML workloads. Over time, however, Databricks added data engineering and general data analytics capabilities to the platform. It provides metadata management via Unity Catalog, which is part of the Databricks platform. Azure Databricks also provides a data-sharing feature that allows secure data sharing across company boundaries.

Azure Databricks is extremely effective for ML processing, with an enormous number of built-in ML libraries, as well as for its data engineering/Lakehouse processing capabilities. Languages such as Python, Scala, and SQL are popular among data professionals, and the platform provides many APIs to interact with and process data into any desired output shape.

Azure Databricks provides Delta Live Tables for developers to generate ingestion and processing pipelines with significantly lower effort. So, it is a major platform bound to see wider adoption as an integral part of any large-scale analytical platform.

How Azure Synapse Speeds Up Data-Driven Time-To-Insights 

According to Microsoft Docs, Azure Synapse is defined as an enterprise analytics service that accelerates time-to-insight across data warehouses and big data systems. Azure Synapse brings together the following:

  • The best SQL technologies used in enterprise data warehousing.
  • Spark technologies used for big data.
  • Data Explorer for log and time series analytics.
  • Pipelines for data integration and ETL/ELT.
  • Deep integration with other Azure services such as Power BI, Cosmos DB & Azure ML.

Azure Synapse integrates several independent services like Azure Data Factory, SQL DW, Power BI, and others under one roof, called Synapse Studio. From a developer’s point of view, Azure Synapse Studio provides the means to write Synapse Pipelines and SQL scripts and execute this code on a cluster to process the data. It also easily integrates many other Azure Services into the development process.

Due to its deep integration with Azure, Azure Synapse effortlessly allows using other related Azure Services, such as Azure Cognitive Services and Cosmos DB. Architecturally, this is important since easy integration of capabilities is a critical criterion when considering platforms.

Azure Synapse shines in the areas of data and security integration. If existing workloads already use many other related Azure Services like Azure SQL, then integration is likely easier than other solutions. Synapse Pipelines can also act as an orchestration layer to invoke other compute solutions within Synapse or Azure Databricks.

This integration from Synapse Pipelines to invoke Databricks Notebooks will be a key area to review further in the next section.

It is vital to note that a self-hosted integration runtime is required for Synapse Pipelines to access on-premises resources or resources behind a firewall. This integration runtime acts as an agent, enabling pipelines to access the data and copy it to a destination defined in the pipeline.

Azure Databricks and Azure Synapse: Better Together

As mentioned earlier (and shown in Image 1), Databricks Notebooks and the code they contain (Spark SQL, Python, or Scala) can be invoked through ADF/Synapse Pipelines and therefore orchestrated. This is where Databricks and Synapse work particularly well together. Image 1 shows what a Synapse Pipeline that moves data from Bronze to Silver looks like.

When completed, it continues to process the data into the gold layer. This is just a basic pattern, and many more patterns can be implemented to increase the reuse and flexibility of the pipeline.

Synapse Pipeline invoking Databricks Notebooks
Image 1: Synapse Pipeline invoking Databricks Notebooks

For instance, we can use parameters supplied from a configuration database (Azure SQL or similar) and have Synapse Pipelines pass those parameters to the respective Databricks Notebooks. This enables parameterized execution of Notebooks, allowing for code reuse and reducing the time required to implement the solution.

Furthermore, the configuration database can supply source system connections and destinations such as databases or Databricks Lakehouses at runtime.
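To make the parameterized execution concrete, here is a minimal sketch of how a Databricks notebook might receive such parameters from a Synapse Pipeline via notebook widgets. The parameter names, table names, and watermark column are hypothetical placeholders chosen for illustration, and spark/dbutils are the objects Databricks provides in a notebook.

# Declare widgets; the Synapse Pipeline's Databricks Notebook activity supplies these as base parameters
dbutils.widgets.text("source_system", "")
dbutils.widgets.text("source_table", "")
dbutils.widgets.text("target_table", "")
dbutils.widgets.text("load_type", "full")

source_system = dbutils.widgets.get("source_system")
source_table = dbutils.widgets.get("source_table")
target_table = dbutils.widgets.get("target_table")
load_type = dbutils.widgets.get("load_type")

# Reusable logic driven by the supplied parameters (table and column names are placeholders)
df = spark.table(f"bronze_{source_system}.{source_table}")
if load_type == "incremental":
    df = df.filter("_ingest_ts >= current_date()")  # simplified watermark filter for illustration

df.write.format("delta").mode("overwrite" if load_type == "full" else "append").saveAsTable(target_table)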

It is also possible to break down a large pipeline into multiple pieces and work with a parent-child pattern. A complete pipeline could consist of several parent-child patterns, one for each layer, for example. Defining these structures at the beginning of the implementation is vital to a maintainable and cost-effective system in the long run. Further abstractions can be added to increase code reuse and integrate a structured and effective testing framework.

While setting up both Azure services (Databricks and Synapse) is an additional effort, we consider it a good investment, especially for larger-scale analytical projects dealing with DE or ML-based workloads.

It also helps to give technical implementation teams options regarding the tooling and language they prefer for a given task; this typically has a positive impact on the timeline while reducing implementation risks.

Final Thoughts

You can easily take the ideas and concepts described here to build a metadata-driven data ingestion system based on Azure Databricks and Synapse.

These concepts can also be applied to ML workloads using Databricks with MLflow, Azure Synapse, and Azure ML.

Also, integrating the Databricks Lakehouse and Unity Catalog is another crucial consideration in designing these solutions.

We hope this article gave some necessary insights on the power of Azure Databricks and Azure Synapse – and how they can be used to deliver modularized, flexible, and maintainable data ingestion and processing solutions.

How to Implement ML Models: Azure and Jupyter for Production
https://www.tigeranalytics.com/perspectives/blog/how-to-implement-ml-models-azure-and-jupyter-for-production/
Thu, 28 May 2020 20:46:56 +0000

Learn how to implement Machine Learning models using Azure and Jupyter for production environments - from model development to deployment, including environment setup, training, and real-time predictions. Understand the advantages of using Azure's robust infrastructure and Jupyter's flexible interface to streamline the entire process.

Introduction

As data scientists, one of the most pressing challenges we face is how to operationalize machine learning models so that they are robust, cost-effective, and scalable enough to handle the traffic demand. With advanced cloud technologies and serverless computing, there are now cost-effective (pay based on usage) and auto-scalable platforms (with scale-in/scale-out architecture depending on the traffic) available. Data scientists can use these to accelerate machine learning model deployment without having to worry about the infrastructure.

This blog discusses one such approach: taking machine learning code and a model developed locally in a Jupyter notebook and implementing them in the Azure environment for real-time predictions.

ML Implementation Architecture

ML Implementation Architecture on Azure

ML Implementation Architecture

We have used Azure Functions to deploy the model scoring and feature store creation code into production. Azure Functions is a FaaS offering (Function as a Service, or FaaS, provides event-based, serverless computing to accelerate development without having to worry about the infrastructure). Azure Functions comes with some interesting functionalities, such as:

1. Choice of Programming Languages

You can work with a language of your choice: C#, Node.js, Java, or Python.

2. Event-driven and Scalable

You can use built-in triggers and bindings such as HTTP trigger, event trigger, timer trigger, and queue trigger to define when a function is invoked. The architecture scales depending on the workload.

ML Implementation process

Once the code is developed, the following are the best practices for making the machine learning code production-ready, along with the steps to deploy the Azure Function.

ML Implementation Process

ML Implementation Process

Azure Function Deployment Steps Walkthrough

The Visual Studio Code editor with the Azure Functions extension is used to create a serverless HTTP endpoint with Python.

1. Sign in to Azure

sign into azure

2. Create a new project. In the prompt that appears, select Python as the language and HTTP trigger as the trigger (based on the requirement).

create new project

3. The Azure Function is created with the folder structure shown below. Write your logic in __init__.py, or copy the code into it if it has already been developed.

azure function folder structure
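For reference, here is a minimal sketch of what a scoring __init__.py for an HTTP-triggered function might look like. The payload field and response shape are hypothetical placeholders; the actual model loading and feature store lookup are covered in the later steps.

import json
import logging

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("Model scoring function triggered.")

    # Hypothetical request payload: {"entity_id": "123"}
    payload = req.get_json()
    entity_id = payload.get("entity_id")

    # Placeholder score; the real value comes from the model loaded from Blob Storage (see below)
    prediction = 0.0

    return func.HttpResponse(
        json.dumps({"entity_id": entity_id, "prediction": prediction}),
        mimetype="application/json",
        status_code=200,
    )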

4. function.json defines the trigger and bindings; in this case, the function is invoked by an HTTP trigger.

function.json

5. local.settings.json contains all the environment variables used in the code as key-value pairs.

settings.json

6. requirements.txt contains all libraries that need to be pip installed

requirements

7. As the model is stored in Blob Storage, add the following code to read it from Blob Storage.

blob
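The screenshot is not reproduced here, so as an indicative sketch (assuming a scikit-learn model serialized with joblib and a connection string kept in the app settings; the setting, container, and blob names are hypothetical):

import io
import os

import joblib
from azure.storage.blob import BlobServiceClient

# Connection string comes from local.settings.json locally, or app settings in Azure
service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
blob_client = service.get_blob_client(container="models", blob="model.pkl")

# Download the serialized model and deserialize it in memory
model_bytes = blob_client.download_blob().readall()
model = joblib.load(io.BytesIO(model_bytes))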

8. Read the Feature Store data from Azure SQL DB

feature store data
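Again as an indicative sketch (assuming pyodbc with a connection string stored in the app settings; the table and column names are hypothetical):

import os

import pandas as pd
import pyodbc

# Connection string is kept in local.settings.json / app settings,
# e.g. "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;Uid=...;Pwd=..."
conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])

# Pull the features for the entity being scored
query = "SELECT * FROM dbo.feature_store WHERE entity_id = ?"
features = pd.read_sql(query, conn, params=["<entity_id>"])

# "model" here is the object loaded from Blob Storage in the previous step
prediction = model.predict(features.drop(columns=["entity_id"]))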

9. Test locally. Choose Debug -> Start Debugging; it will run locally and give a local API endpoint

debug

10. Publish to the Azure account using the following command:

func azure functionapp publish <function_app_name> --build remote --additional-packages "python3-dev libevent-dev unixodbc-dev build-essential libssl-dev libffi-dev"

publish

11. Log in to Azure Portal and go to Azure Functions resource to get the API endpoint for Model Scoring

azure portal

Conclusion

This API can also be integrated with front-end applications for real-time predictions.
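For example, a front-end or batch client could call the deployed endpoint like this (the function app name, function name, key, and payload are hypothetical placeholders):

import requests

url = "https://<function_app_name>.azurewebsites.net/api/<function_name>"
params = {"code": "<function_key>"}  # function-level auth key, if the function is not anonymous
payload = {"entity_id": "123"}

response = requests.post(url, params=params, json=payload)
print(response.json())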

Happy Learning!
