Apache Spark Archives - Tiger Analytics Thu, 16 Jan 2025 09:58:26 +0000 en-US hourly 1 https://wordpress.org/?v=6.8.1 https://www.tigeranalytics.com/wp-content/uploads/2023/09/favicon-Tiger-Analytics_-150x150.png Apache Spark Archives - Tiger Analytics 32 32 Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security https://www.tigeranalytics.com/perspectives/blog/invisible-threats-visible-solutions-integrating-aws-macie-and-tiger-data-fabric-for-ultimate-security/ Thu, 07 Mar 2024 07:03:07 +0000 https://www.tigeranalytics.com/?post_type=blog&p=20726 Data defenses are now fortified against potential breaches with the Tiger Data Fabric-AWS Macie integration, automating sensitive data discovery, evaluation, and protection in the data pipeline for enhanced security. Explore how to integrate AWS Macie into a data fabric.

The post Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security appeared first on Tiger Analytics.

]]>
Discovering and handling sensitive data in the data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and dealing with the associated costs of resources and computing.  Identifying sensitive information at the entry point of the data pipeline, probably during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services . At Tiger Analytics we’ve integrated these features into our pipelines within our proprietary Data Fabric solution called Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has been enhanced recently to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information and results are captured and stored at another path in the S3 bucket.

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate “sensitive data management” into their data pipelines , here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data in various stages of processing. A raw databucket for uploading objects for the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket which harbors objects where sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage business logic and execute the steps mentioned below:
    • triggerMacieJob: This Lambda function creates a Macie-sensitive data discovery job on the designated S3 bucket during the scan stage..
    • pollWait: This Step Function waits for a specific state to be reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step function uses a Choice state to determine if the job has finished. If it has, it triggers additional steps to be executed.
    • waitForJobToComplete: This Step function employs a Choice state to wait for the job to complete and prevent the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database, and ensures that all job statuses are accurately reflected in the database.
  • A Macie scan job scans the specified S3 bucket for sensitive data. The process of creating the Macie job involves multiple steps, allowing us to choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 bucket path for sensitive data buckets to filter out the scan and avoid unnecessary scanning on other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the AWS Data Fabric, we have automated the selection of custom identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be seen on the AWS console and filtered based on S3 Buckets.  We employed Glue jobs to parse the results and route the data to the manual review bucket and raw buckets. The Macie job execution time is around 4-5 minutes. After scanning, if there are less than 1000 sensitive records, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks for severity and the report in the specified path. If sensitive data is detected, it is moved to the quarantine bucket. We format this data into parquet and process it using Spark data frames.

    • If we peruse the parquet file, found below, sensitive data can be clearly seen as SSN and phone number columns.
    • In the quarantine bucket, the same file is being moved after finding the sensitive data.

      If there are no sensitive records, move the data to the raw zone from where data is further sent to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or implement custom airflow on EC2 or EKS.
    • GlueJobOperator: Executes all the Glue jobs pre and post-Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to be completed.
    • StepFunctionGetExecutionOutput Operator: Gets the output from the Step function.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.

Things to Keep in Mind for Effective Implementation

  • Based on our experience integrating AWS Macie with the Tiger Data Fabric, here are some points to keep in mind for an effective integration of AWS Macie. Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning the S3 buckets/objects. It generates reports that can be consumed by various users and accordingly, actions can be taken. But if the requirement is to string it with a pipeline and automate the action, based on the reports, then a custom process must be created.
  • Macie stops reporting the location of sensitive data after 1000 occurrences of the same detection type. However, this quota can be increased by requesting AWS. It is important to keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset. If the sensitive data occurrences per detection type exceed 1000, we move the entire file to the quarantine zone.
  • For certain data elements that Macie doesn’t consider sensitive data, custom data identifiers help a lot. It can be defined via regular expressions and its sensitivity can also be customized. Organizations with data that are deemed sensitive internally by their data governance team can use this feature.
  • Macie also provides an allow list—this helps in ignoring some of the data elements which by default Macie tag as sensitive data.’

The AWS Macie – Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as employing regular expressions for data sensitivity and establishing suppression rules within the data fabrics they are working on, data engineers gain enhanced control and capabilities over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

The post Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security appeared first on Tiger Analytics.

]]>
Automating Data Quality: Using Deequ with Apache Spark https://www.tigeranalytics.com/perspectives/blog/automating-data-quality-using-deequ-with-apache-spark/ Thu, 03 Sep 2020 10:09:56 +0000 https://www.tigeranalytics.com/blog/automating-data-quality-using-deequ-with-apache-spark/ Get to know how to automate data quality checks using Deequ with Apache Spark. Discover the benefits of integrating Deequ for data validation and the steps involved in setting up automated quality checks for improving data reliability in large-scale data processing environments.

The post Automating Data Quality: Using Deequ with Apache Spark appeared first on Tiger Analytics.

]]>
Introduction

When dealing with data, the main factor to be considered is the quality of data. Especially in the Big data environment, data quality is critical. Having inaccurate or flawed data will produce incorrect results in data analytics. Many developers test the data manually before training their model with the available data. This is time-consuming, and there are possibilities of committing mistakes.

Deequ

Deequ is an open-source framework for testing the data quality. It is built on top of Apache Spark and is designed to scale up to large data sets. Deequ is developed and used at Amazon for verifying the quality of many large production datasets. The system computes data quality metrics regularly, verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success.

Deequ allows us to calculate data quality metrics on our data set, and also allows us to define and verify data quality constraints. It also specifies what constraint checks are to be made on your data. There are three significant components in Deequ. These are Constraint Suggestion, Constraint Verification, and Metrics Computation. This is depicted in the image below.

deequ

Deequ Components

This blog provides a detailed explanation of these three components with the help of practical examples.

1. Constraint Verification: You can provide your own set of data quality constraints which you want to verify on your data. Deequ checks all the provided constraints and gives you the status of each check.

2. Constraint Suggestion: If you are not sure of what constraints to test on your data, you can use Deequ’s constraint suggestion. Constraint Suggestion provides a list of possible constraints you can test on your data. You can use these suggested constraints and pass it to Deequ’s Constraint Verification to check your data quality.

3. Metrics Computation: You can also compute data quality metrics regularly.

Now, let us implement these with some sample data.

For this example, we have downloaded a sample csv file with 100 records. The code is run using Intellij IDE (you can also use Spark Shell).

Add Deequ library

You can add the below dependency in your pom.xml

<dependency>

<groupId>com.amazon.deequ</groupId>

<artifactId>deequ</artifactId>

<version>1.0.2</version>

</dependency>

If you are using Spark Shell, you can download the Deequ jar as shown below-

wget

https://repo1.maven.org/maven2/com/amazon/deequ/deequ/1.0.1/deequ-1.0.2.jar

Now, let us start the Spark session and load the csv file to a dataframe.

val spark = SparkSession.builder()

.master(“local”)

.appName(“deequ-Tests”).getOrCreate()

val data = spark.read.format(“csv”)

.option(“header”,true)

.option(“delimiter”,”,”)

.option(“inferschema”,true)

.load(“C:/Users/Downloads/100SalesRecords.csv”)

data.show(5)

The data has now been loaded into a data frame.

data.printSchema

Schema

Note

Ensure that there are no spaces in the column names. Deequ will throw an error if there are spaces in the column names.

Constraint Verification

Let us verify our data by defining a set of data quality constraints.

Here, we have given duplicate check(isUnique), count check (hasSize), datatype check(hasDataType), etc. for the columns we want to test.

We have to import Deequ’s verification suite and pass our data to that suite. Only then, we can specify the checks that we want to test on our data.

import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame

import com.amazon.deequ.checks.{Check, CheckLevel}

import com.amazon.deequ.constraints.ConstrainableDataTypes

import com.amazon.deequ.{VerificationResult, VerificationSuite}//Constraint verification

val verificationResult: VerificationResult = {

VerificationSuite()

.onData(data) //our input data to be tested

//data quality checks

.addCheck(

Check(CheckLevel.Error, “Review Check”)

.isUnique(“OrderID”)

.hasSize(_ == 100)

.hasDataType(“UnitPrice”,ConstrainableDataTypes.String)

.hasNumberOfDistinctValues(“OrderDate”,_>=5)

.isNonNegative(“UnitCost”))

.run()

}

On successful execution, it displays the below result. It will show each check status and provide a message if any constraint fails.

Using the below code, we can convert the check results to a data frame.

//Converting Check results to a DataFrame

val verificationResultDf = checkResultsAsDataFrame(spark, verificationResult)

verificationResultDf.show(false)

Constraint Verification

Constraint Suggestion

Deequ can provide possible data quality constraint checks to be tested on your data. For this, we need to import ConstraintSuggestionRunner and pass our data to it.

import com.amazon.deequ.checks.{Check, CheckLevel}

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}//Constraint Suggestion

val suggestionResult = {

ConstraintSuggestionRunner()

.onData(data)

.addConstraintRules(Rules.DEFAULT)

.run()

}

We can now investigate the constraints that Deequ has suggested.

import spark.implicits._

val suggestionDataFrame = suggestionResult.constraintSuggestions.flatMap {

case (column, suggestions) =>

suggestions.map { constraint =>

(column, constraint.description, constraint.codeForConstraint)

}

}.toSeq.toDS()

On successful execution, we can see the result as shown below. It provides automated suggestions on your data.

suggestionDataFrame.show(4)

Constraint Suggestion

You can also pass these Deequ suggested constraints to VerificationSuite to perform all the checks provided by SuggestionRunner. This is illustrated in the following code.

val allConstraints = suggestionResult.constraintSuggestions

.flatMap { case (_, suggestions) => suggestions.map { _.constraint }}

.toSeq

val generatedCheck = Check(CheckLevel.Error, “generated constraints”, allConstraints)//passing the generated checks to verificationSuite

val verificationResult = VerificationSuite()

.onData(data)

.addChecks(Seq(generatedCheck))

.run()

val resultDf = checkResultsAsDataFrame(spark, verificationResult)

Running the above code will check all the constraints that Deequ suggested and provide the status of each constraint check, as shown below.

resultDf.show(4)

Metrics Computation

You can compute metrics using Deequ. For this, you need to import AnalyzerContext.

import com.amazon.deequ.analyzers._

import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame

import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}//Metrics Computation

val analysisResult: AnalyzerContext = {

AnalysisRunner

// data to run the analysis on

.onData(data)

// define analyzers that compute metrics .addAnalyzers(Seq(Size(),Completeness(“UnitPrice”),ApproxCountDistinct(“Country”),DataType(“ShipDate”),Maximum(“TotalRevenue”)))

.run()

}

val metricsDf = successMetricsAsDataFrame(spark, analysisResult)

Once the run is successful you can see results as below.

metricsDf.show(false)

Metrics Computation

You can also store the metrics that you computed on your data. For this, you can use Deequ’s metrics repository. To know more about this repository, click here.

Conclusion

Overall, Deequ has many advantages. We can calculate data metrics, define, and verify data quality constraints. Even large datasets that consist of billions of rows of data can be easily verified using Deequ. Apart from Data quality checks, Deequ also provides Anamoly Detection and Incremental metrics computation.

The post Automating Data Quality: Using Deequ with Apache Spark appeared first on Tiger Analytics.

]]>
Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms https://www.tigeranalytics.com/perspectives/blog/spark-snowflake-connector-in-depth-analysis-of-internal-mechanisms/ Thu, 25 Jun 2020 10:50:56 +0000 https://www.tigeranalytics.com/blog/spark-snowflake-connector-in-depth-analysis-of-internal-mechanisms/ Examine the internal workings of the Spark-Snowflake Connector with a clear breakdown of how the connector integrates Apache Spark with Snowflake for enhanced data processing capabilities. Gain insights into its architecture, key components, and techniques for seamlessly optimizing performance during large-scale data operations.

The post Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms appeared first on Tiger Analytics.

]]>
Introduction

The Snowflake Connector for Spark enables the use of Snowflake as a Spark data source – similar to other data sources like PostgreSQL, HDFS, S3, etc. Though most data engineers use Snowflake, what happens internally is a mystery to many. But only if one understands the underlying architecture and its functioning, can they figure out how to fine-tune their performance and troubleshoot issues that might arise. This blog thus aims to explaining in detail the internal architecture and functioning of the Snowflake Connector.

Before getting into the details, let us understand what happens when one does not use the Spark-Snowflake Connector.

Data Loading to Snowflake without Spark- Snowflake Connector

Create Staging Area -> Load local files -> Staging area in cloud -> Create file format -> Load to Snowflake from staging area using the respective file format

Loading Data from Local to Snowflake

Data Loading using Spark-Snowflake Connector

When we use the Spark Snowflake connector to load the data into Snowflake, it does a lot of things that are abstract to us. The connector takes care of all the heavy lifting tasks.

Spark Snowflake Connector

Spark Snowflake Connector (Source:https://docs.snowflake.net/manuals/user-guide/spark-connector-overview.html#interaction-between-snowflake-and-spark)

This blog illustrates one such example where the Spark-Snowflake Connector is used to read and write data in Databricks. Databricks has integrated the Snowflake Connector for Spark into the Databricks Unified Analytics Platform to provide native connectivity between Spark and Snowflake.

The Snowflake Spark Connector supports Internal (temp location managed by Snowflake automatically) and External (temp location for data transfer managed by user) transfer modes. Here is a brief description of the two modes of transfer-

Internal Data Transfer

This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

External Data Transfer

External data transfer, on the other hand, is facilitated through a storage location that the user specifies. The storage location must be created and configured as part of the Spark connector installation/configuration.

Further, the files created by the connector during external transfer are intended to be temporary, but the connector does not automatically delete these files from the storage location. This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

Use Cases

Below are the use cases we are going to run on Spark and see how the Spark Snowflake connector works internally-

1. Initial Loading from Spark to Snowflake

2. Loading the same Snowflake table in Overwrite mode

3. Loading the same Snowflake table in Append mode

4. Read the Snowflake table

Snowflake Connection Parameters

Snowflake Connection Parameters

1. Initial Loading from Spark to Snowflake

When a new table is loaded for the very first time from Spark to Snowflake, the following command will be running on Spark. This command, in turn, starts to execute a set of SQL queries in Snowflake using the connector.

spark snowflake overwrite mode

The single Spark command above triggers the following 9 SQL queries in Snowflake

Snowflake Initial Load Query History

Snowflake Initial Load Query History

i) Spark, by default, uses the local time zone. This can be changed by using the sfTimezone option in the connector

ii) The below query creates a temporary internal stage in Snowflake. We can use other cloud providers that we can configure in Spark.

iii) The GET command downloads data files from one of the following Snowflake stages to a local directory/folder on a client machine. We have metadata checks at this stage.

iv) The PUT command uploads (i.e., stages) data files from a local directory/folder onto a client machine to one of the Snowflake strategies

v) The DESC command failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It won’t throw any error in the Spark Job

vi) The IDENTIFIER keyword is used to identify objects by name, using string literals, session variables, or bind variables.

vii) The command below loads data into Snowflake’s temporary table to maintain consistency. By doing so, Spark follows Write All or Write Nothing architecture.

viii) The DESC command below failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It didn’t stop the process. The metadata check at this point is to see whether to use RENAME TO or SWAP WITH

ix) The RENAME TO command renames the specified table with a new identifier that is not currently used by any other table in the schema. If the table is already present, we have to use SWAP WITH and then drop the identifier

2. Loading the same Snowflake table in Overwrite mode

The above Spark command triggers the following 10 SQL queries in Snowflake. This time there is no failure when we ran the overwrite command a second time because this time, the table already exists.

i) We have metadata checks at the internal stages

ii) The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. It also swaps all access control privilege grants. The two tables are essentially renamed in a single transaction.

The RENAME TO command is used when the table is not present because it is faster than renaming and dropping the intermediate table. But this can only be used when the table does not exist in Snowflake. This means that RENAME TO is only performed during the Initial Load.

iii) The DROP command drops the intermediate staging table

3. Loading the same Spark table in Append mode

The above Spark command triggers the following 7 SQL queries in Snowflake.

Note: When we use OVERWRITE mode, the data is copied into the intermediate staged table, but during APPEND, the data is loaded into the actual table in Snowflake.

i) In order to maintain the ACID compliance, this mode uses all the transactions inside the BEGIN and COMMIT. If anything goes wrong, it uses ROLLBACK so that the previous state of the table is untouched.

4. Reading the Snowflake Table

The above Spark command triggers the following SQL query in Snowflake. The reason for this is that Spark follows the Lazy Execution pattern. So until an action is performed, it will not read the actual data. Spark internally maintains lineage to process it effectively. The following query is to check whether the table is present or not and to retrieve only the schema of the table.

The Spark action below triggers 5 SQL queries in Snowflake

i) First, it creates a temporary internal stage to load the read data from Snowflake.

ii) Next, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

iii) The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ.

iv) The data is then copied from Snowflake to the internal stage.

v) Finally, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Wrapping Up

Spark Snowflake connector comes with lots of benefits like query pushdown, column mapping, etc. This acts as an abstract layer and does a lot of groundwork in the back end.

Happy Learning!!

The post Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms appeared first on Tiger Analytics.

]]>