Big Data Archives - Tiger Analytics

Unlocking the Potential of Modern Data Lakes: Trends in Data Democratization, Self-Service, and Platform Observability
https://www.tigeranalytics.com/perspectives/blog/unlocking-the-potential-of-modern-data-lakes-trends-in-data-democratization-self-service-and-platform-observability/
Wed, 22 Jun 2022

Learn how self-service management, intelligent data catalogs, and robust observability are transforming data democratization. Walk through the crucial steps and cutting-edge solutions driving modern data platforms towards greater adoption and democratization.


Data Lake solutions emerged out of technology innovations such as Big Data, but it is the Cloud that has propelled them even further. The prevalence of the Data Lake can be attributed to its ability to retrieve data faster than Data Warehouses, eliminate a significant amount of modeling effort, unlock advanced analytics capabilities for an enterprise, and provide the storage and compute scalability to handle different kinds of workloads and enable data-driven decisions.

Data Democratization is one of the key outcomes sought from data platforms today. Bringing reliable and trusted data in a self-service manner to end-users such as data analysts, data scientists, and business users is the top priority of any data platform. This blog discusses the key trends we see with our clients and across the industry that help data lakes cater to wider audiences and increase their adoption among consumers.

Business value increases over time, along with a significant reduction in TCO.

Self-Service Management of Modern Data Lake

“…self-service is basically enabling all types of users (IT or business), to easily manage and govern the data on the lake themselves – in a low/no-code manner”

Building robust, scalable data pipelines is the first step in leveraging data to its full potential. However, it is the incorporation of automation and self-service capabilities that really helps achieve that goal. It also democratizes the data, platform, and analytics capabilities for all types of users and significantly reduces the burden on IT teams, so they can focus on high-value tasks.

Building Self-Service Capabilities

Self-serving capabilities are built on top of robust, scalable data pipelines. Any data lake implementation will involve building various reusable frameworks and components for acquiring data from the storage systems (components that can understand and infer the schema, check data quality, implement certain functionalities when bringing in or transforming the data), and loading it into the target zone.

Data pipelines are built using these reusable components and frameworks. These pipelines ingest, wrangle, transform, and egress the data. Stopping at this point would rob an organization of the opportunity to leverage the data to its full potential.

To maximize the result, APIs (following a microservices architecture) for data and platform management are created to perform CRUD operations and to support monitoring. They can also be used to schedule and trigger pipelines, discover and manage datasets, and handle cluster management, security, and user management. Once the APIs are in place, you can build a web UI that orchestrates all these operations and helps any user navigate the platform, bring in data, transform it, send it out, or manage the pipelines.
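
For illustration, a minimal sketch of what such a self-service management API could look like is shown below. The FastAPI framework, the endpoint names, and the in-memory registries are assumptions made for this example, not a description of any specific platform; a real implementation would sit on top of a metadata store and an orchestration engine.

from fastapi import FastAPI, HTTPException

app = FastAPI(title="Self-Service Data Lake API")

# Hypothetical in-memory registries; a real platform would back these with
# a metadata store and an orchestrator.
DATASETS = {"sales_orders": {"zone": "curated", "owner": "finance"}}
PIPELINES = {"ingest_sales": {"status": "idle"}}

@app.get("/datasets")
def list_datasets():
    # Dataset discovery for analysts, data scientists, and business users
    return DATASETS

@app.post("/pipelines/{name}/run")
def trigger_pipeline(name: str):
    # Low/no-code pipeline trigger; the actual orchestrator call is stubbed out
    if name not in PIPELINES:
        raise HTTPException(status_code=404, detail="unknown pipeline")
    PIPELINES[name]["status"] = "running"
    return {"pipeline": name, "status": "triggered"}

A web UI or a conversational assistant can then call these endpoints on behalf of the user.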


Tiger has also taken self-servicing on Data Lake to another level by building a virtual assistant that interacts with the user to perform the above-mentioned tasks.

Data Catalog Solution

Another common trend we see in modern data lakes is the increasing adoption of next-gen Data Catalog Solutions. A Data Catalog Solution comes in handy when we are dealing with huge volumes of data and multiple datasets. It can extract and understand technical metadata from different datasets, link them together, understand their health, reliability, and usage patterns, and help any consumer, whether a data scientist, a data analyst, or a business analyst, with insight generation.

Data Catalogs have been around for quite some time, but now they are becoming more information intelligent. It is no longer just about bringing in the technical metadata. 

Data Catalog Implementation

Some of the vital parts of building a data catalog are knowledge graphs and powerful search technologies. A knowledge graph solution can bring together information about a dataset such as its schema, data quality, profiling statistics, PII, and classification. It can also figure out who the owner of a particular dataset is, who the users consuming the dataset are (derived from various logs), and which department each person belongs to.

This knowledge graph can be used to carry out search and filter operations, graph queries, recommendations, and visual explorations.
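
As a simple illustration of the idea, the sketch below builds a tiny catalog-style knowledge graph with networkx; the datasets, owners, and attributes are made up, and a production catalog would typically use a graph database plus a search index instead.

import networkx as nx

g = nx.DiGraph()

# Nodes: datasets and users, with hypothetical metadata attributes
g.add_node("sales_orders", type="dataset", pii=False, quality_score=0.97)
g.add_node("customer_profile", type="dataset", pii=True, quality_score=0.91)
g.add_node("alice", type="user", department="finance")

# Edges: ownership, consumption, and lineage relationships
g.add_edge("alice", "sales_orders", relation="owns")
g.add_edge("alice", "customer_profile", relation="consumes")
g.add_edge("customer_profile", "sales_orders", relation="derived_from")

# Catalog-style queries: what does alice own, and which datasets hold PII?
owned_by_alice = [d for d in g.successors("alice")
                  if g.edges["alice", d]["relation"] == "owns"]
pii_datasets = [n for n, attrs in g.nodes(data=True)
                if attrs.get("type") == "dataset" and attrs.get("pii")]

print(owned_by_alice)  # ['sales_orders']
print(pii_datasets)    # ['customer_profile']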


Data and Platform Observability

Tiger looks at Observability in three different stages:

1. Basic Data Health Monitoring
2. Advanced Data Health Monitoring with Predictions
3. Extending to Platform Observability

Basic Data Health Monitoring

Identifying critical data elements (CDE) and monitoring them is the most basic aspect of data health monitoring. We configure rule-based data checks against these CDEs and capture the results on a periodic basis and provide visibility through dashboards. These issues are also tracked through ticketing systems and then fixed in source as much as possible. This process constitutes the first stage in ensuring Data Observability.

The key capabilities required to achieve this level of maturity include rule-based data quality checks on the CDEs, scheduled execution of those checks, dashboards for visibility, and integration with ticketing workflows to track and resolve issues at the source.
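
A minimal PySpark sketch of one such rule-based check on critical data elements might look like the following; the table name, columns, and thresholds are hypothetical and only illustrate the pattern.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cde-health-check").getOrCreate()
orders = spark.table("curated.orders")  # hypothetical dataset holding the CDEs

total = orders.count()
checks = {
    # Example CDE rules; completeness and validity thresholds are illustrative
    "order_id_not_null": orders.filter(F.col("order_id").isNull()).count() == 0,
    "amount_non_negative": orders.filter(F.col("amount") < 0).count() == 0,
    "customer_id_99pct_complete":
        orders.filter(F.col("customer_id").isNotNull()).count() >= 0.99 * total,
}

for rule, passed in checks.items():
    # In practice, results would be written to a dashboard and a ticketing system
    print(f"{rule}: {'PASS' if passed else 'FAIL'}")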

Advanced Data Health Monitoring with Predictions

Most of the enterprise clients we work with have reached the Basic Data Health Monitoring stage and are looking to progress further. The observability ecosystem then needs to be enhanced with capabilities that help move from a reactive response to a proactive one, with Artificial Intelligence and Machine Learning increasingly leveraged to this end. Some of the key capabilities include measuring data drift and schema drift, classifying incoming information automatically with AI/ML, detecting PII automatically and processing it appropriately, and assigning security entitlements automatically based on similar elements in the data platform. These capabilities elevate data health to the next level while also giving early warnings when data patterns start to change.
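
As one concrete example of such a capability, drift on a numeric column can be approximated with a Population Stability Index (PSI) computed between a baseline sample and the latest batch. The sketch below uses numpy and made-up data; the 0.2 threshold is only a common rule of thumb, not a universal setting.

import numpy as np

def population_stability_index(baseline, current, bins=10):
    # Rough PSI between two samples of a numeric column
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the percentages to avoid division by zero and log(0)
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

baseline = np.random.normal(100, 10, 10_000)  # e.g., last month's order amounts (made up)
current = np.random.normal(110, 12, 10_000)   # e.g., this week's batch (made up)

psi = population_stability_index(baseline, current)
print(f"PSI = {psi:.3f}, drift alert: {psi > 0.2}")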

Extending to Platform Observability

The end goal of Observability Solutions is to deliver reliable data to consumers in a timely fashion. This goal can be achieved only when we move beyond data observability into the actual platform that delivers the data itself. This platform has to be modern and state of the art so that it can deliver the data in a timely manner while also allowing engineers and administrators to understand and debug if things are not going well. Following are some of the key capabilities that we need to think about to improve the platform-level observability.

  • Monitoring Data Flows & Environment: Ability to monitor job performance degradation, server health, and historical resource utilization trends in real-time
  • Monitor Performance: Understanding how data flows from one system to another and looking for bottlenecks in a visual manner would be very helpful in complex data processing environments
  • Monitor Data Security: Query logs, access patterns, security tools, etc need to be monitored in order to ensure there is no misuse of data
  • Analyze Workloads: Automatically detecting issues and constraints in large data workloads that make them slow and building tools for Root Cause Analysis
  • Predict Issues, Delays, and Find Resolutions: Comparing historical performance to current operational efficiency, in terms of speed and resource usage, to predict issues and offer solutions
  • Optimize Data Delivery: Building tools into the system that continuously adjust resource allocation based on data-volume spike predictions and thus optimizing TCO

Conclusion

The Modern Data Lake environment is driven by the value of data democratization. It is important to make data management and insight gathering accessible to end-users of all kinds. Self-service, aided by intelligent Data Catalogs, is the most promising route to effective data democratization. Enabling trust in the data for data consumers is equally important. The capabilities discussed, such as Data and Platform Observability, give users real under-the-hood control over onboarding, processing, and delivering data to different consumers. Companies striving to create end-to-end observability solutions and enable data-driven decisions today are building exactly the solutions that will take data platforms to the next level of adoption and democratization.

We spoke more on this at the DES22 event. To catch our full talk, click here.

From Awareness to Action: Private Equity’s Quest for Data-Driven Growth
https://www.tigeranalytics.com/perspectives/blog/private-equity-firms-facing-data-analytics-paradox/
Thu, 02 Dec 2021

Data analytics is crucial for Private Equity (PE) firms to navigate a diverse client portfolio and complex data. Despite challenges such as data overflow and outdated strategies, a data-driven approach enables better decision-making, transparent valuation, and optimized investment opportunities, ensuring competitiveness in a dynamic market.

Data has become the lifeblood of many industries as they unlock the immense potential to make smarter decisions. From retail and insurance to manufacturing and healthcare, companies are leveraging the power of big data and analytics to personalize and scale their products and services while unearthing new market opportunities. However, when the volume of data is high and the touchpoints are unsynchronized, it becomes difficult to transform raw information into insightful business intelligence. Through this blog series, we will take an in-depth look at why data analytics continues to be an elusive growth strategy for Private Equity firms and how this can be changed.

State of the Private Equity (PE) industry

For starters, Private Equity (PE) firms have to work twice as hard to make sense of their data before turning it into actionable insights. This is because their client portfolios are often diverse, as is the data, which is spread across different industries and geographies and limits the reusability of frameworks and processes. Furthermore, each client may have its own unique reporting format, which leads to information overflow.

Other data analytics-related challenges that PE firms have to overcome include:

  • No reliable sources and poor understanding of non-traditional data
  • Archaic and ineffective data management strategy
  • Inability to make optimal use of various data assets
  • Absence of analytics-focused functions, resources, and tools

These challenges offer a clear indication of why the adoption of data analytics in the PE industry has been low compared to other industries. According to a recent study conducted by KPMG, only a few PE firms are currently exploring big data and analytics as a viable strategy, with “70% of surveyed firms still in the awareness-raising stage.”

Why PE firms need to incubate a data-first mindset

So, considering these herculean challenges, why is a data analytics strategy the need of the hour for Private Equity firms? After all, according to Gartner, “By 2025, more than 75% of venture capital (VC) and early-stage investor executive reviews will be informed using artificial intelligence (AI) and data analytics.”

First, it’s important to understand that as technology adoption continues to accelerate, a tremendous amount of relevant data is generated and gathered around the clock. Without leveraging this data to unearth correlations and trends, PE firms can only rely on half-truths and gut instinct. For instance, such outdated strategies can mislead firms regarding where their portfolio companies can reduce operating costs. Hence, the lack of a data analytics strategy means they can no longer remain competitive in today’s dynamic investment world.

Plus, stakeholders expect more transparency and visibility into the valuation processes. So, Private Equity firms are already under pressure to break down innovation barriers and enable seamless access and utilization of their data assets to build a decision-making culture based on actionable insights. They can also proactively identify good investment opportunities, which can significantly help grow revenue while optimizing the bandwidth of their teams by focusing on the right opportunities.

Some of the other benefits for PE firms are:

  • Enriched company valuation models
  • Enhanced portfolio monitoring
  • Reduced dependency on financial data
  • Pipeline monitoring and timely access for key event triggers
  • Stronger due diligence processes

Final thoughts

The emergence of data analytics as a game-changer for Private Equity firms has caused some to adopt piecemeal solutions, hoping to reap the low-hanging fruit. However, this could prove to be hugely ineffective because it would further decentralize the availability of data, which has been this industry’s biggest problem in the first place.

In reality, the key is for Private Equity firms to rethink how they collect data and what they can do with it – from the ground up. There’s no doubt that only by building a data-led master strategy can they make a difference in how they make key investment decisions and successfully navigate a hyper-competitive landscape.

We hope that we helped you understand the current data challenges Private Equity firms face while adopting a data analytics strategy and why it’s still a competitive differentiator. Stay tuned for the next blog in the series, in which we will shed light on how Private Equity firms can overcome these challenges.

Automating Data Quality: Using Deequ with Apache Spark
https://www.tigeranalytics.com/perspectives/blog/automating-data-quality-using-deequ-with-apache-spark/
Thu, 03 Sep 2020

Get to know how to automate data quality checks using Deequ with Apache Spark. Discover the benefits of integrating Deequ for data validation and the steps involved in setting up automated quality checks for improving data reliability in large-scale data processing environments.

Introduction

When dealing with data, the main factor to be considered is quality. Especially in a Big Data environment, data quality is critical: inaccurate or flawed data will produce incorrect results in data analytics. Many developers test the data manually before training their model with the available data. This is time-consuming and error-prone.

Deequ

Deequ is an open-source framework for testing the data quality. It is built on top of Apache Spark and is designed to scale up to large data sets. Deequ is developed and used at Amazon for verifying the quality of many large production datasets. The system computes data quality metrics regularly, verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success.

Deequ allows us to calculate data quality metrics on our data set, and to define and verify data quality constraints. There are three significant components in Deequ: Constraint Suggestion, Constraint Verification, and Metrics Computation. This is depicted below.


Deequ Components

This blog provides a detailed explanation of these three components with the help of practical examples.

1. Constraint Verification: You can provide your own set of data quality constraints which you want to verify on your data. Deequ checks all the provided constraints and gives you the status of each check.

2. Constraint Suggestion: If you are not sure of what constraints to test on your data, you can use Deequ’s constraint suggestion. Constraint Suggestion provides a list of possible constraints you can test on your data. You can use these suggested constraints and pass it to Deequ’s Constraint Verification to check your data quality.

3. Metrics Computation: You can also compute data quality metrics regularly.

Now, let us implement these with some sample data.

For this example, we have downloaded a sample csv file with 100 records. The code is run using Intellij IDE (you can also use Spark Shell).

Add Deequ library

You can add the below dependency in your pom.xml

<dependency>
    <groupId>com.amazon.deequ</groupId>
    <artifactId>deequ</artifactId>
    <version>1.0.2</version>
</dependency>

If you are using Spark Shell, you can download the Deequ jar as shown below:

wget https://repo1.maven.org/maven2/com/amazon/deequ/deequ/1.0.2/deequ-1.0.2.jar

Now, let us start the Spark session and load the csv file to a dataframe.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("deequ-Tests").getOrCreate()

val data = spark.read.format("csv")
  .option("header", true)
  .option("delimiter", ",")
  .option("inferschema", true)
  .load("C:/Users/Downloads/100SalesRecords.csv")

data.show(5)

The data has now been loaded into a data frame.

data.printSchema


Note

Ensure that there are no spaces in the column names. Deequ will throw an error if there are spaces in the column names.

Constraint Verification

Let us verify our data by defining a set of data quality constraints.

Here, we have given a duplicate check (isUnique), a count check (hasSize), a datatype check (hasDataType), etc. for the columns we want to test.

We have to import Deequ’s verification suite and pass our data to that suite. Only then can we specify the checks that we want to run on our data.

import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.ConstrainableDataTypes
import com.amazon.deequ.{VerificationResult, VerificationSuite}

// Constraint verification
val verificationResult: VerificationResult = {
  VerificationSuite()
    .onData(data) // our input data to be tested
    // data quality checks
    .addCheck(
      Check(CheckLevel.Error, "Review Check")
        .isUnique("OrderID")
        .hasSize(_ == 100)
        .hasDataType("UnitPrice", ConstrainableDataTypes.String)
        .hasNumberOfDistinctValues("OrderDate", _ >= 5)
        .isNonNegative("UnitCost"))
    .run()
}

On successful execution, it displays the result of each check and provides a message if any constraint fails.

Using the below code, we can convert the check results to a data frame.

// Converting check results to a DataFrame
val verificationResultDf = checkResultsAsDataFrame(spark, verificationResult)
verificationResultDf.show(false)


Constraint Suggestion

Deequ can provide possible data quality constraint checks to be tested on your data. For this, we need to import ConstraintSuggestionRunner and pass our data to it.

import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// Constraint suggestion
val suggestionResult = {
  ConstraintSuggestionRunner()
    .onData(data)
    .addConstraintRules(Rules.DEFAULT)
    .run()
}

We can now investigate the constraints that Deequ has suggested.

import spark.implicits._

val suggestionDataFrame = suggestionResult.constraintSuggestions.flatMap {
  case (column, suggestions) =>
    suggestions.map { constraint =>
      (column, constraint.description, constraint.codeForConstraint)
    }
}.toSeq.toDS()

On successful execution, we can see the result as shown below. It provides automated suggestions on your data.

suggestionDataFrame.show(4)


You can also pass these Deequ suggested constraints to VerificationSuite to perform all the checks provided by SuggestionRunner. This is illustrated in the following code.

val allConstraints = suggestionResult.constraintSuggestions
  .flatMap { case (_, suggestions) => suggestions.map { _.constraint } }
  .toSeq

val generatedCheck = Check(CheckLevel.Error, "generated constraints", allConstraints)

// Passing the generated checks to VerificationSuite
val verificationResult = VerificationSuite()
  .onData(data)
  .addChecks(Seq(generatedCheck))
  .run()

val resultDf = checkResultsAsDataFrame(spark, verificationResult)

Running the above code will check all the constraints that Deequ suggested and provide the status of each constraint check, as shown below.

resultDf.show(4)

Metrics Computation

You can compute metrics using Deequ. For this, you need to import AnalyzerContext.

import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

// Metrics computation
val analysisResult: AnalyzerContext = {
  AnalysisRunner
    // data to run the analysis on
    .onData(data)
    // define analyzers that compute metrics
    .addAnalyzers(Seq(
      Size(),
      Completeness("UnitPrice"),
      ApproxCountDistinct("Country"),
      DataType("ShipDate"),
      Maximum("TotalRevenue")))
    .run()
}

val metricsDf = successMetricsAsDataFrame(spark, analysisResult)

Once the run is successful, you can see the results as shown below.

metricsDf.show(false)


You can also store the metrics that you computed on your data. For this, you can use Deequ’s metrics repository. To know more about the repository, refer to the Deequ documentation.

Conclusion

Overall, Deequ has many advantages. We can calculate data metrics, and define and verify data quality constraints. Even large datasets that consist of billions of rows can be verified easily using Deequ. Apart from data quality checks, Deequ also provides Anomaly Detection and incremental metrics computation.

Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms
https://www.tigeranalytics.com/perspectives/blog/spark-snowflake-connector-in-depth-analysis-of-internal-mechanisms/
Thu, 25 Jun 2020

Examine the internal workings of the Spark-Snowflake Connector with a clear breakdown of how the connector integrates Apache Spark with Snowflake for enhanced data processing capabilities. Gain insights into its architecture, key components, and techniques for seamlessly optimizing performance during large-scale data operations.

Introduction

The Snowflake Connector for Spark enables the use of Snowflake as a Spark data source, similar to other data sources like PostgreSQL, HDFS, S3, etc. Though most data engineers use Snowflake, what happens internally is a mystery to many. Yet only by understanding the underlying architecture and how it functions can one fine-tune performance and troubleshoot the issues that might arise. This blog thus aims to explain in detail the internal architecture and functioning of the Snowflake Connector.

Before getting into the details, let us understand what happens when one does not use the Spark-Snowflake Connector.

Data Loading to Snowflake without the Spark-Snowflake Connector

Create Staging Area -> Load local files -> Staging area in cloud -> Create file format -> Load to Snowflake from staging area using the respective file format

Loading Data from Local to Snowflake

Data Loading using Spark-Snowflake Connector

When we use the Spark-Snowflake connector to load data into Snowflake, it does a lot of work that is abstracted away from us. The connector takes care of all the heavy lifting.


Spark Snowflake Connector (Source:https://docs.snowflake.net/manuals/user-guide/spark-connector-overview.html#interaction-between-snowflake-and-spark)

This blog illustrates one such example where the Spark-Snowflake Connector is used to read and write data in Databricks. Databricks has integrated the Snowflake Connector for Spark into the Databricks Unified Analytics Platform to provide native connectivity between Spark and Snowflake.

The Snowflake Spark Connector supports Internal (temp location managed by Snowflake automatically) and External (temp location for data transfer managed by user) transfer modes. Here is a brief description of the two modes of transfer-

Internal Data Transfer

This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

External Data Transfer

External data transfer, on the other hand, is facilitated through a storage location that the user specifies. The storage location must be created and configured as part of the Spark connector installation/configuration.

Further, the files created by the connector during an external transfer are intended to be temporary, but the connector does not automatically delete these files from the storage location.

Use Cases

Below are the use cases we are going to run on Spark and see how the Spark Snowflake connector works internally-

1. Initial Loading from Spark to Snowflake

2. Loading the same Snowflake table in Overwrite mode

3. Loading the same Snowflake table in Append mode

4. Read the Snowflake table

Snowflake Connection Parameters

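For illustration, a typical set of connector options in PySpark might look like the sketch below; the values are placeholders, and the option keys are the standard Spark-Snowflake connector options.

# Placeholder values; replace with your own account details
sfOptions = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"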

1. Initial Loading from Spark to Snowflake

When a new table is loaded for the very first time from Spark to Snowflake, the following command is run on Spark. This command, in turn, executes a set of SQL queries in Snowflake through the connector.

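A rough PySpark sketch of such an initial load in overwrite mode is shown below; the dataframe, the table name, and the sfOptions dictionary from the previous section are placeholders.

# df is an existing Spark dataframe to be written to Snowflake
df.write \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "SALES_RECORDS") \
    .mode("overwrite") \
    .save()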

The single Spark command above triggers the following 9 SQL queries in Snowflake


Snowflake Initial Load Query History

i) Spark, by default, uses the local time zone. This can be changed by using the sfTimezone option in the connector

ii) The below query creates a temporary internal stage in Snowflake. We can use other cloud providers that we can configure in Spark.

iii) The GET command downloads data files from a Snowflake stage to a local directory/folder on a client machine. We have metadata checks at this stage.

iv) The PUT command uploads (i.e., stages) data files from a local directory/folder on a client machine to one of the Snowflake stages

v) The DESC command failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It won’t throw any error in the Spark Job

vi) The IDENTIFIER keyword is used to identify objects by name, using string literals, session variables, or bind variables.

vii) The command below loads data into Snowflake’s temporary table to maintain consistency. By doing so, Spark follows Write All or Write Nothing architecture.

viii) The DESC command below failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It didn’t stop the process. The metadata check at this point is to see whether to use RENAME TO or SWAP WITH

ix) The RENAME TO command renames the specified table with a new identifier that is not currently used by any other table in the schema. If the table is already present, we have to use SWAP WITH and then drop the identifier

2. Loading the same Snowflake table in Overwrite mode

Running the same overwrite command a second time triggers the following 10 SQL queries in Snowflake. This time there is no failure because the table already exists.

i) We have metadata checks at the internal stages

ii) The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. It also swaps all access control privilege grants. The two tables are essentially renamed in a single transaction.

The RENAME TO command is used when the table is not present, because a simple rename is faster than swapping and then dropping the intermediate table. But this can only be used when the table does not exist in Snowflake, which means that RENAME TO is only performed during the initial load.

iii) The DROP command drops the intermediate staging table

3. Loading the same Snowflake table in Append mode
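
A rough PySpark sketch of the append write, with the same placeholder table and options as before:

df.write \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "SALES_RECORDS") \
    .mode("append") \
    .save()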

Writing to the same table in append mode triggers the following 7 SQL queries in Snowflake.

Note: When we use OVERWRITE mode, the data is copied into the intermediate staged table, but during APPEND, the data is loaded into the actual table in Snowflake.

i) To maintain ACID compliance, this mode wraps all the transactions inside BEGIN and COMMIT. If anything goes wrong, it issues a ROLLBACK so that the previous state of the table is left untouched.

4. Reading the Snowflake Table
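
A sketch of the read path in PySpark, again with placeholder names; note that the load itself is lazy and only an action forces the data transfer.

snow_df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "SALES_RECORDS") \
    .load()      # lazy: only a schema check runs in Snowflake at this point

snow_df.show(5)  # action: this is what triggers the actual data transfer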

Creating the dataframe triggers only one SQL query in Snowflake at this point. The reason is that Spark follows the lazy execution pattern, so until an action is performed, it will not read the actual data; Spark internally maintains lineage to process it effectively. This first query only checks whether the table is present and retrieves the schema of the table.

Triggering an action on the dataframe (for example, calling show()) fires 5 SQL queries in Snowflake.

i) First, it creates a temporary internal stage to load the read data from Snowflake.

ii) Next, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

iii) The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ.

iv) The data is then copied from Snowflake to the internal stage.

v) Finally, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Wrapping Up

The Spark-Snowflake connector comes with lots of benefits, like query pushdown, column mapping, etc. It acts as an abstraction layer and does a lot of the groundwork in the back end.

Happy Learning!!

Koalas Library: Integrating Pandas with PySpark for Data Handling
https://www.tigeranalytics.com/perspectives/blog/koalas-library-integrating-pandas-with-pyspark-for-data-handling/
Tue, 25 Feb 2020

Get an introduction to Koalas, a tool that bridges the gap between Pandas and PySpark, and see how it allows for seamless data processing and analysis. Learn about Koalas’ features and how they simplify working with big data in a familiar Pandas-like interface.

While working with small datasets, Pandas is typically the best option, but when it comes to larger ones, Pandas doesn’t suffice as it loads all the data into a single machine for processing. With large datasets, you will need the power of distributed computation. In such a situation, one of the ideal options available is PySpark – but this comes with a catch. PySpark syntax is complicated to both learn and to use.

This is where the Koalas package introduced within the Databricks Open Source environment has turned out to be a game-changer. It helps those who want to make use of distributed Spark computation capabilities without having to resort to PySpark APIs.

This blog discusses in detail about the key features of Koalas, and how you can optimize it using Apache Arrow to suit your requirements.

Introducing Koalas

Simply put, Koalas is a Python package that provides a Pandas-like API but performs the computation with Spark.

Features:

1. Koalas is lazy-evaluated like Spark, i.e., it executes only when triggered by an action.

2. You do not need a separate Spark context/Spark session for processing the Koalas dataframe. Koalas makes use of the existing Spark context/Spark session.

3. It has an SQL API with which you can perform query operations on a Koalas dataframe.

4. By configuring Koalas, you can even toggle computation between Pandas and Spark.

5. Koalas dataframe can be derived from both the Pandas and PySpark dataframes.

Following is a comparison of the syntaxes of Pandas, PySpark, and Koalas:

Versions used:

Pandas -> 0.24.2
Koalas -> 0.26.0
Spark -> 2.4.4
Pyarrow -> 0.13.0

Pandas:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['col4'] = df.col1 * df.col1

Spark

df = spark.read.option("inferSchema", "true").option("comment", True).csv("my_data.csv")
df = df.toDF('col1', 'col2', 'col3')
df = df.withColumn('col4', df.col1 * df.col1)

Now, with Koalas:

import databricks.koalas as ks

df = ks.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['col4'] = df.col1 * df.col1

You can use the same Pandas syntax for working with Koalas and still compute in a distributed environment.

Optimizations that can be performed using Koalas

1. Pandas to Koalas Conversion Optimization Methods:

Three different optimizations for the Pandas-to-Koalas conversion are discussed below. The last two methods make use of Apache Arrow, which is an intermediate columnar storage format, that helps in faster data transfer.

a) Pandas to Koalas conversion (Default method):

We can directly convert from Pandas to Koalas by using the koalas.from_pandas() API.
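
A minimal sketch of the default conversion, assuming a small sample Pandas dataframe (the data itself is made up):

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({'col1': range(1000), 'col2': range(1000)})  # sample Pandas dataframe
kdf = ks.from_pandas(pdf)  # default Pandas-to-Koalas conversion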

b) Pandas to Koalas conversion (With implicit Apache Arrow):

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized, language-independent, columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

To install:

pip install pyarrow

This method of optimization can be done by setting the spark.sql.execution.arrow.enabled to true as shown below:
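
A minimal sketch of this setting, assuming an active Spark session and the same pdf as above:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # let the conversion use Arrow
kdf = ks.from_pandas(pdf)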

c) Pandas to Koalas conversion (With explicit Apache Arrow):

This optimization involves converting Pandas to the Arrow table explicitly and then converting that to Koalas.


Benchmarking Conversion Optimization Techniques

As seen above, using Arrow explicitly in the conversion method is a more optimized way of converting Pandas to Koalas. This method becomes even more helpful if the size of the data keeps increasing.

2. Optimization on Display Limit:

To set the maximum number of rows to be displayed, the option display.max_rows can be set. The default limit is 1000.

ks.set_option("display.max_rows", 2000)

3. Optimization on Computation Limit:

Even the computation limit can be toggled based on the row count, by using compute.shortcut_limit. If the row count is beyond this limit, the computation is done by Spark; if not, the data is sent to the driver, and the computation is done by the Pandas API. The default limit is 1000.

ks.set_option("compute.shortcut_limit", 2000)

4. Using Option context:

With option context, you can set a scope for the options. This is done using the with statement.

with ks.option_context("display.max_rows", 10, "compute.max_rows", 5):
    print(ks.get_option("display.max_rows"))
    print(ks.get_option("compute.max_rows"))

10
5

print(ks.get_option("display.max_rows"))
print(ks.get_option("compute.max_rows"))

1000
1000

There are a few more options available in the Koalas documentation.

5. Resetting options:

You can simply reset the options using reset_option.

ks.reset_option("display.max_rows")
ks.reset_option("compute.max_rows")

Conclusion:

Since Koalas uses Spark behind the scenes, it does come with some limitations; do check those before incorporating it. If you are a Pandas developer whose sole purpose for using Spark is distributed computation, Koalas is highly recommended.

Note: This benchmarking was done using a Databricks machine (6 GB, 0.88 cores).

Unlocking Data Insights: What You Must Know About Apache Kylin
https://www.tigeranalytics.com/perspectives/blog/unlocking-data-insights-what-you-must-know-about-apache-kylin/
Wed, 05 Jun 2019

Get to know the architecture, challenges, and optimization techniques of Apache Kylin, an open-source distributed analytical engine for SQL-based multidimensional analysis (OLAP) on Hadoop. Learn how Kylin pre-calculates OLAP cubes and leverages a scalable computation framework to enhance query performance.

This post is about Kylin, its architecture, and the various challenges and optimization techniques within it. There are many “OLAP in Hadoop” tools available – open source ones include Kylin and Druid and commercial ones include Atscale and Kyvos. I have used Apache Kylin because it is better suited to deal with historical data when compared to Druid.

What is Kylin?

Apache Kylin is an open source distributed analytical engine that provides SQL interface and multidimensional analysis (OLAP) on Hadoop supporting extremely large datasets. It pre-calculates OLAP cubes with a horizontal scalable computation framework (MR, Spark) and stores the cubes into a reliable and scalable datastore (HBase).

Why Kylin?

In many Big Data use cases, the challenge is to return the result of a query within a second, while scanning a database and returning results can take far longer. This is where the concept of ‘OLAP in Hadoop’ emerged, combining the strength of OLAP and Hadoop to give a significant improvement in query latency.

Source: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

How Does it Work?

Below are the steps on how Kylin fetches the data and saves the results:

  • First, it syncs the input source table. In most cases, it reads data from Hive.
  • Next, it runs MapReduce/Spark jobs (based on the engine you select) to pre-calculate and generate each level of cuboids with all possible combinations of dimensions and calculate all the metrics at different levels
  • Finally, it stores cube data in HBase where the dimensions are rowkeys and measures are column families.

Additionally, it leverages ZooKeeper for job coordination.

Kylin Architecture:

Source: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

In Kylin, many cubing algorithms have been released and here are the three types of cubing:

  • By layer
  • By fast cubing = “In-mem”
  • By layer on Spark

On submitting a cubing job, Kylin pre-allocates steps for both “by-layer” and “in-mem”. But it only picks one to execute and the other one will be skipped. By default, the algorithm is “auto” and Kylin selects one of them based on its understanding of the data picked up from Hive.

Source: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

So far, we got a glimpse of how Kylin works. Now let us see the real challenges and how to fix them and also how to optimize the cube building time.

Challenges and Workarounds:

  • In Kylin 2.2, one cannot change the datatype of a measure column. By default, Kylin uses decimal(19,4) for a double-type metric column. The workaround is to change the cube metadata using the “metadata backup” and “restore” commands (https://kylin.apache.org/docs/howto/howto_backup_metadata.html). After taking a backup, find the cube description in the /cube_desc folder, locate your cube, and edit it. Once the changes are done, restart Kylin. Make sure to run the command below and restart Kylin, since Kylin expects that nobody edits the cube signature manually, hence this check: ./bin/metastore.sh refresh-cube-signature
  • In Kylin 2.3.2, when we query ‘select * from tablename’, it displays empty/null values in the metric columns. This is because Kylin only stores the aggregated values and will display them only when you invoke a ‘group by’ clause in the query. But if you still need the result, you can use Kylin’s query push-down feature whenever a query cannot be answered by any cube. Kylin supports pushing such queries down to backup query engines like Hive, SparkSQL, and Impala through JDBC.
  • Sometimes, a cube build fails repeatedly even if you discard and rerun or resume it. The reason is that ZooKeeper may already have a stale Kylin directory; the workaround is to remove the Kylin entry from ZooKeeper, after which the cube builds successfully.

Summary

The key takeaway from this post is that Apache Kylin significantly improves query latency, provided we control unnecessary cuboid combinations using the “Aggregation Group” (AGG) feature Kylin provides. This feature helps reduce both the cube build time and the query time.

Hope this post has given some valuable insights about Apache Kylin. Happy Learning!

References- Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
