Data Insights Archives - Tiger Analytics

What is Data Observability Used For?
https://www.tigeranalytics.com/perspectives/blog/what-is-data-observability-used-for/
Published 27 Sep 2024

Learn how data observability can enhance your business by detecting crucial data anomalies early. Explore its applications in improving data quality and model reliability, and discover Tiger Analytics' solution. Understand why this technology is attracting major investments and how it can improve your operational efficiency and reduce costs.

Imagine you’re managing a department that handles account openings in a bank. All services appear fine, and the infrastructure is running smoothly. But one day, it becomes clear that no new account has been opened in the last 24 hours. On investigation, you find that one of the microservices involved in the account opening process is taking a very long time to respond.

For such a case, a data analyst examining the problem can use traces with triggers based on processing time. But there should be an easier way to spot such anomalies.

Traditional monitoring records the performance of infrastructure and applications. Data observability, in contrast, lets you track your data flows and find faults in them (and may even extend to business processes). While traditional tools analyze infrastructure and applications using metrics, logs, and traces, data observability analyzes the data itself in a broader sense.

So, how do we catch a problem like 24 hours without a single new account before it hurts the business? Before answering that, consider a second scenario.

Imagine a retailer that uses a machine learning model to predict future events, such as sales volumes, from regularly updated historical data. Because the input data is not always of perfect quality, the model can sometimes produce inaccurate forecasts. These inaccuracies lead to either excess inventory for the retailer or, worse, out-of-stock situations when there is real consumer demand.

Classifying and Addressing Unplanned Events

The point of data observability is to identify so-called data downtime: a sudden, unplanned event in your business, infrastructure, or code that leads to an abrupt change in your data. In other words, data observability is the practice of finding anomalies in data.

How can you classify these events?

  • Exceeding a given metric value or an abnormal jump in a given metric. This type is the simplest. Imagine that you add 80-120 clients every day (a range that acts as a rough confidence interval), and one day you add only 20. Something probably caused the sudden drop, and it’s worth looking into (a simple automated check for this is sketched after this list).
  • Abrupt change in data structure. Let’s stay with the client example. Everything was fine until, one day, the contact information field began to receive empty values. Perhaps something has broken in your data pipeline, and it’s better to check.
  • The occurrence of a certain condition or a deviation from it. Just as GPS coordinates should not place a truck in the middle of the ocean, banking transactions should not suddenly appear in unexpected locations or in amounts that deviate significantly from the norm.
  • Statistical anomalies. During a routine check, the bank’s analysts notice that on a particular day the average ATM withdrawal per customer spiked to $500, significantly higher than the historical average.
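To make these categories concrete, here is a minimal, illustrative sketch (not any specific product's implementation) of how the first two kinds of checks could be automated in Python with pandas. The thresholds, column names, and sample data are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical daily snapshot of newly created client records.
daily = pd.DataFrame({
    "date": pd.date_range("2024-09-01", periods=10, freq="D"),
    "new_clients": [95, 110, 102, 88, 97, 120, 104, 91, 99, 20],
    "contact_info": ["ok"] * 9 + [None],  # the last day starts receiving nulls
})

# 1. Abnormal jump/drop in a metric: flag the latest value if it falls outside
#    mean +/- 3 sigma computed on the historical window (all days but the latest).
history = daily["new_clients"].iloc[:-1]
latest = daily["new_clients"].iloc[-1]
mean, std = history.mean(), history.std()
if abs(latest - mean) > 3 * std:
    print(f"Volume anomaly: {latest} new clients vs. expected ~{mean:.0f} +/- {3 * std:.0f}")

# 2. Abrupt change in data structure: alert when the null rate of a required
#    field exceeds a tolerance (here an assumed 1%).
null_rate = daily["contact_info"].isna().mean()
if null_rate > 0.01:
    print(f"Structure anomaly: contact_info null rate is {null_rate:.1%}")
```

In practice, the same checks would run on each new batch of data and feed an alerting channel instead of printing to the console.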

On the one hand, there seems to be nothing new in classifying abnormal events and taking the necessary remedial action. On the other hand, until recently there were no comprehensive, specialized tools for these tasks.

Data Observability is Essential for Ensuring Fresh, Accurate, and Smooth Data Flow

Data observability serves as a checkup for your systems. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.
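As one concrete illustration of a freshness check, the sketch below compares the latest load timestamp of a table against an agreed freshness SLA. The table name, timestamp, and SLA are assumptions for the example, not part of any particular product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical metadata captured by the ingestion pipeline for one table.
table_name = "accounts_opened"
last_loaded_at = datetime(2024, 9, 27, 6, 15, tzinfo=timezone.utc)
freshness_sla = timedelta(hours=4)  # assumed SLA: data no older than 4 hours

lag = datetime.now(timezone.utc) - last_loaded_at
if lag > freshness_sla:
    print(f"Freshness breach: '{table_name}' is {lag} old (SLA {freshness_sla})")
else:
    print(f"Freshness OK for '{table_name}'")
```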

Persona: Business User
WHY questions:
  • WHY are data quality metrics in Amber/Red?
  • WHY is my dataset/report not accurate?
  • WHY do I see a sudden demand for my product, and what is the root cause?
Observability use case: Data Quality, Anomaly Detection, and RCA
Business outcomes:
  • Improve the quality of insights
  • Boost trust and confidence in decision making

Persona: Data Engineers / Data Reliability Engineers
WHY questions:
  • WHY is there data downtime?
  • WHY did the pipeline fail?
  • WHY is there an SLA breach in data freshness?
Observability use case: Data Pipeline Observability, Troubleshooting, and RCA
Business outcomes:
  • Better productivity
  • Faster MTTR
  • Enhanced pipeline efficiency
  • Intelligent triaging

Persona: Data Scientists
WHY questions:
  • WHY are the model predictions not accurate?
Observability use case: Data Quality Model
Business outcomes:
  • Improve model reliability

Tiger Analytics’ Continuous Observability Solution

Tiger Analytics’ solution continuously monitors and alerts on potential issues (gathered from various sources) before a customer or operations team reports them. It consists of a set of tools, patterns, and practices for building data observability components for your big data workloads on a cloud platform, with the goal of reducing data downtime.

Select examples of our experience in data observability and quality:

[Figure: client and use-case examples]

Tools and Technology

[Figure: Data Observability tools and technology]

Tiger Analytics Data Observability is a set of tools, patterns, and best practices to:

  • Ingest MELT (Metrics, Events, Logs, Traces) data
  • Enrich and store MELT data to derive insights on event and log correlations, data anomalies, pipeline failures, and performance metrics
  • Configure data quality rules using a self-service UI (a minimal rule-and-alert sketch appears after the lists below)
  • Monitor operational metrics such as data quality, pipeline health, and SLAs
  • Alert the business team when there is data downtime
  • Perform root cause analysis
  • Fix broken pipelines and data quality issues

This helps to:

  • Minimize data downtime using automated data quality checks
  • Discover data problems before they impact business KPIs
  • Accelerate troubleshooting and root cause analysis
  • Boost productivity and reduce operational costs
  • Improve operational excellence, QoS, and uptime
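As promised above, here is a minimal sketch of what configurable data quality rules and data downtime alerts might look like in code. This is an illustration only, not the Tiger Analytics product: the rule format, table and column names, thresholds, and the `evaluate_rules` helper are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical declarative rules, similar in spirit to what a self-service UI might capture.
rules = [
    {"table": "accounts", "check": "not_null", "column": "contact_info", "max_null_rate": 0.01},
    {"table": "accounts", "check": "row_count_min", "min_rows": 50},
]

def evaluate_rules(df: pd.DataFrame, table: str) -> list[str]:
    """Return a list of human-readable rule breaches for the given table."""
    breaches = []
    for rule in (r for r in rules if r["table"] == table):
        if rule["check"] == "not_null":
            null_rate = df[rule["column"]].isna().mean()
            if null_rate > rule["max_null_rate"]:
                breaches.append(
                    f"{table}.{rule['column']} null rate {null_rate:.1%} exceeds {rule['max_null_rate']:.1%}"
                )
        elif rule["check"] == "row_count_min":
            if len(df) < rule["min_rows"]:
                breaches.append(f"{table} has only {len(df)} rows (expected >= {rule['min_rows']})")
    return breaches

# Toy batch of data to evaluate; in practice this would be the latest pipeline load.
accounts = pd.DataFrame({"contact_info": ["a@x.com", None, None, "b@y.com"]})
for breach in evaluate_rules(accounts, "accounts"):
    print("ALERT (data downtime):", breach)  # stand-in for paging/Slack/email alerting
```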

Data observability and Generative AI (GenAI) can play crucial roles in enhancing data-driven decision-making and machine learning (ML) model performance.

Data observability lays the groundwork by instilling confidence: high-quality, always-available data forms the foundation of any data-driven initiative. GenAI builds on that foundation, opening up new avenues to simulate, generate, and innovate. Organizations can combine the two to improve their data capabilities, decision-making processes, and innovation across different areas.

Investors have taken notice: Monte Carlo, a company that produces a data monitoring tool, has raised $135 million, Observe has raised $112 million, and Acceldata has raised $100 million, all on the strength of their technology in the data observability space.

To summarize

Data observability is an approach to identifying anomalies in business processes and in the operation of applications and infrastructure, allowing users to respond quickly to emerging incidents. It lets you ensure your data is fresh, accurate, and flowing smoothly, and helps you catch potential problems early on.

Even if there is no particular novelty in the underlying technology, there is certainly novelty in the approach, the tools, and the new terminology, which make it easier to convince investors and clients. The next few years will show how successful the new players in this market will be.


Unlocking Data Insights: What You Must Know About Apache Kylin
https://www.tigeranalytics.com/perspectives/blog/unlocking-data-insights-what-you-must-know-about-apache-kylin/
Published 5 Jun 2019

Get to know the architecture, challenges, and optimization techniques of Apache Kylin, an open-source distributed analytical engine for SQL-based multidimensional analysis (OLAP) on Hadoop. Learn how Kylin pre-calculates OLAP cubes and leverages a scalable computation framework to enhance query performance.

This post is about Kylin, its architecture, and the various challenges and optimization techniques within it. There are many “OLAP in Hadoop” tools available: open-source options include Kylin and Druid, while commercial options include AtScale and Kyvos. I have used Apache Kylin because it is better suited to historical data than Druid.

What is Kylin?

Apache Kylin is an open-source distributed analytical engine that provides a SQL interface and multidimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It pre-calculates OLAP cubes using a horizontally scalable computation framework (MapReduce or Spark) and stores the cubes in a reliable, scalable datastore (HBase).
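Once a cube is built, you query it with plain SQL. As a quick illustration, the sketch below sends a SQL query to Kylin's REST query endpoint from Python; the host, credentials, project, and table names are placeholders, and the endpoint details may vary between Kylin versions.

```python
import requests

# Placeholder connection details; adjust for your Kylin deployment.
KYLIN_HOST = "http://kylin-host:7070"
AUTH = ("ADMIN", "KYLIN")  # default demo credentials; change in production

query = {
    "sql": "SELECT part_dt, SUM(price) AS revenue "
           "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt",
    "project": "learn_kylin",  # assumed sample project name
    "limit": 10,
}

resp = requests.post(f"{KYLIN_HOST}/kylin/api/query", json=query, auth=AUTH, timeout=30)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row)  # each row is a list of column values served from the pre-built cube
```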

Why Kylin?

In most big data use cases, the challenge is to return a query result within a second, while scanning a large database to compute the answer at query time takes far longer. This is where the concept of ‘OLAP in Hadoop’ emerged: it combines the strengths of OLAP and Hadoop to deliver a significant improvement in query latency.

Source: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

How It Works

Here are the steps Kylin follows to fetch data and save the results:

  • First, it syncs the input source table. In most cases, it reads data from Hive.
  • Next, it runs MapReduce or Spark jobs (depending on the engine you select) to pre-calculate each level of cuboids covering all possible combinations of dimensions, computing all the metrics at every level.
  • Finally, it stores the cube data in HBase, where the dimension values form the rowkeys and the measures are stored in column families.

Additionally, it leverages ZooKeeper for job coordination.
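To see what “pre-calculating every cuboid” means, here is a small, purely illustrative Python sketch (not Kylin’s actual implementation) that enumerates every combination of dimensions for a toy sales table and pre-aggregates a measure for each cuboid.

```python
from itertools import combinations
import pandas as pd

# Toy fact table: three dimensions and one measure.
sales = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "category": ["toys", "books", "toys", "toys"],
    "year": [2024, 2024, 2024, 2023],
    "price": [10.0, 25.0, 7.5, 12.0],
})

dimensions = ["country", "category", "year"]

# Enumerate all 2^3 cuboids: every subset of the dimension set.
cuboids = {}
for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        if dims:
            cuboids[dims] = sales.groupby(list(dims))["price"].sum()
        else:
            cuboids[dims] = sales["price"].sum()  # the apex cuboid (grand total)

# A query like "total price by country" is now a lookup into a pre-aggregated
# result rather than a scan of the raw data.
print(cuboids[("country",)])
```

Kylin does this at much larger scale with MapReduce or Spark and persists the results in HBase, but the underlying idea is the same trade of storage and build time for query speed.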

Kylin Architecture:

Source: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

In Kylin, several cubing algorithms have been released over time. Here are the three types of cubing:

  • By layer
  • Fast cubing (also known as “in-mem”)
  • By layer on Spark

On submitting a cubing job, Kylin pre-allocates steps for both “by layer” and “in-mem”, but it picks only one to execute and skips the other. By default, the algorithm is “auto”, and Kylin selects one of the two based on its understanding of the data picked up from Hive.

Source: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

So far, we have had a glimpse of how Kylin works. Now let us look at the real-world challenges, how to fix them, and how to optimize cube build time.

Challenges and Workarounds:

  • In Kylin 2.2, one cannot change the datatype of a measure column. By default, Kylin uses decimal(19,4) for a double-type metric column. The workaround is to change the cube’s metadata using the “metadata backup” and “restore” commands (https://kylin.apache.org/docs/howto/howto_backup_metadata.html). After taking a backup, find the cube description in the /cube_desc folder, locate your cube, and edit it. After making the changes, run the command below and restart Kylin; Kylin expects that the cube signature is not edited manually, hence this check: ./bin/metastore.sh refresh-cube-signature
  • In Kylin 2.3.2, a query such as ‘select * from tablename’ displays empty/null values in the metric columns. This is because Kylin stores only aggregated values and returns measure values only when the query includes a ‘group by’ clause. If you still need such results, you can use Kylin’s query push-down feature for queries that cannot be answered by any cube: Kylin supports pushing these queries down to backup query engines like Hive, SparkSQL, or Impala through JDBC.
  • Sometimes cube builds fail continuously even if you discard and rerun or resume the job. The reason is that ZooKeeper may still hold a stale Kylin directory; the workaround is to remove the Kylin entry from ZooKeeper, after which the cube builds successfully.

Summary

The key takeaway from this post is that Apache Kylin significantly improves query latency, provided that we control unnecessary cuboid combinations using the “Aggregation Group” (AGG) feature Kylin provides. This feature helps reduce both cube build time and query time.
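To get an intuition for why aggregation groups matter, here is a rough back-of-the-envelope calculation. It is illustrative only: the dimension counts are assumptions, and it ignores Kylin’s mandatory, hierarchy, and joint dimension settings, which change the exact numbers. Without pruning, a cube with n dimensions has 2^n cuboids, whereas splitting the dimensions into independent aggregation groups only requires (approximately) the cuboids within each group.

```python
# Back-of-the-envelope comparison of cuboid counts (illustrative only).
# Assumption: a cube with 12 dimensions, split into two aggregation groups
# of 6 dimensions each.

n_dimensions = 12
groups = [6, 6]

full_cube = 2 ** n_dimensions                  # every possible cuboid
with_agg_groups = sum(2 ** k for k in groups)  # approximate cuboids built per group

print(f"Full cube: {full_cube} cuboids")                            # 4096
print(f"With two aggregation groups: ~{with_agg_groups} cuboids")   # ~128
```

Fewer cuboids to build and store directly translates into shorter build times and smaller HBase storage, which is why grouping dimensions sensibly is the single most effective cube optimization.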

Hope this post has given some valuable insights about Apache Kylin. Happy Learning!

Reference: Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
