Building Dynamic Data Pipelines with Snowpark: Our Framework to Drive Modern Data Transformation

Learn about the challenges of traditional data transformation methods and how a dynamic approach using metadata configuration can help address these issues. By defining transformation rules and specifications, enterprises can create flexible pipelines that adapt to their evolving data processing needs, ultimately accelerating the process of extracting insights from data.

Why is Data Transformation important?

E owns a retail business that makes daily sales through online and brick-and-mortar stores. Every day, a huge wave of data lands in their systems: products viewed, orders placed, returns received, new customers, repeat customers, and so on. Data ingestion pipelines help them bring this data in from disparate sources. The resulting pool of raw data is large and potentially valuable, but how do they harmonize it and turn it into actionable insights that boost the business?

That’s where Data transformation comes in.

Customers may have diverse data formats and reporting requirements, which is why data transformation is a crucial aspect of any analytics product. Traditional methods of manually building data transformation pipelines are time-consuming and often delay the journey from raw data to insight. One way to address this is dynamic transformation driven by metadata configuration: by defining transformation rules and specifications as metadata, teams can create a flexible pipeline that adapts to changing data processing needs.

Modernizing Data Platforms with Snowflake: Our Snowpark & Streamlit Solution

At Tiger Analytics, we created a custom solution with Snowpark and Streamlit that builds dynamic data transformation pipelines from a metadata-driven framework, accelerating data platform modernization on Snowflake. Once data is available in the raw layer, the framework can be leveraged to build a data pipeline that transforms and harmonizes data in subsequent layers. The high-level architecture of this Snowpark framework is depicted below in Figure 1.


Figure 1: High-level architecture

Tiger’s Snowpark Data Transformation Framework Provides Five Key Functionalities:

  • Metadata Setup: Once data ingestion is complete, users feed Source to Target Mapping (STTM) and additional driver (technical) metadata (incremental/full load, schedule timing, SCD types, etc.) into this Snowpark-based framework; this metadata drives the entire process. The input details are parsed, validated against the Snowflake raw layer, and merged into metadata tables.

  • Dynamic Transformation Pipeline Generation: After the metadata setup is complete, Snowpark code reads the STTM metadata table and generates the data transformation SQL dynamically. Based on the additional driver metadata, it implements change data capture (CDC) using watermark/audit columns on top of the raw-layer datasets, truncate-and-load, dynamic merge/insert SQL for SCD types, and more. The pipeline is driven purely by user-provided metadata, so any changes are reflected when the pipeline is regenerated (a simplified sketch of this step appears after this list). With this Snowpark framework, it takes just a few clicks to accommodate changes such as:
    • A new data source or new logic to be included
    • Logic changes to an existing pipeline

  • Schedule and Execution: Snowflake Tasks are used to orchestrate the data transformation pipelines and are created dynamically by the Snowpark framework. Users provide their preferred cadence as input, which is used to create tasks with a Cron schedule. Once the tasks are created and resumed, they trigger automatically at the scheduled cadence. Because the framework is metadata-driven, any changes are accommodated in the pipeline quickly.

  • Alert & Notification: A Snowpark-based notification framework triggers Outlook emails and Slack notifications if any failures are observed during pipeline execution.

  • Detailed Logging: The framework captures detailed logs at each step for ease of debugging. It also records pipeline execution status, which can be a key metric for downstream modules such as Observability.
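
To make the pipeline-generation and scheduling steps more concrete, here is a minimal, illustrative Snowpark sketch. The metadata table name (METADATA.STTM_MAPPING), its columns, the source/target tables, warehouse, and cron cadence are all assumptions for the example; the actual framework additionally handles CDC, SCD types, and validation beyond what is shown here.

```python
from snowflake.snowpark import Session

# Hypothetical connection config; a real deployment would read this from secure config.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "TRANSFORM_WH",
    "database": "<db>", "schema": "<schema>",
}).create()

def build_merge_sql(sttm_rows, target_table, source_table, key_cols):
    """Build a MERGE statement from STTM rows of the form
    (source_column, target_column, transformation_expr)."""
    select_exprs = [f"{expr or src} AS {tgt}" for src, tgt, expr in sttm_rows]
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
    update_set = ", ".join(f"t.{tgt} = s.{tgt}" for _, tgt, _ in sttm_rows)
    insert_cols = ", ".join(tgt for _, tgt, _ in sttm_rows)
    insert_vals = ", ".join(f"s.{tgt}" for _, tgt, _ in sttm_rows)
    select_list = ", ".join(select_exprs)
    return (
        f"MERGE INTO {target_table} t "
        f"USING (SELECT {select_list} FROM {source_table}) s ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {update_set} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

# Read mapping rows for one target table from a hypothetical STTM metadata table.
sttm_rows = [
    (r.SOURCE_COLUMN, r.TARGET_COLUMN, r.TRANSFORMATION_EXPR)
    for r in session.table("METADATA.STTM_MAPPING")
                    .filter("TARGET_TABLE = 'HARMONIZED.ORDERS'")
                    .collect()
]
merge_sql = build_merge_sql(sttm_rows, "HARMONIZED.ORDERS", "RAW.ORDERS", ["ORDER_ID"])

# Wrap the generated SQL in a Snowflake Task using the user-supplied cron cadence,
# then resume the task so it runs on schedule.
session.sql(f"""
    CREATE OR REPLACE TASK HARMONIZED.ORDERS_LOAD_TASK
      WAREHOUSE = TRANSFORM_WH
      SCHEDULE = 'USING CRON 0 2 * * * UTC'
    AS {merge_sql}
""").collect()
session.sql("ALTER TASK HARMONIZED.ORDERS_LOAD_TASK RESUME").collect()
```

In practice, the load type, SCD handling, and cadence would themselves come from the driver metadata rather than hard-coded literals, which is what makes the pipeline regenerate cleanly when the metadata changes.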

This metadata-driven Snowpark framework can be leveraged in any Snowflake data transformation program to set up dynamic transformation pipelines. It helps accelerate the overall journey and implements business logic efficiently, with fewer of the code-quality issues that often arise from human error.

Comparing our Snowpark Framework to Other Data Transformation Pipelines

What makes our Snowpark framework unique compared to other traditional data transformation pipelines?

  • Agility: Once the data model is finalized, the framework enables faster, dynamic implementation of transformation pipelines in Snowflake.
  • Flexibility: Users can easily modify the logic to suit each customer's specific needs without manually changing the underlying code.
  • Efficiency: Less effort is needed to implement changes; when the input metadata changes, the pipeline is updated accordingly.
  • Auto alerts: Emails are triggered via Snowflake's native email integration feature to alert users of any pipeline execution failures.
  • Multi-feature support: Provides the flexibility to enable SCD-type implementations, change data capture, and detailed end-to-end orchestration leveraging Snowflake Tasks.
  • Pipeline Monitoring: Proper logs are maintained at each level so issues can be traced easily; this has proven highly impactful in reducing the time the support team spends debugging.
  • Scalability: Data volumes keep increasing, and dynamic transformation pipelines in a cloud data warehouse like Snowflake can scale with that demand.
  • Ease of Maintenance: Simple to maintain and update, since metadata changes are made directly from the Streamlit app without manually altering the pipeline.

Final thoughts

At Tiger Analytics, we created a flexible metadata-driven solution to generate dynamic data transformation pipelines. With the help of the Snowpark framework, business teams can now rapidly convert their Source-to-Target Mapping (STTM) logic into executable pipelines, significantly reducing the time to market.

The ability to quickly adapt transformation rules and handle changes through a simple, metadata-driven process ensures that businesses can harmonize raw data from multiple silos without delays, delivering faster insights and value.

With its scalability, adaptability, and ease of maintenance, this framework allows organizations to manage complex transformations effectively. In short, dynamic data transformation ensures that the data is always optimized for decision-making, empowering businesses to stay agile and competitive.

How to Simplify Data Profiling and Management with Snowpark and Streamlit

Learn why data quality is one of the most overlooked aspects of data management. While all models need good quality data to generate useful insights and patterns, data quality is especially important in industries like retail, healthcare, and finance. In this blog, we explore how data profiling can help you understand your data quality, and how Tiger Analytics leverages Snowpark and Streamlit to simplify data profiling and management.

The accuracy of the data-to-insights journey is underpinned by one of the most foundational yet often overlooked aspects of data management: data quality. While all models need good quality data to generate useful insights and patterns, data quality is especially critical in industries like retail, healthcare, and finance. Inconsistent, missing, or duplicate data can impact critical operations, from customer segmentation to demand forecasting, and can even affect regulatory compliance, resulting in potential financial or reputational losses.

Let’s look at an example:

A large retail company relies on customer data from various sources, such as online orders, in-store purchases, and loyalty program interactions. Over time, inconsistencies and errors in the customer database, such as duplicate records, incorrect addresses, and missing contact details, impacted the company’s ability to deliver personalized marketing campaigns, segment customers accurately, and forecast demand.

Data Profiling Matters – Third-party or Native app? Understanding the options

Data profiling helps an organization understand the nature of its data before building data models, and it ensures data quality and consistency, enabling faster decision-making and more accurate insights.

  • Improves Data Accuracy: Identifies inconsistencies, errors, and missing values.
  • Supports Better Decision-Making: Ensures reliable data for predictive analytics.
  • Enhances Efficiency: Helps detect and remove redundant data, optimizing resources and storage.

For clients using Snowflake for data management purposes, traditional data profiling tools often require moving data outside of Snowflake, creating complexity, higher costs, and security risks.

  • Data Transfer Overhead: External tools may require data to be moved out of Snowflake, increasing latency and security risks.
  • Scalability Limitations: Third-party tools may struggle with large Snowflake datasets.
  • Cost and Performance: Increased egress costs and underutilization of Snowflake’s native capabilities.
  • Integration Complexity: Complex setup and potential incompatibility with Snowflake’s governance and security features.

At Tiger Analytics, our clients faced a similar problem statement. To address these issues, we developed a Snowflake Native App utilizing Snowpark and Streamlit to perform advanced data profiling and analysis within the Snowflake ecosystem. This solution leverages Snowflake’s virtual warehouses for scalable, serverless computational power, enabling efficient profiling without external infrastructure.

How Snowpark Makes Data Profiling Simple and Effective

Snowpark efficiently manages large datasets by chunking data into smaller pieces, ensuring smooth profiling tasks. We execute YData Profiler and custom Python functions directly within Snowflake, storing results like outlier detection and statistical analysis for historical tracking.
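
As an illustration of this pattern, the sketch below profiles a single table with the YData Profiler from inside a Snowpark session and stages the HTML output. The table name, stage, and sampling limit are assumptions for the example, and it presumes ydata-profiling is available in the Snowflake Python environment.

```python
import pandas as pd
from snowflake.snowpark import Session
from ydata_profiling import ProfileReport

def profile_table(session: Session, table_name: str, stage: str) -> str:
    """Profile a Snowflake table with ydata-profiling and stage the HTML report."""
    # Pull the table (or a bounded sample of it) into pandas via Snowpark.
    df: pd.DataFrame = session.table(table_name).limit(100_000).to_pandas()

    # Generate the profiling report (column stats, missing values, correlations, etc.).
    report = ProfileReport(df, title=f"Profile: {table_name}", minimal=True)
    local_path = f"/tmp/{table_name.replace('.', '_')}_profile.html"
    report.to_file(local_path)

    # Upload the HTML report to an internal stage so it can be shared later.
    session.file.put(local_path, stage, auto_compress=False, overwrite=True)
    return f"Report staged for {table_name}"
```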

We also created stored procedures and UDFs with Snowpark to automate daily or incremental profiling jobs. The app tracks newly ingested data, using Snowflake’s Task Scheduler to run operations automatically. Additionally, profiling outputs integrate seamlessly into data pipelines, with alerts triggered when anomalies are detected, ensuring continuous data quality monitoring.
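
A sketch of how that automation might be wired together: the profiling function from the previous snippet is registered as a permanent Snowpark stored procedure and then invoked on a cadence by a Snowflake Task. The procedure name, stages, warehouse, and schedule below are illustrative assumptions, not the app's actual objects.

```python
# Register the profiling function (from the previous sketch) as a permanent
# stored procedure so it can be called by name, e.g. from a Snowflake Task.
session.sproc.register(
    func=profile_table,
    name="RUN_TABLE_PROFILING",
    packages=["snowflake-snowpark-python", "ydata-profiling", "pandas"],
    is_permanent=True,
    stage_location="@SPROC_CODE",   # hypothetical stage holding the procedure code
    replace=True,
)

# Schedule the procedure with Snowflake's Task Scheduler (daily at 01:00 UTC here).
session.sql("""
    CREATE OR REPLACE TASK DAILY_PROFILING_TASK
      WAREHOUSE = PROFILING_WH
      SCHEDULE = 'USING CRON 0 1 * * * UTC'
    AS CALL RUN_TABLE_PROFILING('SALES.PUBLIC.CUSTOMER_ORDERS', '@PROFILING_REPORTS')
""").collect()
session.sql("ALTER TASK DAILY_PROFILING_TASK RESUME").collect()
```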

By keeping operations within Snowflake, Snowpark reduces data transfer, lowering latency and enhancing performance. Its native integration ensures efficient, secure, and scalable data profiling.

Let’s look at the key features of the app, built leveraging Snowpark’s capabilities.

Building a Native Data Profiling App in Snowflake – Lessons learnt:

1. Comprehensive Data Profiling

At the core of the app's profiling capabilities are Python libraries – the YData Profiler and custom-built profilers – integrated using Snowpark. These libraries allow users to profile data directly within Snowflake by leveraging its built-in compute resources.

Key features include:

  • Column Summary Statistics: Quickly review key statistics for columns of all data types (string, number, date) to understand the data at a glance.
  • Data Completeness Checks: Identify missing values and assess the completeness of your datasets to ensure no critical information is overlooked.
  • Data Consistency Checks: Detect anomalies or inconsistent data points to ensure that your data is uniform and accurate across the board.
  • Pattern Recognition and Value Distribution: Analyze data patterns and value distributions to identify trends or detect unusual values that might indicate data quality issues.
  • Overall Data Quality Checks: Review the health of your dataset by identifying potential outliers, duplicates, or incomplete data points.

2. Snowflake Compute Efficiency

The app runs entirely within Snowflake’s virtual warehouse environment. No external servers or machines are needed, as the app fully utilizes Snowflake’s built-in computing power. This reduces infrastructure complexity while ensuring top-tier performance, allowing users to profile and manage even large datasets efficiently.

3. Flexible Profiling Options

The app allows users to conduct profiling in two distinct ways—either by examining entire tables or by focusing on specific columns. This flexibility ensures that users can tailor the profiling process to their exact needs, from broad overviews to highly targeted analyses.

4. Full Data Management Capabilities

In addition to profiling, the app supports essential data management tasks. Users can insert, update, and delete records within Snowflake directly from the app, providing an all-in-one tool for both profiling and managing data.

5. Streamlit-Powered UI for Seamless Interaction

The app is built using Streamlit, which provides a clean, easy-to-use user interface. The UI allows users to interact with the app’s profiling and data management features without needing deep technical expertise. HTML-based reports generated by the app can be easily shared with stakeholders, offering clear and comprehensive data insights.
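
For a sense of what this looks like in code, here is a pared-down Streamlit-in-Snowflake sketch; the schema filter and the simple pandas summaries stand in for the app's richer profiling and report-sharing features.

```python
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # available when running inside Snowflake

st.title("Data Profiling")

# Let the user pick a table and, optionally, specific columns to profile.
tables = [r.TABLE_NAME for r in session.sql(
    "SELECT table_name FROM information_schema.tables WHERE table_schema = 'PUBLIC'"
).collect()]
table = st.selectbox("Table to profile", tables)
columns = st.multiselect(
    "Columns (leave empty for all)",
    [f.name for f in session.table(table).schema.fields] if table else [],
)

if st.button("Run profiling"):
    df = session.table(table)
    if columns:
        df = df.select(columns)
    pdf = df.limit(50_000).to_pandas()
    st.subheader("Summary statistics")
    st.dataframe(pdf.describe(include="all"))
    st.subheader("Missing values per column")
    st.dataframe(pdf.isna().sum().rename("missing_count"))
```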

6. Ease in Generating and Sharing Profiling Reports

Once the data profiling is complete, the app generates a pre-signed URL that allows users to save and share the profiling reports. Here's how it works (a short code sketch follows this list):

  • Generating Pre-Signed URLs: The app creates a pre-signed URL to a file on a Snowflake stage using the stage name and relative file path. This URL provides access to the generated reports without requiring direct interaction with Snowflake’s internal storage.
  • Accessing Files: Users can access the files in the stage through several methods:
    • Navigate directly to the pre-signed URL in a web browser.
    • Retrieve the pre-signed URL within Snowsight by clicking on it in the results table.
    • Send the pre-signed URL in a request to the REST API for file support.
  • Handling External Stages: For files in external stages that reference Microsoft Azure cloud storage, the function requires Azure Active Directory authentication. This is because querying the function fails if the container is accessed using a shared access signature (SAS) token. The GET_PRESIGNED_URL function requires Azure Active Directory authentication to create a user delegation SAS token, utilizing a storage integration object that stores a generated service principal.
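
As a small example of the first step, a staged report can be exposed through Snowflake's GET_PRESIGNED_URL function from Snowpark; the stage and file names here are placeholders.

```python
def get_report_url(session, stage: str, relative_path: str, expiry_seconds: int = 3600) -> str:
    """Return a pre-signed URL for a profiling report stored on a Snowflake stage.

    `stage` should include the leading '@', e.g. '@PROFILING_REPORTS'.
    """
    row = session.sql(
        f"SELECT GET_PRESIGNED_URL({stage}, '{relative_path}', {expiry_seconds}) AS url"
    ).collect()[0]
    return row.URL

# Example: share the latest report for one hour.
url = get_report_url(session, "@PROFILING_REPORTS", "customer_orders_profile.html")
print(url)
```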

7. Support for Different Roles and Scenarios

Different roles within an organization can utilize this app in various scenarios:

  • Data Analysts: Data analysts can use the app to profile datasets, identify inconsistencies, and understand data quality issues. They will analyze the patterns and relationships in the data and point out the necessary fixes to resolve any errors, such as missing values or outliers.
  • Data Stewards/Data Owners: After receiving insights from data analysts, data stewards or data owners can apply the suggested fixes to cleanse the data, ensuring it meets quality standards. They can make adjustments directly through the app by inserting, updating, or deleting records, ensuring the data is clean and accurate for downstream processes.

This collaborative approach between analysts and data stewards ensures that the data is high quality and reliable, supporting effective decision-making across the organization.

Final notes

Snowpark offers a novel approach to data profiling by bringing it into Snowflake's native environment. This approach reduces complexity, enhances performance, and ensures security. Whether improving customer segmentation in retail, ensuring compliance in healthcare, or detecting fraud in finance, Snowflake Native Apps built with Snowpark provide a timely solution for maintaining high data quality across industries.

For data engineers looking to address client pain points, this translates to:

  • Seamless Deployment: Easily deployable across teams or accounts, streamlining collaboration.
  • Dynamic UI: The Streamlit-powered UI provides an interactive dashboard, allowing users to profile data without extensive technical knowledge.
  • Flexibility: Supports profiling of both Snowflake tables and external files (e.g., CSV, JSON) in external stages like S3 or Azure Blob.

With upcoming features like AI-driven insights, anomaly detection, and hierarchical data modeling, Snowpark provides a powerful and flexible platform for maintaining data quality across industries, helping businesses make smarter decisions and drive better outcomes.

Data Science Strategies for Effective Process System Maintenance

Industry understanding of managing planned maintenance is fairly mature. This article focuses on how Data Science can impact unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and subsystems.

Data Science applications are gaining significant traction in the preventive and predictive maintenance of process systems across industries. A clear mindset shift has made it possible to steer maintenance from a 'reactive', run-to-failure approach to one that is proactive and preventive in nature.

Planned or scheduled maintenance uses data and experiential knowledge to determine the periodicity of servicing required to keep plant components in good health. It is typically driven by plant maintenance teams or OEMs through maintenance rosters and AMCs. Unplanned maintenance, on the other hand, occurs at random and impacts downtime/production, safety, inventory, and customer sentiment, besides adding to the cost of maintenance (including labor and material).

Interestingly, statistics reveal that almost 50% of scheduled maintenance projects are unnecessary and almost a third of them are improperly carried out. Poor maintenance strategies are known to cost organizations as much as 20% of their production capacity – eroding the benefits that a move from a reactive to a preventive maintenance approach would provide. Despite years of expertise in managing maintenance activities, unplanned downtime impacts almost 82% of businesses at least once every three years. Given the significant impact on production capacity, aggregated annual downtime costs for the manufacturing sector are upwards of $50 billion (WSJ), with average hourly costs of unplanned maintenance in the range of $250K.

It is against this backdrop that data-driven solutions need to be developed and deployed. Can Data Science solutions bring about significant improvement in the maintenance domain and prevent any or all of the above costs? Are the solutions scalable? Do they provide an understanding of what went wrong? Can they provide insights into alternative and improved ways to manage planned maintenance activities? Does Data Science help reduce all types of unplanned events or just a select few? These are questions that manufacturers need answered, and it is for experts from both the maintenance and data science domains to address them.

Industry understanding of managing planned maintenance is fairly mature. This article therefore focuses on unplanned maintenance, which demands a differentiated approach to building insight and understanding around the process and its subsystems.

Data Science solutions are accelerating the industry’s move towards ‘on-demand’ maintenance wherein interventions are made only if and when required. Rather than follow a fixed maintenance schedule, data science tools can now aid plants to increase run lengths between maintenance cycles in addition to improving plant safety and reliability. Besides the direct benefits that result in reduced unplanned downtime and cost of maintenance, operating equipment at higher levels of efficiency improves the overall economics of operation.

The success of this approach was demonstrated in refinery CDU preheat trains that use soft sensing triggers to decide when to process ‘clean crude’ (to mitigate the fouling impact) or schedule maintenance of fouled exchangers. Other successes were in the deployment of plant-wide maintenance of control valves, multiple-effect evaporators in plugging service, compressors in petrochemical service, and a geo-wide network of HVAC systems.

Instead of using a fixed roster for the maintenance of PID control valves, plants can now detect and diagnose control valves that are malfunctioning. Additionally, in combination with domain and operations information, this diagnosis can be used to suggest prescriptive actions, such as auto-tuning of the valves, which improve maintenance and operations metrics.

Reducing unplanned, unavoidable events

It is important to bear in mind that not all unplanned events are avoidable. The inability to avoid events could be either because they are not detectable enough or because they are not actionable. The latter could occur either because the available response time is too short or because the knowledge needed to revert a system to its normal state does not exist. A large number of unplanned events, however, are avoidable, and data science tools improve the accuracy with which they are detected and prevented.

The focus of the experts working in this domain is to reduce unplanned events and transition events from unavoidable to avoidable. Using advanced tools for detection, diagnosis, and enabling timely actions to be taken, companies have managed to reduce their downtime costs significantly. The diversity of solutions that are available in the maintenance area covers both plant and process subsystems.

Some of the data science techniques deployed in the maintenance domain are briefly described below:

Condition Monitoring
This has been used to monitor and analyze process systems over time, and predict the occurrence of an anomaly. These events or anomalies could have short or long propagation times such as the ones seen in the fouling in exchangers or in the cavitation in pumps. The spectrum of solutions in this area includes real-time/offline modes of analysis, edge/IoT devices and open/closed loop prescriptions, and more. In some cases, monitoring also involves the use of soft sensors to detect fouling, surface roughness, or hardness – these parameters cannot be measured directly using a sensor and therefore, need surrogate measuring techniques.
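
As a simple illustration of a soft sensor used in condition monitoring, the sketch below estimates a heat-exchanger heat-transfer coefficient (UA) from measured temperatures and flows and tracks its decline as a fouling indicator. The column names, units, and alert threshold are assumptions for illustration, not the deployed solution.

```python
import numpy as np
import pandas as pd

def fouling_soft_sensor(df: pd.DataFrame, ua_clean: float, alert_fraction: float = 0.7) -> pd.DataFrame:
    """Estimate UA = Q / LMTD from hot/cold side temperatures and hot-side duty,
    then flag samples where UA has degraded below a fraction of the clean value.

    Expected columns (assumed names): m_hot [kg/s], cp_hot [kJ/kg.K],
    T_hot_in, T_hot_out, T_cold_in, T_cold_out [degC].
    """
    q = df["m_hot"] * df["cp_hot"] * (df["T_hot_in"] - df["T_hot_out"])   # duty, kW
    dt1 = df["T_hot_in"] - df["T_cold_out"]                               # counter-current approach temps
    dt2 = df["T_hot_out"] - df["T_cold_in"]
    lmtd = (dt1 - dt2) / np.log(dt1 / dt2)
    out = df.copy()
    out["UA"] = q / lmtd                                                  # soft-sensed heat transfer coefficient
    out["UA_smoothed"] = out["UA"].rolling(24, min_periods=1).median()    # smooth out operational noise
    out["fouling_alert"] = out["UA_smoothed"] < alert_fraction * ua_clean
    return out
```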

One of the more distinctive challenges of working in the manufacturing domain is the need for data reconciliation. Sensor data tend to be spurious and prone to operational fluctuations, drift, biases, and other errors. Using raw sensor information is unlikely to satisfy the material and energy balances for process units. Data reconciliation uses a first-principles understanding of the process systems and assigns a 'true value' to each sensor. These revised sensor values allow a more rigorous approach to condition monitoring, which would otherwise expose process systems to greater risk when using raw sensor information. Sensor validation, a technique to analyze individual sensors in tandem with data reconciliation, is critical to setting a strong foundation for any analytics models to be deployed. These elaborate areas of work ensure a greater degree of success when deploying any solution that involves the use of sensor data.
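
For a linear steady-state case, data reconciliation can be written as a weighted least-squares projection of the raw measurements onto the balance constraints. The sketch below assumes a simple splitter (one inlet, two outlets) with known sensor variances; it is illustrative of the idea, not a production approach.

```python
import numpy as np

def reconcile(x_meas: np.ndarray, A: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Linear data reconciliation: minimise (x - x_meas)' W^-1 (x - x_meas)
    subject to A x = 0, where W = diag(sigma^2). Closed-form solution:
        x_hat = x_meas - W A' (A W A')^-1 A x_meas
    """
    W = np.diag(sigma ** 2)
    AWAt = A @ W @ A.T
    correction = W @ A.T @ np.linalg.solve(AWAt, A @ x_meas)
    return x_meas - correction

# Splitter mass balance: F_in - F_out1 - F_out2 = 0
A = np.array([[1.0, -1.0, -1.0]])
x_meas = np.array([100.4, 60.1, 41.2])   # raw flow sensor readings (t/h)
sigma = np.array([1.0, 0.8, 0.8])        # sensor standard deviations
x_hat = reconcile(x_meas, A, sigma)
print(x_hat, "residual:", A @ x_hat)     # residual ~ 0 after reconciliation
```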

Fault Detection
This is a mature area of work and uses solutions ranging from those driven entirely by domain knowledge, such as pump curves and the detection of deviations from them, to those that rely only on historical sensor/maintenance/operations data for analysis. An anomaly or fault is defined as a deviation from 'acceptable' operation, but the context and definitions need to be clearly understood when working with different clients. Faults may be related to equipment, quality, plant systems, or operability. A good business context and understanding of client requirements are necessary for the design and deployment of the right techniques. From basic tools that use sensor thresholds and run charts to more advanced techniques such as classification, pattern analysis, and regression, a wide range of solutions can be successfully deployed.
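
As an illustration of the data-driven end of this spectrum, the sketch below flags anomalous sensor snapshots with an Isolation Forest trained on a window of known-normal operation. The tag names and contamination level are assumptions for the example.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def fit_fault_detector(normal_df: pd.DataFrame, sensors: list[str]) -> IsolationForest:
    """Train an anomaly detector on a period of known-normal operation."""
    model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
    model.fit(normal_df[sensors])
    return model

def score_faults(model: IsolationForest, live_df: pd.DataFrame, sensors: list[str]) -> pd.DataFrame:
    """Return live data with an anomaly score and a fault flag."""
    out = live_df.copy()
    out["anomaly_score"] = model.decision_function(live_df[sensors])
    out["is_fault"] = model.predict(live_df[sensors]) == -1
    return out

# Example usage with hypothetical pump tags:
# sensors = ["discharge_pressure", "flow", "motor_current", "bearing_temp"]
# detector = fit_fault_detector(historian_normal_window, sensors)
# flagged = score_faults(detector, latest_snapshot, sensors)
```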

Early Warning Systems
The detection of process anomalies in advance helps in the proactive management of abnormal events. Improving actionability or response time allows faults to be addressed before setpoints/interlocks are triggered. The methodology varies across projects and there is no ‘one-size-fits-all’ approach. Problem complexity could range from using single sensor information as lead indicators (such as using sustained pressure loss in a vessel to identify a faulty gasket that might rupture) to far more complex methods of analysis.

A typical challenge in developing early warning systems is achieving 100% detectability of anomalies, but an even larger challenge is filtering out false indications. Detecting all anomalies while applying robust filtering techniques is critical to successful deployment.
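
One simple way to trade off early detection against false alarms is to combine a smoothed deviation signal with a persistence rule, raising a warning only when the deviation stays outside its band for several consecutive samples. The sketch below is illustrative; thresholds, smoothing, and window lengths would be tuned per system.

```python
import pandas as pd

def early_warning(signal: pd.Series, baseline: float, band: float,
                  alpha: float = 0.1, persistence: int = 5) -> pd.Series:
    """Flag an early warning when the EWMA-smoothed deviation from baseline
    exceeds the band for `persistence` consecutive samples (filters one-off spikes)."""
    deviation = (signal - baseline).ewm(alpha=alpha).mean()
    outside = deviation.abs() > band
    # Count consecutive out-of-band samples; the count resets whenever the signal returns in band.
    consecutive = outside.groupby((~outside).cumsum()).cumsum()
    return consecutive >= persistence
```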

Enhanced Insights for Fault Identification
The importance of detection and response time in the prevention of an event cannot be overstated. But what if an incident is not easy to detect, or the fault propagates too rapidly to allow any time for action? The first level involves using machine-driven solutions for detection, such as computer vision models, which are rapidly changing the landscape. Using these models, it is now possible to improve prediction accuracy for processes that were previously unmonitored or monitored manually. The second is to integrate the combined expertise of personnel from various job functions such as technologists, operators, maintenance engineers, and supervisors; at this level of maturity, the solution can baseline against the best that current operations aim to achieve. The third, and by far the most complex, is to move more faults into the detectable and actionable realm. One such case was witnessed in a complex process in the metal smelting industry: advanced data science techniques using a digital twin amplified signal responses and analyzed multiple process parameters to predict the occurrence of an incident ahead of time. By gaining an order-of-magnitude improvement in response time, it was possible to move the process fault from an unavoidable to an avoidable and actionable category.

With the context provided above, it is possible to choose a modeling approach and customize the solutions to suit the problem landscape:

Figure: Data analytics in process system maintenance

Different approaches to Data Analytics

Domain-driven solution
First-principles and rule-based approaches are examples of domain-driven solutions. Traditional ways of delivering solutions for manufacturing often involve computationally intensive methods (such as process simulation, modeling, and optimization). In one difficult-to-model plant, deployment was done using rule engines that allow domain knowledge and experience to determine patterns and cause-effect relationships. Alarms were triggered and advisories/recommendations were sent to the concerned stakeholders on what specific actions to take each time the model identified an impending event.
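
A rule engine of this kind can be as simple as a table of condition–advisory pairs evaluated against the latest reconciled sensor values. The sketch below is a minimal illustration; the rules, tags, and thresholds are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]   # evaluated against a dict of current sensor values
    advisory: str

RULES = [
    Rule("Preheat fouling",
         lambda s: s["exchanger_dT"] < 25 and s["crude_flow"] > 400,
         "Preheat duty degrading: consider scheduling exchanger cleaning or switching to cleaner crude."),
    Rule("Pump cavitation risk",
         lambda s: s["suction_pressure"] < 1.2 and s["vibration"] > 7.0,
         "Low suction pressure with high vibration: check NPSH margin and strainer."),
]

def evaluate_rules(sensor_values: dict) -> list[str]:
    """Return advisories for every rule whose condition currently holds."""
    return [f"[{r.name}] {r.advisory}" for r in RULES if r.condition(sensor_values)]

# Example:
# advisories = evaluate_rules({"exchanger_dT": 22, "crude_flow": 450,
#                              "suction_pressure": 1.5, "vibration": 3.2})
```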

Domain-driven approaches also come in handy in the case of ‘cold start’ where solutions need to be deployed with little or no data availability. In some deployments in the mechanical domain, the first-principles approach helped identify >85% of the process faults even at the start of operations.

Pure data-driven solutions
A recent trend in the process industry is a move away from domain-driven solutions due to the challenges of finding the right skills to deploy them, computational infrastructure requirements, the need for customized maintenance solutions, and the requirement to provide real-time recommendations. Complex systems such as naphtha cracking and alumina smelting, which are hard to model, have harnessed the power of data science not just to diagnose process faults but also to enhance response time and bring more finesse to the solutions.

In some cases, domain-driven tools have provided high levels of accuracy in analyzing faults. One such case was related to compressor faults where domain data was used to classify them based on a loose bearing, defective blade, or polymer deposit in the turbine subsystems. Each of these faults was identified using sensor signatures and patterns associated with it. Besides getting to the root cause, this also helped prescribe action to move the compressor system away from anomalous operation.

These solutions need to ensure that the available data covers the full operating envelope and all relevant scenarios. The poor success of deployments using this approach is largely due to insufficient data covering plant operations and maintenance. However, the number of players offering purely data-driven solutions is large and is quickly replacing what was traditionally part of a domain engineer's playbook.

Blended solutions
Blended solutions for the maintenance of process systems combine the understanding of both data science and domain. One such project was in the real-time monitoring and preventive maintenance of >1200 HVAC units across a large geographic area. The domain rules were used to detect and diagnose faults and also identify operating scenarios to improve the reliability of the solutions. A good understanding of the domain helps in isolating multiple anomalies, reducing false positives, suggesting the right prescriptions, and more importantly, in the interpretability of the data-driven solutions.

The differentiation comes from combined intelligence: AI/ML models, domain knowledge, and knowledge of what makes deployments succeed are integrated into the model framework.

Customizing the toolkit and determining the appropriate modeling approach are critical to delivery. Given the uniqueness of each plant and problem, and the high degree of customization required, deploying solutions in a manufacturing environment is fairly challenging. This fact is validated by the limited number of solution providers serving this space. However, the complexity and nature of the landscape need to be well understood by both the client and the service provider. It is important to note that not all problems in the maintenance space are 'big data' problems requiring analysis in real time using high-frequency data. Some faults with long propagation times can use values averaged over a period of time, while systems with short response-time requirements may need real-time data. Where maintenance logs and annotations related to each event (and corrective action) are recorded, one could go with a supervised learning approach, but this is not always possible. In cases where data on faults and anomalies is not available, a one-class approach that classifies operation into normal/abnormal modes has also been used. Solution maturity improves as more data and failure modes are identified over time.
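
Where only normal-operation data exists, a one-class model can stand in for a supervised classifier until labelled failure modes accumulate. A minimal sketch, assuming a feature matrix of historian data from normal operation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def fit_one_class_model(X_normal: np.ndarray):
    """Train a one-class model on normal-operation data only."""
    model = make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", nu=0.02, gamma="scale"))
    model.fit(X_normal)
    return model

def classify_operation(model, X_live: np.ndarray) -> np.ndarray:
    """Return True where operation is classified as abnormal."""
    return model.predict(X_live) == -1
```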

A staged solution approach helps in bringing in the right level of complexity to deliver solutions that evolve over time. Needless to say, it takes a lot of experience and prowess to marry the generalized understanding with the customization that each solution demands.

Edge/IoT

A fair amount of investment needs to be made at the beginning of the project to understand the hardware and solution architecture required for successful deployment. While the security of data is a primary consideration, other factors such as computational power, cost, time, response time, open/closed-loop architecture are added considerations in determining the solution framework. Experience and knowledge help understand additional sensing requirements and sensor placement, performance enhancement through edge/cloud-based solutions, data privacy, synchronicity with other process systems, and much more.

By far the largest challenge is on the data front (sparse, scattered, unclean, disorganized, unstructured, or not digitized data), which prevents businesses from seeing quick success. Digitization and the creation of data repositories, which set the foundation for model development, take a lot of time.

There is also a multitude of control systems, specialized infrastructure, legacy systems within the same manufacturing complex that one may need to work through. End-to-end delivery with the front-end complexity in data management creates a significant entry barrier for service providers in the maintenance space.

Maintenance cuts across multiple layers of a process system. Maintenance solutions vary as one moves from a sensor to a control loop, to equipment with multiple control valves, all the way to the flowsheet/enterprise layer. Maintenance across these layers requires a deep understanding of both the hardware and process aspects, a combination that is often hard to put together. Sensors and control valves are typically maintained by those with an instrumentation background, while equipment maintenance could fall in a mechanical or chemical engineer's domain. On the other hand, process anomalies that could have a plant-level impact are often in the domain of operations/technology experts or process engineers.

Data Science facilitates the development of insights and generalizations required to build understanding around a complex topic like maintenance. It helps in the generalization and translation of learnings across layers within the process systems from sensors all the way to enterprise and other industry domains as well. It is a matter of time before analytics-driven solutions that help maintain safe and reliable operations become an integral part of plant operations and maintenance systems. We need to aim towards the successes that we witness in the medical diagnostics domain where intelligent machines are capable of detecting and diagnosing anomalies. We hope that similar analytics solutions will go a long way to keep plants safe, reduce downtime and provide the best of operations efficiencies that a sustainable world demands.

Today, the barriers to success lie in the ability to develop a clear understanding of the problem landscape, plan end-to-end, and deliver customized solutions that take into account business priorities and ROI. Achieving success at a large scale will demand reducing the level of customization required in each deployment – a constraint that only a few subject matter experts in the area are able to overcome today.
