Data Science Archives - Tiger Analytics

The Data Scientist’s Guide to Product Thinking and Why it Matters https://www.tigeranalytics.com/perspectives/blog/the-data-scientists-guide-to-product-thinking-and-why-it-matters/ Wed, 28 May 2025 14:24:13 +0000 https://www.tigeranalytics.com/?post_type=blog&p=24939 AI and data science are shifting from isolated projects to integrated, product-driven solutions that deliver real business impact. By embracing a product-thinking mindset, technologists can build scalable, adaptable AI platforms that evolve with user needs and market demands. Explore how this approach unlocks continuous innovation and drives smarter, more meaningful AI outcomes.

A recent Gartner survey found that 49% of AI initiatives have yet to demonstrate business value. The survey also highlighted that only a meager 9% of organizations are AI-mature.

If we look more closely at the AI and analytics deployment landscape, we see that these projects tend to suffer from siloed development processes, where teams work without a shared vision. This leads to disconnected solutions that may solve an isolated problem but add little of value to larger business goals. The result? The value derived ends up being short-lived.

Despite the buzz around them, point solutions are no longer the prevailing mantra. Today, the focus is on building integrated solutions to business problems that deliver lasting impact.

Data science practitioners need to evolve from providing specialized point solutions to developing comprehensive platforms and products that enhance efficiency, scalability, and usability for businesses. To achieve this, a product-thinking mindset can be a game changer in how teams operate, develop, and implement AI projects.

Having been an active part of the advanced analytics industry, we at Tiger Analytics anticipated this direction by developing strong capabilities in building robust platforms and productized solutions.

Bringing in Product Thinking to Contrast Traditional AI Deployment

When deploying AI, the typical focus is on solving immediate problems or delivering quick solutions. Product thinking takes a different approach, looking beyond the short term and focusing on creating value that lasts. Here’s what sets it apart:

  • Holistic integration: Ensures AI works smoothly with current systems and future plans.
  • Scalability and adaptability: Builds solutions that grow and change as needed.
  • Continuous improvement: Keeps refining and improving based on ongoing feedback.
  • Strategic vision: Aligns AI projects with the long-term goals of the company.

Ultimately, AI projects turn into integral components of an all-encompassing strategy.

Why Should AI and Data Science Experts Embrace Product Thinking?

At its core, product thinking focuses on the user and what they want to accomplish. Adopting this user-centric approach ensures that AI projects are not just technically sound, but also relevant and impactful in terms of greater user satisfaction and higher adoption rates. A product-thinking mindset also helps data experts align solutions more closely to business objectives and bridge the project value gap.

  • It encourages continuous improvement, adaptability, scalability and creative problem solving
  • It ensures relevance in a rapidly changing landscape
  • It shifts the focus to long-term product quality and maintainability

Organizations must therefore embed product thinking into their foundation so that they remain aligned with market shifts and are always ready to tackle new challenges as they emerge.

At Tiger Analytics, we began experimenting with various use cases, developing AI solutions with a product-thinking mindset.

Using a Product-Thinking Mindset to Enhance Planogram Compliance and Store Digitization

One of the most compelling examples of product thinking in action is our collaboration with a leading American multinational in the consumer goods sector. The client was looking to transform its approach to planogram compliance and store digitization.

Identifying the challenges

Their existing process relied heavily on manual store visits, where representatives would physically inspect store shelves to ensure they aligned with planograms. This process was error-prone, labor-intensive, and expensive.

Hence, the first phase was to establish the objective of building a scalable solution for automating the compliance audit and digitizing store operations through a mobile application.

Developing the solution

In the next phase, we developed a mobile application that could function without constant internet connectivity. The architecture was modular, enabling global features with market-specific customization.

This cross-platform solution leveraged various frameworks, standardizing the process around ONNX for model optimization. It included developing capability modules, such as tools for video analytics, smart annotation, and image stitching.

Our team also adopted an iterative development approach, continuously improving the solution based on feedback. Data challenges, like the lack of historical data for new SKUs, were addressed through few-shot learning and heuristic-based decision rules for rack identification. The result was a high-performance model that could be deployed on mobile devices.
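
To give a flavor of what packaging a model for on-device use can involve, here is a minimal sketch of exporting a PyTorch vision model to ONNX. The architecture, input size, and file name are illustrative assumptions rather than details of the actual deployment.

```python
# Minimal sketch: export a PyTorch vision model to ONNX for on-device inference.
# The architecture, input size, and output path are illustrative assumptions.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # one RGB image at 224x224

torch.onnx.export(
    model,
    dummy_input,
    "shelf_detector.onnx",       # hypothetical output file
    input_names=["image"],
    output_names=["logits"],
    opset_version=13,
)
```

The exported graph can then be further optimized and quantized with ONNX tooling before being bundled into the mobile application.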

Unlocking the impact of product thinking

Previously, store representatives had to physically visit each location, manually inspecting the shelf layouts. Audit reports were also generated by hand, which only added to the inaccuracies and costs.

With the implementation of Tiger Analytics’ automated solution, the process was streamlined.
Now, representatives can take a picture of the shelves using a mobile app designed for real-time processing. The app immediately checks the images against the planogram, automatically generating key performance indicators that guide them in executing rack adjustments.

Once the adjustments are made, the data is synced to the cloud for instant report generation and real-time monitoring.
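
The specific KPIs are not detailed here, but as a purely hypothetical illustration, one simple compliance indicator could compare the SKUs detected in a shelf photo against the SKUs the planogram expects:

```python
# Hypothetical example of a shelf-compliance KPI: share of expected SKUs that
# were actually detected on the shelf. SKU codes are invented.
def compliance_rate(expected_skus, detected_skus):
    expected, detected = set(expected_skus), set(detected_skus)
    if not expected:
        return 1.0
    return len(expected & detected) / len(expected)

planogram = ["SKU-001", "SKU-002", "SKU-003", "SKU-004"]
shelf_detections = ["SKU-001", "SKU-003", "SKU-009"]

print(f"Compliance: {compliance_rate(planogram, shelf_detections):.0%}")  # 50%
```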

Driving measurable business outcomes

This shift from a manual to a digital process reduced the time required for audits from 30-40 minutes to five to six minutes, an 85% improvement. An end-to-end workflow, spanning strategy, compliance, reporting, and analytics, automated the process across 10 different markets globally.

The accuracy of the audits also improved, rising to 90% from the previous 60%.

On the technical front, the size of the machine learning model was reduced from 25MB when deployed on the cloud to just 5MB on edge devices. Image stitching, a crucial component, improved significantly, reducing the completion time from four minutes to five seconds.

Product Thinking for Data Scientists – The Tiger Way

Establishing a culture of product thinking ensures we deliver services and solutions that are tailored to the dynamic needs of today’s market. Here’s how we are building this culture of product thinking at Tiger Analytics:

  • Integrating product thinking into every project
  • Crafting efficient and scalable solutions that solve real-world problems
  • Ensuring our cross-functional teams collaborate seamlessly, so that each solution is not only data-driven but also built on a comprehensive ecosystem that is efficient and scalable

We believe that embedding product thinking into every project aids us in creating smarter and more responsive solutions and ultimately, refined product development journeys.

Impact of Product Thinking-Led AI: The Time Is Now

Adopting a product-thinking mindset in AI projects can transform how we approach the development process. The focus has shifted from AI projects that address short-term pain points to creating solutions that get to the root of real-world problems and drive tangible business impact. Considering the speed at which AI is becoming an integral force in daily operations, a product-thinking mindset empowers AI projects to thrive – not just survive.

For instance, instead of only optimizing a predictive model’s accuracy, the goal can be expanded to consider how the model can improve user experience, such as recommending products based on customer behavior. This mindset also encourages cross-functional collaboration with larger teams like product management and engineering, leading to better-aligned solutions.

Then, instead of only crunching the numbers, AI solutions can continuously deliver insights that matter, help predict user behaviors, fine-tune features, and more. As AI systems learn and improve using the data they process, this iterative cycle of data collection, analysis, and application will strengthen a continuous feedback loop.

Thus, by focusing on building a resilient data analytics framework based on AI advancements, tomorrow’s products will be increasingly responsive, predictive, and, to the collective relief of business owners and data science practitioners, more successful.

Analytics Leadership Transformation in Response to COVID-19: Key Strategies https://www.tigeranalytics.com/perspectives/blog/analytics-leadership-transformation-in-response-to-covid-19-key-strategies/ Sat, 10 Dec 2022 18:45:14 +0000 https://www.tigeranalytics.com/blog/analytics-leadership-transformation-in-response-to-covid-19-key-strategies/ Discover how analytics leaders are redefining their strategies in response to COVID-19. Understand the utilization of cloud technologies, proactive business collaboration, and how to adapt to new uncertainties based on the evolving role of data science.

Summary
  • Analytics Leaders play a pivotal role in making their company agile, resilient and successful in the post-pandemic world.
  • Embracing cloud-based technology is key to a resilient operation.
  • Business challenges are best solved by proactively partnering with business functions to deploy the tools of data science.
  • Uncertainty is here to stay. Prepare your company to withstand the shock.
  • Focus analytical efforts on the needs of business users for maximum impact.

You’re a Business Intelligence, Data Science, or Analytics Leader, or maybe all three. The COVID-19 pandemic is having a transformative impact on business and your team can play a pivotal role in helping your company navigate these uncertain times, build resilience, and thrive. Here’s how:

Be a cloud champion

Remote work will be the norm for a long time to come. Getting your organization on the cloud ensures performance, scale, flexibility, and reliability. Data and models are always accessible, can be updated from anywhere, and can feed any system. Redundancy is built-in. A truly global analytics operation gives your company the power to identify even the tiniest changes earlier and respond faster.

Do more than innovate: co-innovate!

Data, digital technology, and analytics are among the best weapons your company has against uncertainty and rapid change. Co-innovating means finding ways for analytics to contribute to every aspect of your company’s operations and performance. Proactively seek out opportunities to solve business challenges by reaching out to business leaders.

Many commercial activities have gone online during the pandemic. They are unlikely to ever return completely to their legacy in-person state. In this environment, analytics can make substantial, impactful contributions both within your company and externally.

Monitoring algorithms customized to your company can flag cybersecurity risks that off-the-shelf software may not catch. Vast amounts of operating data gathered by the Industrial Internet of Things are already used for remote diagnostics and predicting maintenance needs. Digital twins simulate product behavior and can support extreme scenario testing. Artificial Intelligence models can significantly increase the company’s ability to detect trends and respond to them.

Co-innovating will not only harness the power of data science to solve business challenges; it will also make your company more agile and resilient.

Prepare your company for future shocks

The environmental, social, and economic changes affecting the planet all but ensure that extreme events will recur fairly frequently in the future. Traditionally, companies have planned for the ‘normal’ and retained untested contingency plans for the ‘extreme,’ but that may no longer suffice. Data gathered during COVID-19 will be of immense value in simulating extreme scenarios and preparing robust contingency plans. The protocols and procedures developed to analyze COVID-19 data can also be applied in extreme scenario planning.

Make analytical insights accessible, understandable, and actionable

As the pandemic drives rapid changes in demand patterns, supply constraints, macroeconomic conditions and geopolitics, speed and adaptability are key. Your company needs up-to-date, reliable, and easy-to-use data and analysis that accelerate decision-making and are highly sensitive to minor changes.

With a laser focus on the needs of business users struggling to cope with COVID-19-induced uncertainty, analytics should be:

  • Easy to access on any authorized device, at any time by anyone in the company who needs them.
  • Integrated seamlessly into end systems throughout the organization, where they can speed up decision-making.
  • User-friendly, with visualizations that non-data scientists can understand and interpret easily.
  • Updated frequently, preferably in real-time, as new data comes in. COVID-19 is causing circumstances to change so fast that waiting to gather ‘enough’ data can delay results and render analyses irrelevant.
  • Automated to be fast and cost-effective to deploy. Keeping operating expenses under control will be a key factor as companies struggle to recover from the loss of business. While human supervision is required, manual model maintenance is a luxury your company can no longer afford.
  • Highly sensitive, so as to detect minute anomalies in data that may indicate changed behavior patterns, opportunities, or risks as the pandemic situation evolves.
  • Relevant to the challenges faced by business functions and teams in improving your company’s operating efficiency, such as supply chain optimization, inventory turns, or capacity utilization.
  • Granular at a level where the organization can take actions directly derived from their results and tied directly to business metrics such as margins, costs, and revenue.
  • Stress-tested to be able to handle extreme scenarios and support contingency planning, not just the ‘normal’ case. Post-pandemic, there may no longer be a ‘normal’ case. Extreme scenarios are likely to arise far more frequently due to macroeconomic and geopolitical uncertainty.

COVID-19 has accelerated the pace of business significantly. Data science, partnered with digital technology, can be the lifeline that companies need to navigate through the uncertainty. As Analytics Leaders, you play a pivotal role in ensuring that your organization captures every opportunity, mitigates every risk, and thrives in the challenging post-pandemic environment.

Cracking the Code of Polymorphism in Organic Crystals: A Breakthrough in Pharmaceutical Research https://www.tigeranalytics.com/perspectives/blog/cracking-the-code-of-polymorphism-in-organic-crystals-a-breakthrough-in-pharmaceutical-research/ Wed, 20 Jul 2022 14:53:07 +0000 https://www.tigeranalytics.com/?p=8860 Explore the intricate world of organic crystal polymorphism in drug development as researchers decode surface and bulk nucleation rates, unveiling the key to preventing unwanted crystallization. A breakthrough in understanding molecular packing dynamics sheds light on preventing drug failures such as Abbott Laboratory’s Norvir® recall.

The pharmaceutical industry splurges billions of dollars on R&D year after year to develop drugs and vaccines that improve our health and well-being. Despite the investments in R&D and the various checks and balances in place to develop stable and safe drugs, the industry occasionally runs into instances like that of Abbott Laboratories’ Norvir® in 1998, when Abbott’s small molecule capsules were recalled from the market due to failed dissolution tests [1]. The core problem behind the failure was a poor understanding of the polymorphism of organic crystals.

The active pharmaceutical ingredients (APIs) of the drugs that patients take are organic molecules that are small (e.g., molecular weight <500 Daltons) and structurally flexible. Under various manufacturing/storage conditions such as temperature and humidity, these molecules are packed differently and form different crystal structures, leading to a phenomenon called polymorphism. One such extraordinary example is ROY (Figure 1)—the same molecule has been discovered to be able to form as many as 11 different crystals with markedly different shapes and colors, holding the current record for the largest number of fully characterized organic crystal polymorphs [2,3].

Figure 1: An extended family of ROY – colors, morphologies, and melting points

To advance the understanding of polymorphism in organic crystals, a team of scientists from the University of Wisconsin at Madison (the same lab that discovered ROY), AbbVie, and Tiger Analytics spearheaded a study of the nucleation/growth behavior of D-arabitol, a sugar alcohol, both in the bulk and at the liquid/vapor interface. With creative experimental design and meticulous execution, the researchers observed the following intriguing phenomena:

(1) The surface nucleation rate is 12 orders of magnitude faster than its bulk counterpart.
(2) The surface crystal has a different structure from its bulk counterpart.
(3) The higher the temperature, the faster the surface nucleation.

Corroborated by Molecular Dynamics simulation, the researchers could unambiguously relate the polymorphism to the molecular packing in various environments (i.e., surface vs. bulk). The freedom at the liquid/vapor interface enables the molecules to break loose from the rigid 3-dimensional hydrogen-bond network (typically observed in the bulk) and form a 2-dimensional layered structure (Figure 2).

Figure 2: Different nucleation pathways lead to distinct crystal structures/molecular packings.

The current work pioneered the direct measurement of surface/bulk nucleation rates of the same organic molecule (the first of its kind), elucidated the mechanism of polymorphism supported by both experiment and simulation, and offered practical solutions to prevent organic small molecule drugs from crystallizing into unwanted polymorphs. The research work has been published in the prestigious Journal of the American Chemical Society [4]. You can access the full paper here.

Sources:

1. John Bauer, Stephen Spanton, Rodger Henry, John Quick, Walter Dziki, William Porter, and John Morris, Ritonavir: An Extraordinary Example of Conformational Polymorphism, Pharmaceutical Research, 2001, 18, 6, 859-866.
2. Lian Yu, Polymorphism in Molecular Solids: An Extraordinary System of Red, Orange, and Yellow Crystals, Acc. Chem. Res. 2010, 43, 9, 1257–1266.
3. Bernardo A. Nogueira, Chiara Castiglioni and Rui Fausto, Color polymorphism in organic crystals, Communications Chemistry 2020, 3, 34.
4. Xin Yao, Qitong Liu, Bu Wang, Junguang Yu, Michael M. Aristov, Chenyang Shi, Geoff G. Z. Zhang, and Lian Yu, Anisotropic Molecular Organization at a Liquid/Vapor Interface Promotes Crystal Nucleation with Polymorph Selection, J. Am. Chem. Soc. 2022, 144, 26, 11638–11645.

Data-Driven Disruption? How Analytics is Shifting Gears in the Auto Market https://www.tigeranalytics.com/perspectives/blog/data-analytics-led-disruption-boon-automotive-market/ Thu, 24 Mar 2022 12:43:31 +0000 https://www.tigeranalytics.com/?p=7314 The presence of legacy systems, regulatory compliance issues and sudden growth of the BEV/PHEV market are all challenges the automotive industry must face. Explore how Analytics can help future-proof their growth plans.

In an age when data dictates decision-making, from cubicles to boardrooms, many auto dealers worldwide continue to draw insights from past experiences. However, the automotive market is ripe with opportunities to leverage data science to improve operational efficiency, workforce productivity, and consequently – customer loyalty.

Data challenges faced by automotive dealers

There are many reasons why auto dealers still struggle to collect and use data. The biggest one is the presence of legacy systems that bring entangled processes with disparate data touchpoints. This makes it difficult to consolidate information and extract clean, structured data – especially when there are multiple repositories. More importantly, they are unable to derive and harness actionable insights to improve their decision-making capabilities, and instead rely merely on gut instinct.

In addition, the sudden growth of the BEV/PHEV market has proven to complicate matters – with increasing pressure on regulatory compliance.

But the reality is that future-ready data management is a must-have strategy – not just to thrive but even to survive today’s automotive market. The OEMs are applying market pressure on one side of the spectrum – expecting more cost-effective vehicle pricing models to establish footprints in smaller or hyper-competitive markets. On the other side, modern customers are making it abundantly clear that they will no longer tolerate broken, inefficient, or repetitive experiences. And if you have brands operating in different parts of the world, data management can be a nightmarishly time-consuming and complex journey.

Future-proofing the data management strategy

Now, it’s easier said than done for the automotive players to go all-in on adopting a company-wide data mindset. It is pertinent to create an incremental data-driven approach to digital transformation that looks to modernize in phases. Walking away from legacy systems with entangled databases means that you must be assured of hassle-free deployment and scalability. It can greatly help to prioritize which markets/OEMs/geographies you want to target first, with data science by your side.

Hence, the initial step is to assess the current gaps and challenges to have a clear picture of what needs to be fixed on priority and where to go from thereon. Another key step in the early phase should be to bring in the right skill sets to build a future-proofed infrastructure and start streamlining the overall flow of data.

It is also important to establish a CoE model to globalize data management from day zero. In the process, a scalable data pipeline should be built to consolidate information from all touchpoints across all markets and geographies. This is a practical way to ensure that you have an integrated source of truth that churns out actionable insights based on clean data.

You also need to create a roadmap so that key use cases can be detected with specific markets identified for initial deployment. But first, you must be aware of the measurable benefits that can be unlocked by tapping into the power of data.

  • Better lead scoring: Identify the leads most likely to purchase a vehicle and ensure targeted messaging.
  • Smarter churn prediction: Identify aftersales customers with high churn propensity and send tactical offers.
  • Accurate demand forecasting: Reduce inventory days, avoid out-of-stock items, and minimize promotional costs.
  • After-sales engagement: Engage customers even after the initial servicing warranty is over regarding repairs, upgrades, etc., as well as an effective parts pricing strategy.
  • Sales promo assessment: Analyze historical sales data, seasonality/trends, competitors, etc., to recommend the best-fit promo.
  • Personalized customer engagement: Customize interactions with customers based on data-rich actionable intelligence instead of unreliable human instincts.

How we helped Inchcape disrupt the automotive industry

When Tiger Analytics began the journey with Inchcape, a leading global automotive distributor, we knew that it was going to disrupt how the industry tapped into data. Fast-forward to a year later, we were thrilled to recently take home Microsoft’s ‘Partner of the Year 2021’ award in the Data & AI category. What started as a small-scale project grew into one of the largest APAC-based AI and Advanced Analytics projects. We believe that this project has been a milestone moment for the automotive industry at large. If you’re interested in finding out how our approach raised the bar in a market notorious for low data adoption, please read our full case study.

Data Science Strategies for Effective Process System Maintenance https://www.tigeranalytics.com/perspectives/blog/harness-power-data-science-maintenance-process-systems/ Mon, 20 Dec 2021 16:42:57 +0000 https://www.tigeranalytics.com/?p=6846 Industry understanding of managing planned maintenance is fairly mature. This article focuses on how Data Science can impact unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and subsystems.

Data Science applications are gaining significant traction in the preventive and predictive maintenance of process systems across industries. A clear mindset shift has made it possible to steer maintenance from using a ‘reactive’ (using a run-to-failure approach) to one that is proactive and preventive in nature.

Planned or scheduled maintenance uses data and experiential knowledge to determine the periodicity of servicing required to maintain the plant components’ good health. These are typically driven by plant maintenance teams or OEMs through maintenance rosters and AMCs. Unplanned maintenance, on the other hand, occurs at random, impacts downtime/production, safety, inventory, customer sentiment besides adding to the cost of maintenance (including labor and material).

Interestingly, statistics reveal that almost 50% of the scheduled maintenance projects are unnecessary and almost a third of them are improperly carried out. Poor maintenance strategies are known to cost organizations as much as 20% of their production capacity – shaving off the benefits that a move from reactive to preventive maintenance approach would provide. Despite years of expertise available in managing maintenance activities, unplanned downtime impacts almost 82% of businesses at least once every three years. Given the significant impact on production capacity, aggregated annual downtime costs for the manufacturing sector are upwards of $50 billion (WSJ) with average hourly costs of unplanned maintenance in the range of $250K.

It is against this backdrop that data-driven solutions need to be developed and deployed. Can Data Science solutions bring about significant improvement in the maintenance domain and prevent any or all of the above costs? Are the solutions scalable? Do they provide an understanding of what went wrong? Can they provide insights into alternative and improved ways to manage planned maintenance activities? Does Data Science help reduce all types of unplanned events or just a select few? These are questions that manufacturers need answered, and it is for experts from both the maintenance and data science domains to address them.

Industry understanding of managing planned maintenance is fairly mature. This article therefore focuses on unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and subsystems.

Data Science solutions are accelerating the industry’s move towards ‘on-demand’ maintenance wherein interventions are made only if and when required. Rather than follow a fixed maintenance schedule, data science tools can now aid plants to increase run lengths between maintenance cycles in addition to improving plant safety and reliability. Besides the direct benefits that result in reduced unplanned downtime and cost of maintenance, operating equipment at higher levels of efficiency improves the overall economics of operation.

The success of this approach was demonstrated in refinery CDU preheat trains that use soft sensing triggers to decide when to process ‘clean crude’ (to mitigate the fouling impact) or schedule maintenance of fouled exchangers. Other successes were in the deployment of plant-wide maintenance of control valves, multiple-effect evaporators in plugging service, compressors in petrochemical service, and a geo-wide network of HVAC systems.

Instead of using a fixed roster for maintenance of PID control valves, plants can now detect and diagnose control valves that are malfunctioning. Additionally, in combination with domain and operations information, it can be used to suggest prescriptive actions such as auto-tuning of the valves, which improve maintenance and operations metrics.

Reducing unplanned, unavoidable events

It is important to bear in mind that not all unplanned events are avoidable. The inability to avoid events could be either because they are not detectable enough or because they are not actionable. The latter could occur either because the response time available is too low or because the knowledge to revert a system to its normal state does not exist. A large number of unplanned events, however, are avoidable, and data science tools improve the accuracy of their detection and prevention.

The focus of the experts working in this domain is to reduce unplanned events and transition events from unavoidable to avoidable. Using advanced tools for detection, diagnosis, and enabling timely actions to be taken, companies have managed to reduce their downtime costs significantly. The diversity of solutions that are available in the maintenance area covers both plant and process subsystems.

Some of the data science techniques deployed in the maintenance domain are briefly described below:

Condition Monitoring
This has been used to monitor and analyze process systems over time and predict the occurrence of an anomaly. These events or anomalies could have short or long propagation times, such as those seen in exchanger fouling or pump cavitation. The spectrum of solutions in this area includes real-time/offline modes of analysis, edge/IoT devices, open/closed loop prescriptions, and more. In some cases, monitoring also involves the use of soft sensors to detect fouling, surface roughness, or hardness – these parameters cannot be measured directly using a sensor and therefore need surrogate measuring techniques.

Perhaps one of the most unique challenges working in the manufacturing domain is in the use of data reconciliation. Sensor data tend to be spurious and prone to operational fluctuations, drift, biases, and other errors. Using raw sensor information is unlikely to satisfy the material and energy balance for process units. Data reconciliation uses a first-principles understanding of the process systems and assigns a ‘true value’ to each sensor. These revised sensor values allow a more rigorous approach to condition monitoring, which would otherwise expose process systems to greater risk when using raw sensor information. Sensor validation, a technique to analyze individual sensors in tandem with data reconciliation, is critical to setting a strong foundation for any analytics models to be deployed. These elaborate areas of work ensure a greater degree of success when deploying any solution that involves the use of sensor data.
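
To make the idea concrete, the sketch below shows the standard linear, weighted least-squares form of data reconciliation applied to three flow sensors and a single mass balance. The flow values and sensor uncertainties are invented for illustration.

```python
# Sketch of linear data reconciliation: adjust raw flow readings (within their
# assumed uncertainties) so that the mass balance feed = product + purge holds.
import numpy as np

x_meas = np.array([100.3, 64.8, 33.2])   # measured flows, t/h (illustrative)
sigma = np.array([2.0, 1.5, 1.0])        # assumed sensor standard deviations
V = np.diag(sigma ** 2)                  # measurement covariance

A = np.array([[1.0, -1.0, -1.0]])        # constraint: feed - product - purge = 0

# Closed-form solution of: minimize (x - x_meas)' V^-1 (x - x_meas) subject to A x = 0
correction = V @ A.T @ np.linalg.inv(A @ V @ A.T) @ (A @ x_meas)
x_recon = x_meas - correction

print("raw imbalance       :", (A @ x_meas).item())             # 2.3 t/h
print("reconciled flows    :", np.round(x_recon, 2))
print("reconciled imbalance:", round((A @ x_recon).item(), 6))  # ~0
```

The reconciled values, rather than the raw readings, then feed the condition-monitoring models, which reduces the risk of reacting to sensor drift instead of genuine process change.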

Fault Detection
This is a mature area of work and uses solutions ranging from those that are driven entirely by domain knowledge, such as pump curves and detection of anomalies thereof, to those that rely only on historical sensor/maintenance/operations data for analysis. An anomaly or fault is defined as a deviation from ‘acceptable’ operation but the context and definitions need to be clearly understood when working with different clients. Faults may be related to equipment, quality, plant systems, or operability. A good business context and understanding of client requirements are necessary for the design and deployment of the right techniques. From basic tools that use sensor thresholds, run charts, and more advanced techniques such as classification, pattern analysis, regression, a wide range of solutions can be successfully deployed.

Early Warning Systems
The detection of process anomalies in advance helps in the proactive management of abnormal events. Improving actionability or response time allows faults to be addressed before setpoints/interlocks are triggered. The methodology varies across projects and there is no ‘one-size-fits-all’ approach. Problem complexity could range from using single sensor information as lead indicators (such as using sustained pressure loss in a vessel to identify a faulty gasket that might rupture) to far more complex methods of analysis.

A typical challenge in developing early warning systems is achieving 100% detectability of anomalies, but an even larger challenge is filtering out false indications of anomalies. Detecting 100% of the anomalies and robust filtering techniques are critical factors to consider for successful deployment.
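
One simple way to suppress false indications, offered here as an illustrative assumption rather than the exact filtering used in the deployments above, is a persistence rule that raises a warning only when the anomaly score stays beyond its threshold for several consecutive readings.

```python
# Sketch of a persistence filter for an early warning system: alert only when the
# anomaly score exceeds the threshold for `patience` consecutive samples.
def early_warnings(scores, threshold=0.8, patience=3):
    alerts, streak = [], 0
    for t, s in enumerate(scores):
        streak = streak + 1 if s > threshold else 0
        if streak == patience:           # the sample at which the alert fires
            alerts.append(t)
    return alerts

scores = [0.2, 0.9, 0.4, 0.85, 0.9, 0.95, 0.3, 0.82]
print(early_warnings(scores))            # [5]: three consecutive exceedances
```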

Enhanced Insights for Fault Identification
The importance of detection and response time in the prevention of an event cannot be overstated. But what if an incident is not easy to detect, or the propagation of the fault is too rapid to allow us any time for action? The first level involves using machine-driven solutions for detection, such as computer vision models, which are rapidly changing the landscape. Using these models, it is now possible to improve prediction accuracies of processes that were either not monitored or monitored manually. The second is to integrate the combined expertise of personnel from various job functions such as technologists, operators, maintenance engineers, and supervisors. At this level of maturity, the solution is able to baseline with the best that current operations aim to achieve. The third, and by far the most complex, is to move more faults into the ‘detectable’ and actionable realm. One such case was witnessed in a complex process from the metal smelting industry. Advanced data science techniques using a digital twin amplified signal responses and analyzed multiple process parameters to predict the occurrence of an incident ahead of time. By gaining an order-of-magnitude improvement in response time, it was possible to move the process fault from an unavoidable to an avoidable and actionable category.

With the context provided above, it is possible to choose a modeling approach and customize the solutions to suit the problem landscape:

Figure: Data analytics in process system maintenance

Different approaches to Data Analytics

Domain-driven solution
First-principles and the rule-based approach is an example of a domain-driven solution. Traditional ways of delivering solutions for manufacturing often involve computationally intensive solutions (such as process simulation, modeling, and optimization). In one of the difficult-to-model plants, deployment was done using rule engines that allow domain knowledge and experience to determine patterns and cause-effect relationships. Alarms were triggered and advisories/recommendations were sent to the concerned stakeholders regarding what specific actions to undertake each time the model identified an impending event.
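
As a minimal sketch of the rule-engine pattern (the actual rules, tags, and thresholds of that deployment are not disclosed, so everything below is invented), each rule pairs a condition on the latest readings with an advisory:

```python
# Minimal rule-engine sketch: domain rules map sensor conditions to advisories.
# Tags, thresholds, and advisories are invented for illustration.
RULES = [
    {"name": "high discharge temperature",
     "condition": lambda r: r["discharge_temp_C"] > 95,
     "advisory": "Check cooling water flow and exchanger fouling."},
    {"name": "low suction pressure",
     "condition": lambda r: r["suction_pressure_bar"] < 1.2,
     "advisory": "Verify upstream strainer and feed availability."},
]

def evaluate(reading):
    return [(rule["name"], rule["advisory"]) for rule in RULES if rule["condition"](reading)]

reading = {"discharge_temp_C": 98.4, "suction_pressure_bar": 1.5}
for name, advisory in evaluate(reading):
    print(f"ALERT [{name}]: {advisory}")
```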

Domain-driven approaches also come in handy in the case of ‘cold start’ where solutions need to be deployed with little or no data availability. In some deployments in the mechanical domain, the first-principles approach helped identify >85% of the process faults even at the start of operations.

Pure data-driven solutions
A recent trend seen in the process industry is the move away from domain-driven solutions due to challenges in finding the right skills to deploy them, computational infrastructure requirements, the need for customized maintenance solutions, and the requirement to provide real-time recommendations. Complex systems such as naphtha cracking and alumina smelting, which are hard to model, have harnessed the power of data science not just to diagnose process faults but also to enhance response time and bring more finesse to the solutions.

In some cases, domain-driven tools have provided high levels of accuracy in analyzing faults. One such case was related to compressor faults where domain data was used to classify them based on a loose bearing, defective blade, or polymer deposit in the turbine subsystems. Each of these faults was identified using sensor signatures and patterns associated with it. Besides getting to the root cause, this also helped prescribe action to move the compressor system away from anomalous operation.

These solutions need to ensure that the operating envelope and data availability cover all possible scenarios. The poor success of deployments using this approach is largely due to insufficient data covering plant operations and maintenance. However, the number of players offering purely data-driven solutions is large, and such solutions are quickly replacing what was traditionally part of a domain engineer’s playbook.

Blended solutions
Blended solutions for the maintenance of process systems combine the understanding of both data science and domain. One such project was in the real-time monitoring and preventive maintenance of >1200 HVAC units across a large geographic area. The domain rules were used to detect and diagnose faults and also identify operating scenarios to improve the reliability of the solutions. A good understanding of the domain helps in isolating multiple anomalies, reducing false positives, suggesting the right prescriptions, and more importantly, in the interpretability of the data-driven solutions.

The differentiation comes from combining the intelligence of AI/ML models, domain knowledge, and experience of what makes deployments succeed into a single model framework.

Customizing the toolkit and determining the appropriate modeling approach are critical to delivery. Given the uniqueness of each plant and problem, and the requirement for a high degree of customization, deploying solutions in a manufacturing environment is fairly challenging. This fact is validated by the limited number of solution providers serving this space. However, the complexity and nature of the landscape need to be well understood by both the client and the service provider. It is important to note that not all problems in the maintenance space are ‘big data’ problems requiring analysis in real-time using high-frequency data. Some faults with long propagation times can use values averaged over a period of time, while other systems with short response time requirements may require real-time data. Where maintenance logs and annotations related to each event (and corrective action) are recorded, one could go with a supervised learning approach, but this is not always possible. In cases where data on faults and anomalies are not available, a one-class approach to classify the operation into normal/abnormal modes has also been used. Solution maturity improves with more data and failure modes identified over time.
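
Where only healthy operating data is available, the one-class approach mentioned above can be sketched as follows; the sensor features, contamination level, and library choice (scikit-learn) are assumptions made for illustration.

```python
# Sketch of a one-class approach: learn the envelope of normal operation and
# flag departures from it. Feature values and parameters are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_normal = rng.normal(loc=[60.0, 5.0], scale=[2.0, 0.3], size=(500, 2))  # temperature, vibration
X_new = np.array([[61.0, 5.1],      # within the normal envelope
                  [75.0, 9.0]])     # far outside it

model = IsolationForest(contamination=0.01, random_state=0).fit(X_normal)
print(model.predict(X_new))         # 1 = normal, -1 = anomalous
```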

A staged solution approach helps in bringing in the right level of complexity to deliver solutions that evolve over time. Needless to say, it takes a lot of experience and prowess to marry the generalized understanding with the customization that each solution demands.

Edge/IoT

A fair amount of investment needs to be made at the beginning of the project to understand the hardware and solution architecture required for successful deployment. While the security of data is a primary consideration, other factors such as computational power, cost, time, response time, and open/closed-loop architecture also shape the solution framework. Experience and knowledge help in understanding additional sensing requirements and sensor placement, performance enhancement through edge/cloud-based solutions, data privacy, synchronicity with other process systems, and much more.

By far, the largest challenge is witnessed on the data front (sparse, scattered, unclean, disorganized, unstructured, not digitized, and so on), which prevents businesses from seeing quick success. Digitization and creating data repositories, which set the foundation for model development, take a lot of time.

There is also a multitude of control systems, specialized infrastructure, and legacy systems within the same manufacturing complex that one may need to work through. End-to-end delivery with the front-end complexity in data management creates a significant entry barrier for service providers in the maintenance space.

Maintenance cuts across multiple layers of a process system. The maintenance solutions vary as one moves from a sensor to a control loop, to equipment with multiple control valves, all the way to a flowsheet/enterprise layer. Maintenance across these layers requires a deep understanding of both the hardware and process aspects, a combination that is often hard to put together. Sensors and control valves are typically maintained by those with an Instrumentation background, while equipment maintenance could fall in a mechanical or chemical engineer’s domain. On the other hand, process anomalies that could have a plant-level impact are often in the domain of operations/technology experts or process engineers.

Data Science facilitates the development of insights and generalizations required to build understanding around a complex topic like maintenance. It helps in the generalization and translation of learnings across layers within the process systems from sensors all the way to enterprise and other industry domains as well. It is a matter of time before analytics-driven solutions that help maintain safe and reliable operations become an integral part of plant operations and maintenance systems. We need to aim towards the successes that we witness in the medical diagnostics domain where intelligent machines are capable of detecting and diagnosing anomalies. We hope that similar analytics solutions will go a long way to keep plants safe, reduce downtime and provide the best of operations efficiencies that a sustainable world demands.

Today, the barriers to success lie in the ability to develop a clear understanding of the problem landscape, plan end-to-end, and deliver customized solutions that take into account business priorities and ROI. Achieving success at a large scale will demand reducing the level of customization required in each deployment – a constraint that is overcome by few subject matter experts in the area today.

Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins https://www.tigeranalytics.com/perspectives/blog/ml-powered-digital-twin-predictive-maintenance/ Thu, 24 Dec 2020 18:19:09 +0000 https://www.tigeranalytics.com/?p=4867 Tiger Analytics leverages ML-powered digital twins for predictive maintenance in manufacturing. By integrating sensor data and other inputs, we enable anomaly detection, forecasting, and operational insights. Our modular approach ensures scalability and self-sustainability, yielding cost-effective and efficient solutions.

Historically, manufacturing equipment maintenance has been done during scheduled service downtime. This involves periodically stopping production for carrying out routine inspections, maintenance, and repairs. Unexpected equipment breakdowns disrupt the production schedule; require expensive part replacements, and delay the resumption of operations due to long procurement lead times.

Sensors that measure and record operational parameters (temperature, pressure, vibration, RPM, etc.) have been affixed on machinery at manufacturing plants for several years. Traditionally, the data generated by these sensors was compiled, cleaned, and analyzed manually to determine failure rates and create maintenance schedules. But every equipment downtime for maintenance, whether planned or unplanned, is a source of lost revenue and increased cost. The manual process was time-consuming, tedious, and hard to handle as the volume of data rose.

The ability to predict the likelihood of a breakdown can help manufacturers take pre-emptive action to minimize downtime, keep production on track, and control maintenance spending. Recognizing this, companies are increasingly building both reactive and predictive computer-based models based on sensor data. The challenge these models face is the lack of a standard framework for creating and selecting the right one. Model effectiveness largely depends on the skill of the data scientist. Each model must be built separately; model selection is constrained by time and resources, and models must be updated regularly with fresh data to sustain their predictive value.

As more equipment types come under the analytical ambit, this approach becomes prohibitively expensive. Further, the sensor data is not always leveraged to its full potential to detect anomalies or provide early warnings about impending breakdowns.

In the last decade, the Industrial Internet of Things (IIoT) has revolutionized predictive maintenance. Sensors record operational data in real-time and transmit it to a cloud database. This dataset feeds a digital twin, a computer-generated model that mirrors the physical operation of each machine. The concept of the digital twin has enabled manufacturing companies not only to plan maintenance but to get early warnings of the likelihood of a breakdown, pinpoint the cause, and run scenario analyses in which operational parameters can be varied at will to understand their impact on equipment performance.

Several eminent ‘brand’ products exist to create these digital twins, but the software is often challenging to customize, cannot always accommodate the specific needs of each and every manufacturing environment, and significantly increases the total cost of ownership.

ML-powered digital twins can address these issues when they are purpose-built to suit each company’s specific situation. They are affordable, scalable, self-sustaining, and, with the right user interface, are extremely useful in telling machine operators the exact condition of the equipment under their care. Before embarking on the journey of leveraging ML-powered digital twins, certain critical steps must be taken:

1. Creation of an inventory of the available equipment, associated sensors and data.

2. Analysis of the inventory in consultation with plant operations teams to identify the gaps. Typical issues may include missing or insufficient data from the sensors; machinery that lacks sensors; and sensors that do not correctly or regularly send data to the database.

3. Coordination between the manufacturing operations and analytics/technology teams to address some gaps: installing sensors if lacking (‘sensorization’); ensuring that sensor readings can be and are being sent to the cloud database; and developing contingency approaches for situations in which no data is generated (e.g., equipment idle time).

4. A second readiness assessment, followed by a data quality assessment, must be performed to ensure that a strong foundation of data exists for solution development.

This creates the basis for a cloud-based, ML-powered digital twin solution for predictive maintenance. To deliver the most value, such a solution should:

  • Use sensor data in combination with other data as necessary
  • Perform root cause analyses of past breakdowns to inform predictions and risk assessments
  • Alert operators of operational anomalies
  • Provide early warnings of impending failures
  • Generate forecasts of the likely operational situation
  • Be demonstrably effective to encourage its adoption and extensive utilization
  • Be simple for operators to use, navigate and understand
  • Be flexible to fit the specific needs of the machines being managed

Figure: The predictive maintenance cycle

When model-building begins, the first step is to account for the input data frequency. As sensors take readings at short intervals, timestamps must be regularized and the data resampled for all connected parameters where required. At this time, data with very low variance or too few observations may be excised. Model data sets containing sensor readings (the predictors) and event data such as failures and stoppages (the outcomes) are then created for each machine using both dependent and independent variable formats.
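
As a minimal sketch of that regularization step, assuming the raw readings arrive on an irregular cadence and using pandas resampling (column names, frequencies, and values are illustrative):

```python
# Sketch: align irregular sensor readings to a 1-minute grid and bridge short
# gaps, producing a model-ready time series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Successive readings arrive 20-90 seconds apart (invented for illustration)
timestamps = pd.Timestamp("2020-06-01") + pd.to_timedelta(
    np.cumsum(rng.integers(20, 90, size=40)), unit="s"
)
raw = pd.DataFrame({"temperature": rng.normal(80, 2, size=40)}, index=timestamps)

regular = (
    raw.resample("1min").mean()   # one averaged reading per minute
       .interpolate(limit=3)      # fill short gaps only
)
print(regular.head())
```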

To select the right model for anomaly detection, multiple models are tested and scored on the full data set and validated against history. To generate a short-term forecast, gaps related to machine testing or idle time must be accounted for, and a range of models evaluated to determine which one performs best.

Tiger Analytics used a similar approach when building these predictive maintenance systems for an Indian multinational steel manufacturer. Here, we found that regression was the best approach to flag anomalies. For forecasting, the accuracy of Random Forest models was higher compared to ARIMA, ARIMAX, and exponential smoothing.
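
The exact models are not published here; a regression-based anomaly flag of the kind referred to above can be sketched as predicting one sensor from its peers and flagging readings whose residual is unusually large.

```python
# Sketch of regression-based anomaly flagging: model a target sensor from related
# sensors on healthy data, then flag points with residuals beyond 3 sigma.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))                                  # e.g. load, speed, ambient temperature
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)  # proxy for bearing temperature

model = LinearRegression().fit(X, y)
threshold = 3 * (y - model.predict(X)).std()

X_new = rng.normal(size=(5, 3))
y_new = 2.0 * X_new[:, 0] - X_new[:, 1] + rng.normal(scale=0.1, size=5)
y_new[2] += 4.0                                                # inject a fault-like deviation

flags = np.abs(y_new - model.predict(X_new)) > threshold
print(flags)                                                   # the injected point stands out
```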

Figure: Predictive maintenance analysis flow

Using a modular paradigm to build an ML-powered digital twin makes it straightforward to implement and deploy. It does not require frequent manual recalibration to remain self-sustaining, and it is scalable, so it can be implemented across a wide range of equipment with minimal additional effort and time.

Careful execution of the preparatory actions is as important as strong model-building to the success of this approach and its long-term viability. The sustainable way to address the challenge of low-cost, high-efficiency predictive maintenance in the manufacturing sector is a combination of technology, business intelligence, data science, user-centric design, and the operational expertise of the manufacturing employees.

This article was first published in Analytics India Magazine.

Enhancing Mental Healthcare: Machine Learning’s Role in Clinical Trials https://www.tigeranalytics.com/perspectives/blog/enhancing-mental-healthcare-machine-learnings-role-in-clinical-trials/ Thu, 29 Oct 2020 23:59:56 +0000 https://www.tigeranalytics.com/blog/enhancing-mental-healthcare-machine-learnings-role-in-clinical-trials/ Unveil the pivotal role of machine learning in revolutionizing mental health care through advanced clinical trials. Discover how innovative AI solutions, like speech analytics, enhance the evaluation of mental health treatments, contributing to more accurate and efficient healthcare outcomes.

World Mental Health Day on 10th October casts a long-overdue spotlight on one of the most neglected areas of public health. Nearly a billion people have a mental disorder, and a suicide occurs every 40 seconds. In developing countries, under 25% of people with mental, substance use, or neurological disorders receive treatment [1]. COVID-19 has worsened the crisis; with healthcare services disrupted, the hidden pandemic of mental ill-health remains largely unaddressed.

In this article, we share some perspectives on the role ML can play and an example of a real-life AI solution we built at Tiger Analytics to address a specific mental-health-related problem.

ML is already a Part of Physical Healthcare

Algorithms process Magnetic Resonance Imaging (MRI) scans. Clinical notes are parsed to pinpoint the onset of illnesses earlier than physicians can discern them. Cardiovascular disease and diabetes — two of the leading causes of death worldwide — are diagnosed using neural networks, decision trees, and support vector machines. Clinical trials are monitored and assessed remotely to maintain physical distancing protocols.

These are ‘invasive’ approaches with the objective of automating what can be — and usually is — done by humans, but at speed and scale. In the field of mental health, ML can be applied in non-invasive, more humanistic ways that nudge physicians towards better treatment strategies.

Clinical Trials of Mental Health Drugs

In clinical trials of mental health drugs, physicians and patients engage in detailed discussions of the patients’ mental state at each treatment stage. The efficacy of these drugs is determined using a combination of certain biomarkers, body vitals, and the mental state as assessed through the patient’s interaction with the physician.

The problem with the above approach is that an important input in determining drug efficacy is the responses of a person who has been going through mental health issues. To avoid errors, these interviews/interactions are recorded, and multiple experts listen to the long recordings to evaluate the quality of the interview and the conclusions made.

Two concerns arise: first, time and budget allow only a sample of interviews to be evaluated, which means there is an increased risk of fallacious conclusions regarding drug efficacy; and second, patients may not express all they are feeling in words. A multitude of emotions may be missed or misinterpreted, generating incorrect evaluation scores.

The Problem that Tiger Team Tackled

Working with a pharmaceutical company, Tiger Analytics used speech analytics to identify ‘good’ interviews, i.e., ones that meet quality standards for inclusion in clinical trials, minimizing the number of interviews that were excluded after evaluation, and saving time and expense.

As a data scientist, the typical challenges you face when working on a problem such as this are: What types of signal processing can you use to extract audio features? What non-audio features would be useful? How do you remove background noise in the interviews? How do you look for patterns in language? How do you solve for reviewers’ biases, inevitable in subjective events like interviews?

Below we walk you through the process the Tiger Analytics team used to develop the solution.


Step 1: Pre-processing

We removed background noise from the digital audio files and split them into alternating sections of speech and silence. We grouped the speech sections into clusters, each cluster representing one speaker. We created a full transcript of the interview to enable language processing.
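
A stripped-down sketch of the speech/silence split, assuming the librosa library (the file name and decibel threshold are placeholders; noise removal and speaker clustering are omitted here):

```python
# Sketch of the speech/silence split: load an interview recording and list the
# non-silent segments. File name and top_db threshold are placeholders.
import librosa

y, sr = librosa.load("interview_001.wav", sr=16000)      # hypothetical recording
speech_intervals = librosa.effects.split(y, top_db=30)   # [start, end] sample indices

for start, end in speech_intervals[:5]:
    print(f"speech from {start / sr:.1f}s to {end / sr:.1f}s")
```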

Step 2: Feature extraction

We extracted several hundred features of the audio, from direct aspects like interview duration and voice amplitude to the more abstract speech rates, frequency-wise energy content, and Mel-frequency cepstral coefficients (MFCCs). We used NLP to extract several features from the interview transcript. These captured the unique personal characteristics of individual speakers.

Beyond this, we captured features such as interview length, tone of the interviewer, any gender-related patterns, interview load on the physician, time of the day, and many more features.
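
To make this concrete, here is a minimal sketch of how a few such audio features could be computed with librosa; the feature set shown is illustrative and not the project’s actual feature list.

import numpy as np
import librosa

def extract_audio_features(audio: np.ndarray, sr: int) -> dict:
    """Compute a few illustrative audio features for one interview."""
    duration = len(audio) / sr                               # interview duration in seconds
    rms = librosa.feature.rms(y=audio)                       # frame-wise energy, a proxy for loudness
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # Mel-frequency cepstral coefficients
    return {
        "duration_sec": duration,
        "rms_mean": float(np.mean(rms)),
        "rms_std": float(np.std(rms)),
        # summarize each MFCC band by its mean over time
        **{f"mfcc_{i}_mean": float(m) for i, m in enumerate(mfcc.mean(axis=1))},
    }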

Step 3: Prediction

We constructed an Interview Quality Score (IQS) representing the combination of several qualitative and quantitative aspects of each interview. We ensembled boosted trees, support vector machines, and random forests to segregate high-quality interviews from those with issues.

mental health and machine learning 2

This model was able to effectively pre-screen about 75% of the interviews as good or bad and was unsure about the remainder. Reviewers could now work faster and more productively, focusing only on the interviews where the model was not too confident. Overall prediction accuracy improved 2.5x, with some segments returning over 90% accuracy.
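
A minimal sketch of this kind of ensemble with scikit-learn is shown below; the synthetic feature matrix and labels are stand-ins for the interview-level features and reviewer-assigned quality labels, not the actual project data.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# Stand-in for the interview-level feature matrix and quality labels (1 = good, 0 = issues)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier()),
        ("svm", SVC(probability=True)),   # probability=True enables soft voting
        ("rf", RandomForestClassifier()),
    ],
    voting="soft",
)
ensemble.fit(X, y)

# Probabilities can be thresholded into confident good/bad calls, leaving the
# uncertain middle band for human reviewers
quality_scores = ensemble.predict_proba(X)[:, 1]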

ML Models ‘Hear’ What’s Left Unsaid

The analyses provided independent insights regarding pauses, paralinguistics (tone of voice, loudness, inflection, pitch), speech disfluency (fillers like ‘er’, ‘um’), and physician performance during such interviews.

These models have wider applicability beyond clinical trials. Physicians can use model insights to guide treatment and therapy, leading to better mental health outcomes for their patients, whether in clinical trials or practice, addressing one of the critical public health challenges of our time.

References

World Health Organization, United for Global Mental Health and the World Federation for Mental Health joint news release, 27 August 2020
This article was first published in Analytics India Magazine – https://analyticsindiamag.com/machine-learning-mental-health-notes-from-tiger-analytics/

The post Enhancing Mental Healthcare: Machine Learning’s Role in Clinical Trials appeared first on Tiger Analytics.

]]>
Strategic Guide: Maximize the Power of Bayesian Belief Networks https://www.tigeranalytics.com/perspectives/blog/strategic-guide-maximize-the-power-of-bayesian-belief-networks/ Thu, 20 Aug 2020 13:32:56 +0000 https://www.tigeranalytics.com/blog/strategic-guide-maximize-the-power-of-bayesian-belief-networks/ Examine the use of Bayesian Belief Networks for event prediction, driver analysis, and intervention assessment. Get actionable insights on the construction and practical applications of these networks in the healthcare sector - and how to enhance predictive accuracy while making data-driven decisions.

The post Strategic Guide: Maximize the Power of Bayesian Belief Networks appeared first on Tiger Analytics.

]]>
Bayesian Belief Network

In our previous post on the Bayesian Belief Network, we learned about the basic concepts governing a BBN, belief propagation, and the construction of a discrete Bayesian Belief Network.

Armed with that knowledge, let us now explore in detail the following three key characteristics of the Bayesian Belief Network (BBN):

1. Event Prediction

2. Driver Analysis

3. Intervention Assessment

We’ve illustrated these characteristics with a real-world example. In health care services, the Member Experience Survey (MES) is sent to a random sample of customers who had issues with their health care services and had contacted the customer care department. These customers are asked to rate the services they have availed, currently or in the past. The output of the survey is a score on a scale of 0-10. Based on the scores, customers are divided into 3 categories: score 0-6: Detractors, 7-8: Passives, 9-10: Promoters. Net Promoter Score (NPS) is a metric widely used by businesses to understand customer satisfaction and the potential for business growth. It is calculated as “% Promoters – % Detractors”. It is evident from the formula that if we want to increase NPS, we must control the percentage of detractors. Thus, it becomes imperative to understand the various drivers of detractors.

Under the assumption that the level of dissatisfaction/irritation is the reason for a customer to rank low in the MES survey, we will hypothesize a few prominent features along with a target to construct a discrete Bayesian Belief Network, and demonstrate the concept behind analyzing the drivers for that target and the effect of intervention at various levels. The features are:

  • Service Type: Customers are not satisfied with a few of the services.
  • Claim Cost: Denied claims with high claim costs may cause more dissatisfaction.
  • Past Call: Many calls for the same issue may cause dissatisfaction/irritation.
  • Lifestyle: Claim cost depends on the lifestyle of the customer.
  • Income: Lifestyle depends on monthly income.
  • Age-gender: Lifestyle also depends on the age and gender of a customer.

For now, let us assume the above features and their propositions are true and construct a Bayesian Belief Network structure.

Member Experience Survey Bayesian Belief Network

Let us reduce the number of nodes in the above use case and consider Service Type, Claim Cost, and Past Calls as the only predictors which explain the causality of the detractors in the model. A synthetic dataset is generated to illustrate this example. For each node, we have the belief and conditional probability tables as shown in the diagram below.

BBN illustration 1

Parameter Learning

Conditional Probability Table (CPT) for each node:

conditional probability tables-bbn

Now, given the CPTs for all the nodes, the joint distribution is estimated as below:

joint distribution

We now have the structure of the network, CPTs, prior beliefs for each node, and the joint distribution in place. This completes the Bayesian Belief Network. Let us go back to the three key characteristics of the Bayesian Belief Network, which we wanted to explore.

Event Prediction

How likely is it that a customer will become a detractor if they have called customer service 1-2 times in the past within a defined time frame? The question sounds trivial, as we have predicted such probabilities many times. With a Bayesian Belief Network, given the data, we can estimate the probability of a customer being a detractor. We can use the concepts of marginal probability and Bayes’ theorem to estimate the probability as follows:

bayes theoremBBN illustration 2

So, the evidence that a customer has called 1-2 times in the past propagates through the network, and we see that the probability of the customer not being a detractor has updated from 0.749 to 0.92. This indicates that having called 1-2 times reduces the likelihood of a customer becoming a detractor.

Similarly, we can see the propagation of multiple pieces of evidence through the network. For example, a customer requested service type A and has called 3-4 times. How likely is it for such a customer to become a detractor? The BBN will help you answer that.
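
Here is a minimal sketch of such queries using pgmpy; the reduced structure follows the example above, but the synthetic survey file, its column names, and its state labels are assumptions (and the BayesianNetwork class is called BayesianModel in older pgmpy releases).

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Hypothetical synthetic survey data with categorical columns
df = pd.read_csv("mes_synthetic.csv")  # ServiceType, ClaimCost, PastCalls, Detractor

# Reduced structure: the three predictors point at the Detractor node
model = BayesianNetwork([
    ("ServiceType", "Detractor"),
    ("ClaimCost", "Detractor"),
    ("PastCalls", "Detractor"),
])
model.fit(df, estimator=MaximumLikelihoodEstimator)  # parameter learning (CPTs)

infer = VariableElimination(model)
# Evidence propagation: P(Detractor | PastCalls = "1-2")
print(infer.query(variables=["Detractor"], evidence={"PastCalls": "1-2"}))
# Multiple pieces of evidence: P(Detractor | ServiceType = "A", PastCalls = "3-4")
print(infer.query(variables=["Detractor"], evidence={"ServiceType": "A", "PastCalls": "3-4"}))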

Evidence/Driver Analysis

Interestingly, another question could be: what factors influence a customer to become a detractor?

The beauty of a BBN is that it treats all nodes impartially; it doesn’t differentiate between targets and predictors. Thus, the underlying probability propagation concepts remain the same for both. Using this property, we can analyze the reasons/drivers behind specific evidence. If the evidence is that a customer is a detractor, we can analyze the drivers/causal factors. In this example, we set the evidence ‘Detractor = Yes’ to 100%; consequently, the evidence propagates through the network and updates all the beliefs. Let us look at the posteriors in the figure below:

BBN illustration 3

We can see the jump in the probability of service C from 0.27 to 0.62, and similarly, we can see the jump in the Past Call node for the “3-4” and “4+” levels.

Interpretation 1

We can infer that customers who opt for service C face more issues compared to other types of services and, consequently, call more often.

Interpretation 2

Alternatively, we can interpret that customers who opt for service C have complex issues that the customer care agents are not skilled enough to resolve.
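
Continuing the hypothetical pgmpy sketch above, the driver-analysis query simply conditions on the target node and reads off the updated beliefs of the predictors (the posteriors discussed above).

# Driver analysis: condition on the target and inspect the predictors' posteriors
print(infer.query(variables=["ServiceType"], evidence={"Detractor": "Yes"}))
print(infer.query(variables=["PastCalls"], evidence={"Detractor": "Yes"}))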

Intervention Assessment

What if management wants to work on the skills of the customer care advocates/agents in order to control the frequency of calls? Let us assess the impact of their efforts if they bring the “4+” past-calls segment down to zero.

Honestly, this is a little tricky because we have to make small changes to our network. In this case, we need to remove all parental links from the actionable node, as its value is now set externally rather than observed. The modified network looks as follows in the diagram:

BBN illustration 4

We also have to make some adjustments to the node’s attributes. Since we want to bring the “4+” calls down to zero, the number of past calls will always be <= 4. To handle this scenario, let us keep two attributes, “0-4” and “4+”, and set the evidence for the “0-4” level to 100%.

BBN illustration 5

When we look at the posteriors now, we see that restricting the number of calls to 0-4 has the desired impact on the detractors. It updates the beliefs, and the probability of being a detractor reduces from 25% to 18%, around a 28% reduction in the detractor base.
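
A minimal sketch of this intervention, continuing the earlier hypothetical pgmpy example: the past-call levels are recoded into the two intervention-relevant bands, any incoming edges to the actionable node are dropped, and the query is run with the intervention as evidence. Column and state names remain assumptions.

# Recode PastCalls into the two intervention-relevant levels (original levels are hypothetical)
df_int = df.copy()
df_int["PastCalls"] = df_int["PastCalls"].map(lambda x: "4+" if x == "4+" else "0-4")

# Mutilated network: the intervened node keeps no incoming edges and the
# predictors still point at Detractor
intervened = BayesianNetwork([
    ("ServiceType", "Detractor"),
    ("ClaimCost", "Detractor"),
    ("PastCalls", "Detractor"),
])
intervened.fit(df_int, estimator=MaximumLikelihoodEstimator)

# Force the intervention: all customers fall in the "0-4" band
print(VariableElimination(intervened).query(
    variables=["Detractor"], evidence={"PastCalls": "0-4"}
))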

Conclusion

We can apply BBNs in many scenarios. However, they work best when there is collinearity or there are dependencies among the predictor variables, and when the variables are ordinal/categorical with a small number of levels.

Unlike other ML models, a BBN also considers the interdependencies between predictors while quantifying their explanatory power. If there are no interdependencies among predictors, it is only as good as other ML models from an application perspective; still, Bayesian interpretability makes it more intuitive.

Too many variables make parameter learning and maintenance more difficult, and not all the buckets in the CPTs will have sufficient data to support reliable estimates. However, thanks to its graphical nature and parent-child relationships, a BBN requires relatively fewer parameters to be estimated than other Bayesian methods.

Also, finding the optimum structure becomes complicated with too many variables.

In a competitive environment, the correct interpretation of drivers is crucial, and even slight biases may have a significant impact on the result. The Bayesian Belief Network provides a straightforward and intuitive method that respects the correlation between predictors and calculates the strength of drivers through path modeling. Along with identifying drivers, it also helps business executives analyze the action plan. Its predictive power can be combined with other machine learning algorithms to improve model predictions.

References

1. https://www.bayesia.com/bayesian-networks-introduction
2. https://www.hindawi.com/journals/jat/2017/2525481/
3. https://www.edureka.co/blog/bayesian-networks/
4. https://www.saedsayad.com/docs/Bayesian_Belief_Network.pdf
5. https://cse.sc.edu/~mgv/csce582sp20/links/mradEtAl_APIN2015.pdf
6. https://github.com/sujeettiger/Bayesian-Belief-Network

The post Strategic Guide: Maximize the Power of Bayesian Belief Networks appeared first on Tiger Analytics.

]]>
A Beginner’s Guide: Enter the World of Bayesian Belief Networks https://www.tigeranalytics.com/perspectives/blog/a-beginners-guide-enter-the-world-of-bayesian-belief-networks/ Thu, 09 Jul 2020 11:13:56 +0000 https://www.tigeranalytics.com/blog/a-beginners-guide-enter-the-world-of-bayesian-belief-networks/ Get to know what makes Bayesian Belief Networks (BBNs) a powerful tool in data science with an introduction to the concepts behind BBNs, their structure, applications, and benefits. Learn how BBNs use probabilistic reasoning to model complex systems and improve decision-making processes.

The post A Beginner’s Guide: Enter the World of Bayesian Belief Networks appeared first on Tiger Analytics.

]]>
Introduction

In the world of machine learning and advanced analytics, every day data scientists solve tons of problems with the help of newly developed and sophisticated AI techniques. The main focus while solving these problems is to deliver highly accurate and error-free results. However, while implementing these techniques in a business context, it is essential to provide a list of actionable levers/drivers of the model output that the end-users can use to make business decisions. This requirement applies to solutions developed across industries. One such machine learning technique that focuses on providing such actionable insights is the Bayesian Belief Network, which is the focus of this blog. The assumption here is that the reader has some understanding of machine learning and some of the associated terminologies.

Several approaches are currently being used to understand these drivers/levers. However, most of them take a simple approach to understanding the direct cause-and-effect relationship between the predictors and the target. The main challenges with such an approach are:

1. The focus remains on the relationship between predictors and target, and not on the inter-relationship between the predictor attributes. A simple example is a categorical variable with various states

2. It is assumed that each of the predictors has a direct relationship with the target variables, while, in reality, the variables could be correlated and connected. Also, the influence of one predictor on another is ignored while calculating the overall impact of a predictor on the target. For example, in almost every problem, we try to handle multicollinearity by choosing a correlation cut-off to disregard near-collinear variables. Nevertheless, complete removal of multicollinearity from the model is rare

The Bayesian Network can be utilized to address this challenge. It helps in understanding the drivers without ignoring the relationship among variables. It also provides a framework for the prior assessment of the impact of any actions that have to be taken to improve the outcome. A unique feature of this approach is that it allows for the propagation of evidence through the network.

Before getting into the details of driver analysis using Bayesian Network, let us discuss the following:

1. The Bayesian Belief Network

2. Basic concepts behind the BBN

3. Belief Propagation

4. Constructing a discrete Bayesian Belief Network

1. The Bayesian Belief Network

A Bayesian Belief Network (BBN) is a computational model based on graph theory and probability theory. The structure of a BBN is represented by a Directed Acyclic Graph (DAG). Formally, a DAG is a pair (N, A), where N is the node-set, and A is the arc-set. If there are two nodes u and v belonging to N, and there is an arc going from u to v, then u is termed the parent of v and v is called the child of u. In terms of the cause-effect relationship, u is the cause, and v is the effect. A node can be a parent of one node while also being the child of a different node. An example is illustrated in the image below-

Bayesian Belief Network-illustration

a and b are the parents of u, and u is the child of a and b. At the same time, u is also the single parent of v. With respect to the cause-effect relationship, a and b are direct causes of u, and u directly causes v, implying that a and b are indirectly responsible for the occurrence of v.

2. Basic concepts behind the BBN

The Bayesian Belief Network is based on Bayes’ theorem. A brief overview of it is provided below-

Bayes’ Theorem

For two random variables X and Y, the following equation holds-

P(X|Y) = P(Y|X) * P(X) / P(Y)

If X and Y are independent of each other, then

P(X|Y) = P(X), i.e., P(X, Y) = P(X) * P(Y)

Joint Probability

Given that X1, X2, ..., Xn are the features (nodes) in a BBN, the joint probability is defined as:

P(X1, X2, ..., Xn) = ∏ P(Xi | Parents(Xi)), with the product taken over i = 1, ..., n

Marginal Probability

Given the joint probability, marginal probability of X1 = x0 is calculated as:

P(X1 = x0) = Σ P(X1 = x0, X2 = x2, ..., Xn = xn), with the sum taken over all combinations of x2, ..., xn

where x2, x3, ..., xn are the sets of values corresponding to X2, X3, ..., Xn.

3. Belief Propagation

Now let us try to understand belief propagation with the help of an example. We will consider an elementary network.

belief propagation

The above network says that a train strike influences Allen’s and Kelvin’s work timings, and the probabilities are distributed as below:

P(T=Y) = 0.1, P(T=N) = 0.9
P(A=Y|T=Y) = 0.7, P(A=N|T=Y) = 0.3, P(A=Y|T=N) = 0.6, P(A=N|T=N) = 0.4
P(K=Y|T=Y) = 0.6, P(K=N|T=Y) = 0.4, P(K=Y|T=N) = 0.1, P(K=N|T=N) = 0.9

Given that we know the train strike probability P(T) and the conditional probabilities P(K|T) and P(A|T), we can calculate P(A) and P(K).

P(A=Y) = Σ over T,K of P(A=Y, T, K) = P(A=Y,K=Y,T=Y) + P(A=Y,K=N,T=Y) + P(A=Y,K=Y,T=N) + P(A=Y,K=N,T=N)

= P(T=Y)*P(A=Y|T=Y)*P(K=Y|T=Y) + P(T=Y)*P(A=Y|T=Y)*P(K=N|T=Y)

+ P(T=N)*P(A=Y|T=N)*P(K=Y|T=N) + P(T=N)*P(A=Y|T=N)*P(K=N|T=N)

= P(A=Y|T=Y)*P(T=Y) + P(A=Y|T=N)*P(T=N)    (since P(K=Y|T=Y) + P(K=N|T=Y) = 1)

= 0.7*0.1 + 0.6*0.9 = 0.61

Similarly, using the shorter form of the calculation:

P(K= Y) = P(K=Y|T=Y)*P(T=Y) + P(K=Y|T=N)*P(T=N)

= 0.6*0.1 + 0.1*0.9 = 0.15

Now, let us say we come to know that Allen is late, but we do not know if there is a train strike. Can we calculate the probability that Kelvin will be late, given that Allen is late? Let us see how the evidence that Allen is late propagates through the network.

Let us estimate the probability of the train strike given we already know Allen is late.

P(T=Y|A=Y) = P(A=Y|T=Y) * P(T=Y) / P(A=Y) = 0.7*0.1/0.61 ≈ 0.11

The above calculation tells us that if Allen is late, the probability that there is a train strike rises from 0.1 to about 0.11. We can use this updated belief about the train strike to calculate the probability of Kelvin being late.

P(K=Y|A=Y) = P(K=Y|T=Y)*P(T=Y|A=Y) + P(K=Y|T=N)*P(T=N|A=Y)

= 0.6*0.11 + 0.1*0.89 ≈ 0.16

This is a slight increase in the probability of Kelvin being late (from 0.15 to about 0.16). So, the evidence that Allen is late propagates through the network and updates our beliefs about both the train strike and Kelvin being late.

belief propagation-formula
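
For readers who prefer code, here is a minimal sketch of this toy network in pgmpy that reproduces the propagation above (class names vary slightly across pgmpy versions; BayesianNetwork is called BayesianModel in older releases).

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Structure: the train strike T influences both Allen's (A) and Kelvin's (K) timings
model = BayesianNetwork([("T", "A"), ("T", "K")])

cpd_t = TabularCPD("T", 2, [[0.1], [0.9]], state_names={"T": ["Y", "N"]})
cpd_a = TabularCPD("A", 2, [[0.7, 0.6], [0.3, 0.4]],
                   evidence=["T"], evidence_card=[2],
                   state_names={"A": ["Y", "N"], "T": ["Y", "N"]})
cpd_k = TabularCPD("K", 2, [[0.6, 0.1], [0.4, 0.9]],
                   evidence=["T"], evidence_card=[2],
                   state_names={"K": ["Y", "N"], "T": ["Y", "N"]})
model.add_cpds(cpd_t, cpd_a, cpd_k)

infer = VariableElimination(model)
print(infer.query(variables=["K"]))                       # prior belief: P(K=Y) = 0.15
print(infer.query(variables=["K"], evidence={"A": "Y"}))  # updated belief: P(K=Y|A=Y) ≈ 0.16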

4. Constructing a discrete Bayesian Belief Network

BBN can be constructed using only continuous variables, only categorical variables, or a mix of variables. Here, we will discuss discrete BBN, which is built using categorical variables only. There are two major constituents to constructing a BBN: Structure Learning and Parameter Learning.

Structure Learning

Structure learning is the basis of Bayesian Belief Network analysis. The effectiveness of the solution depends on the optimality of the learned structure. We can use the following approaches:

1. Create a structure based on domain knowledge and expertise.

2. Create a locally optimal structure using machine learning algorithms. Please note that finding a globally optimal structure is an NP-hard problem. There are many algorithms to learn the structure, such as K2, hill climbing, and tabu search. You can learn more about these from the bnlearn package in R; Python aficionados can refer to the following link- https://pgmpy.chrisittner.de/ (a short sketch using pgmpy follows this list).

3. Create a structure using a combination of both the above approaches: use machine learning techniques to build the model, and then, with the reduced set of explanatory variables, use domain knowledge/expertise to create the structure. Of the three, this is the quickest and most effective way.
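
As referenced above, here is a minimal structure learning sketch with pgmpy; the dataframe of categorical variables is a hypothetical placeholder, and API details vary slightly across pgmpy versions.

import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

# Hypothetical dataframe of categorical variables (one column per candidate node)
data = pd.read_csv("categorical_features.csv")

# Hill climbing searches for a locally optimal DAG under the BIC score
search = HillClimbSearch(data)
learned_dag = search.estimate(scoring_method=BicScore(data))
print(learned_dag.edges())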

Parameter Learning

Another major component of a BBN is the Conditional Probability Table (CPT). Since each node in the structure is a random variable, it can take multiple values/states, and each state has some probability of occurrence. We call these probabilities beliefs. Also, each node is connected to other nodes in the network, so, as per the structure, we learn the conditional probability of each state of a node given its parents. The tabular form of all such probabilities is called the CPT.
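
A minimal parameter learning sketch, continuing from the structure learned in the previous snippet (the structure and data remain hypothetical):

from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator

# Wrap the learned (or expert-defined) edges in a network and estimate the CPTs from data
model = BayesianNetwork(learned_dag.edges())
model.fit(data, estimator=MaximumLikelihoodEstimator)

for cpd in model.get_cpds():
    print(cpd)  # one Conditional Probability Table per node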

Conclusion

This blog aims to equip you with the bare minimum concepts required to construct a discrete BBN and understand its various components. Structure learning and parameter learning are the two major components necessary to build a BBN. The concepts of Bayes’ theorem and joint and marginal probability form the base of the network, while the propagation of evidence is key to understanding how a BBN functions.

A BBN can be used like any other machine learning technique. However, it works best where there are interdependencies among the predictors and the number of predictors is small. The best part of a BBN is its intuitive way of explaining the drivers of evidence.

Stay tuned to this space to learn how this concept can be applied for event prediction, driver analysis, and intervention assessment of a given classification problem.

The post A Beginner’s Guide: Enter the World of Bayesian Belief Networks appeared first on Tiger Analytics.

]]>
How to Implement ML Models: Azure and Jupyter for Production https://www.tigeranalytics.com/perspectives/blog/how-to-implement-ml-models-azure-and-jupyter-for-production/ Thu, 28 May 2020 20:46:56 +0000 https://www.tigeranalytics.com/blog/how-to-implement-ml-models-azure-and-jupyter-for-production/ Learn how to implement Machine Learning models using Azure and Jupyter for production environments - from model development to deployment, including environment setup, training, and real-time predictions. Understand the advantages of using Azure's robust infrastructure and Jupyter's flexible interface to streamline the entire process.

The post How to Implement ML Models: Azure and Jupyter for Production appeared first on Tiger Analytics.

]]>
Introduction

As data scientists, one of the most pressing challenges we face is how to operationalize machine learning models so that they are robust, cost-effective, and scalable enough to handle traffic demand. With advanced cloud technologies and serverless computing, cost-effective (pay-per-use) and auto-scaling platforms (with scale-in/scale-out architecture depending on traffic) are now available. Data scientists can use these to accelerate machine learning model deployment without having to worry about the infrastructure.

This blog discusses one such methodology: taking machine learning code and a model developed locally in a Jupyter notebook and implementing them in the Azure environment for real-time predictions.

ML Implementation Architecture

ML Implementation Architecture on Azure


We have used Azure Functions to deploy the Model Scoring and Feature Store Creation code into production. Azure Functions is a FaaS (Function as a Service) offering: it provides event-based, serverless computing that accelerates development without the need to manage infrastructure. Azure Functions comes with some useful capabilities:

1. Choice of Programming Languages

You can work with any language of your choice- C#, Node.js, Java, Python

2. Event-driven and Scalable

You can use built-in triggers and bindings such as HTTP, event, timer, and queue triggers to define when a function is invoked. The architecture scales automatically with the workload.

ML Implementation Process

Once the code is developed, the following best practices help make the machine learning code production-ready. Below are the steps to deploy the Azure Function.

ML Implementation Process


Azure Function Deployment Steps Walkthrough

Visual Studio Code editor with Azure Function extension is used to create a serverless HTTP endpoint with Python.

1. Sign in to Azure

sign into azure

2. Create a New Project. In the prompt that shows up, select Python as the language and http trigger as the trigger (based on your requirement)

create new project

3. The Azure Function is created with the folder structure shown below. Write your logic in __init__.py, or copy the code into it if it is already developed

azure function folder structure
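
A minimal sketch of what __init__.py could look like for a model-scoring endpoint is shown below; the request fields, feature names, and the module-level `model` object are assumptions, and the Blob/SQL loading shown in later steps would plug into this skeleton.

import json
import logging

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    """HTTP-triggered scoring function (a sketch; field names are hypothetical)."""
    logging.info("Model scoring request received")
    try:
        payload = req.get_json()  # e.g. {"feature_1": 0.4, "feature_2": 1.2}
    except ValueError:
        return func.HttpResponse("Request body must be valid JSON", status_code=400)

    # In the full solution, features come from the Feature Store (step 8)
    # and `model` is loaded from Blob Storage (step 7)
    features = [[payload.get("feature_1", 0.0), payload.get("feature_2", 0.0)]]
    prediction = model.predict(features)[0]  # `model` is assumed to be loaded at module scope

    return func.HttpResponse(
        json.dumps({"prediction": str(prediction)}),
        mimetype="application/json",
    )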

4. function.json defines the trigger and bindings; in this case, an http trigger

function.json

5. local.settings.json contains all the environment variables used in the code as key-value pairs

settings.json

6. requirements.txt contains all the libraries that need to be pip-installed

requirements

7. As the model is stored in Blob Storage, add code along the following lines to read it from Blob Storage

blob
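
A sketch of reading a pickled model from Blob Storage with the azure-storage-blob SDK might look like this; the connection-string setting name, container name, and blob name are hypothetical.

import os
import pickle

from azure.storage.blob import BlobServiceClient

# Connection string comes from local.settings.json / Function App settings
blob_service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
blob_client = blob_service.get_blob_client(container="models", blob="scoring_model.pkl")

# Download the serialized model and deserialize it
model_bytes = blob_client.download_blob().readall()
model = pickle.loads(model_bytes)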

8. Read the Feature Store data from Azure SQL DB

feature store data
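
A sketch of reading the feature store from Azure SQL DB using pyodbc and pandas, assuming hypothetical connection settings and table/column names:

import os

import pandas as pd
import pyodbc

# Connection string stored as an environment variable (see local.settings.json)
conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])

# Pull the features for the customer in the scoring request
query = "SELECT * FROM feature_store WHERE customer_id = ?"
features_df = pd.read_sql(query, conn, params=["CUST-123"])  # id would come from the request payload
conn.close()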

9. Test locally. Choose Debug -> Start Debugging; it will run locally and give a local API endpoint

debug

10. Publish to your Azure account using the following command

func azure functionapp publish <functionappname> --build remote --additional-packages “python3-dev libevent-dev unixodbc-dev build-essential libssl-dev libffi-dev”

publish

11. Log in to the Azure Portal and go to the Azure Functions resource to get the API endpoint for Model Scoring

azure portal

Conclusion

This API can also be integrated with front-end applications for real-time predictions.

Happy Learning!

The post How to Implement ML Models: Azure and Jupyter for Production appeared first on Tiger Analytics.

]]>