ML Archives - Tiger Analytics

AI in Beauty: Decoding Customer Preferences with the Power of Affinity Embeddings
https://www.tigeranalytics.com/perspectives/blog/ai-in-beauty-decoding-customer-preferences-with-the-power-of-affinity-embeddings/ | Fri, 11 Apr 2025

The beauty industry is leveraging AI and machine learning to transform how brands understand and serve customers. This blog explores how Customer Product Affinity Embedding is revolutionizing personalized shopping experiences by mapping customer preferences to product characteristics. By integrating data from multiple touchpoints — purchase history, social media, and more — this approach enables hyper-personalized recommendations, smart substitutions, and targeted campaigns.

Picture this: The data engineering team at a leading retail chain is tasked with integrating customer data from every touchpoint — purchase histories, website clicks, and social media interactions — to create personalized shopping experiences. The goal? To leverage this data for everything from predictive product recommendations to dynamic pricing strategies and targeted marketing campaigns. But the challenge isn’t just in collecting this data; it’s in understanding how to embed it across multiple customer interactions seamlessly while ensuring compliance with privacy regulations and safeguarding customer trust.

The beauty industry today is embracing cutting-edge technology to stay ahead of microtrends and streamline product development, all while improving efficiency and innovation in an increasingly fast-paced market. Brand loyalty is no longer dictated solely by legacy brands; “digital-first” challengers are capitalizing on changing consumer preferences and behaviors. Global Cosmetic Industry magazine found that 6.2% of beauty sales now come from social selling platforms, with TikTok alone capturing 2.6% of the market. Nearly 41% of all beauty and personal care product sales now happen online, according to NielsenIQ’s Global State of Beauty 2025 report.

Virtual try-on apps, AI/ML-based product recommendations, smart applicators for hair and skin products – the beauty industry is testing, scaling, and rapidly deploying solutions to satisfy consumers who demand personalized experiences that cater to their unique needs. Based on our observations and conversations with leaders in beauty and cosmetics retailing, we found that thriving in this dynamic landscape requires moving beyond traditional customer segmentation and delving deeper into customer product affinity.

What is customer product affinity embedding?

Imagine a complex map where customers and products are not locations but points in a multidimensional space. Customer product affinity embedding uses advanced machine learning algorithms to analyze vast amounts of data – everything from purchase history and browsing behavior to customer reviews and social media interactions. Upon processing this data, the algorithms create a map where customers (those who have opted-in, and anonymized, of course) and products are positioned based on the strength of their relationship, with proximity reflecting the degree of relevance, preference, and engagement between them. In short, it helps capture the essence of customer preferences in a mathematical representation.

This approach provides businesses with a deeper understanding of customer-product affinities. At Tiger Analytics, we partnered with a leading beauty retailer to design a system that captures the true essence of customer preferences by focusing on both customer and product nuances. It begins with a harmonized product taxonomy and sanitized product attributes, and incorporates curated customer data from transactions, interactions, and browsing behavior. Together, these elements create an accurate and comprehensive view of customer affinities, allowing businesses to tailor strategies with greater precision.

How does customer product affinity embedding transform business decisions?

Customer product affinity embedding enhances decision-making by capturing the multidimensional interactions between business efforts, customer activities, product characteristics, and broader macroeconomic conditions. Unlike conventional machine learning approaches, which typically focus on solving one problem at a time and require custom feature engineering, this method integrates diverse business signals. Traditional approaches often isolate and aggregate these signals, but they fail to explain the overall variance or underlying business causality, limiting their effectiveness.

Figure: Matrix factorization foundation for the affinity matrix
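
To make the matrix factorization idea concrete, here is a minimal, self-contained sketch of learning customer and product embeddings from an interaction matrix with stochastic gradient descent. It is an illustration rather than the production system described here; the toy matrix, dimensions, and hyperparameters are all assumptions.

```python
import numpy as np

def fit_affinity_embeddings(interactions, n_factors=16, lr=0.01, reg=0.05, epochs=20, seed=42):
    """Factorize a customer x product interaction matrix R ~ C @ P.T.

    interactions: 2-D array where entry (i, j) is an affinity signal
    (e.g., purchase count or engagement score); 0 means unobserved.
    Returns customer and product embedding matrices.
    """
    rng = np.random.default_rng(seed)
    n_customers, n_products = interactions.shape
    C = 0.1 * rng.standard_normal((n_customers, n_factors))   # customer embeddings
    P = 0.1 * rng.standard_normal((n_products, n_factors))    # product embeddings
    rows, cols = np.nonzero(interactions)                     # train on observed pairs only

    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = interactions[i, j] - C[i] @ P[j]             # prediction error
            C[i] += lr * (err * P[j] - reg * C[i])             # SGD updates with L2 regularization
            P[j] += lr * (err * C[i] - reg * P[j])
    return C, P

# Toy example: 4 customers x 5 products
R = np.array([
    [5, 0, 3, 0, 1],
    [4, 0, 0, 1, 0],
    [0, 2, 0, 5, 4],
    [0, 3, 1, 4, 0],
], dtype=float)

customers, products = fit_affinity_embeddings(R)
predicted_affinity = customers @ products.T  # dense affinity scores, including unseen pairs
```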

By incorporating deeper insights into customer preferences and behaviors into business strategy, beauty retailers can unlock greater efficiency, relevance, and personalization across various touchpoints. Below are a few ways customer affinity embedding can bring tangible advantages:

  • Hyper-Personalized Recommendations: Take, for instance, suggesting a hydrating toner to someone with a high affinity for high-coverage foundations, thereby providing them with relevant products that match their needs and preferences.
  • Smart Product Substitutions: Sarah, a regular purchaser of the ‘Sunset Shimmer’ eyeshadow palette, gets a notification suggesting a substitute — ‘Ocean Breeze’, a cool-toned palette with hues she may also enjoy — when her favorite product is out of stock.
  • Store Inventory Optimizations: Optimizing inventory levels by predicting demand based on customer affinity. Businesses can avoid stockouts for high-affinity products and minimize dead stock for low-affinity ones, leading to reduced costs and improved customer satisfaction.
  • Personalized Search: Traditional search relies on keywords and filters. However, these methods often miss the nuances of the customer’s intent. For example, a search for “foundation” might come from someone seeking full coverage or from someone wanting a lightweight, dewy finish. Affinity embedding helps bridge this gap, ensuring more relevant search results.
  • Targeted Marketing Campaigns: Consider targeting millennials with a strong affinity for Korean beauty with social media campaigns showcasing the latest K-beauty trends.
  • Data-Driven Product Development: If a significant customer segment shows a high affinity for vegan beauty products, but limited options are available, the brand can proactively develop a high-quality vegan makeup line to fill that gap in the market.
  • Personalized Buying Journey: Picture a customer searching for false eyelashes on the app, and then being recommended complementary items like glue and party essentials. Additionally, the system can suggest popular shades previously chosen by customers with similar preferences, creating a seamless and personalized shopping experience.

These are just a few examples of how customer affinity embedding can enhance customer engagement and improve the overall shopping experience. Other use cases, such as Trip Mission Basket Builder, Dynamic Pricing/Discounting, and Subscription Box Optimization further demonstrate how this technology can revolutionize customer satisfaction and business efficiency.

Real-world impact of customer affinity embedding on sales and engagement

Customer affinity embedding is a multi-step process that converts customer data points into a mathematical representation that captures the strength of a customer’s relationship with various products.

Figure: Functional architecture

The same embedding features can be transformed into affinity ranks, which serve as inputs for downstream ML models to generate personalized recommendations and provide insights such as:

  • Product Similarity
  • Customer Similarity
  • Customer Affinity to Products
  • Product Substitutions
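
Continuing the illustrative sketch above, the snippet below shows how the same embeddings might be turned into product similarity, customer similarity, and per-customer affinity ranks using cosine similarity. Variable names carry over from the earlier example and are hypothetical.

```python
import numpy as np

def cosine_sim(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T

# `customers` and `products` are the embedding matrices from the previous sketch.
product_similarity = cosine_sim(products, products)       # product-to-product (substitution candidates)
customer_similarity = cosine_sim(customers, customers)     # lookalike customers
affinity_scores = customers @ products.T                   # customer-to-product affinity

# Rank products per customer: rank 1 = strongest affinity.
affinity_ranks = (-affinity_scores).argsort(axis=1).argsort(axis=1) + 1

top_3_per_customer = (-affinity_scores).argsort(axis=1)[:, :3]  # candidate input for recommendation models
```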

Through our collaboration, the beauty retailer experienced a 4.5% increase in repeat purchases over a 12-month period. Additionally, the brand saw a 3.5% average boost in customer engagement scores within the fashion category, and a 7.8% rise in app usage. The company’s ROI for marketing campaigns also improved, with a 23-basis-point increase across digital channels.

Today, the real question isn’t just ‘what does the customer want?’ – it’s ‘how can we truly understand and deliver it?’

Understanding customer needs isn’t just about analyzing past behaviors, but rather predicting intent and adapting in real time. Customers don’t always explicitly state their preferences. Their choices are shaped by trends, context, and discovery. The challenge for brands is to move from reactive insights to proactive personalization, ensuring that every recommendation, search result, and marketing touchpoint feels intuitive rather than intrusive.

Customer product affinity embedding brings brands closer to the customer by placing the consumer at the heart of every decision. With data-driven customer understanding, brands can build deeper and more personalized connections, driving loyalty and growth.

References:

https://shop.nielseniq.com/product/global-state-of-beauty-2025/
https://www.gcimagazine.com/brands-products/skin-care/news/22916897/2024s-global-beauty-sales-are-powered-by-an-ecommerce-social-selling-boom/

Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security
https://www.tigeranalytics.com/perspectives/blog/invisible-threats-visible-solutions-integrating-aws-macie-and-tiger-data-fabric-for-ultimate-security/ | Thu, 07 Mar 2024

Data defenses are now fortified against potential breaches with the Tiger Data Fabric-AWS Macie integration, automating sensitive data discovery, evaluation, and protection in the data pipeline for enhanced security. Explore how to integrate AWS Macie into a data fabric.

Discovering and handling sensitive data in a data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and dealing with the associated costs of resources and computing. Identifying sensitive information at the entry point of the data pipeline, typically during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. At Tiger Analytics, we’ve integrated these capabilities into the pipelines of our proprietary Data Fabric solution, the Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has been enhanced recently to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information and results are captured and stored at another path in the S3 bucket.

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate “sensitive data management” into their data pipelines, here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data in various stages of processing: a raw data bucket for uploading objects into the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket that holds objects in which sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage business logic and execute the steps mentioned below:
    • triggerMacieJob: This Lambda function creates a Macie sensitive data discovery job on the designated S3 bucket during the scan stage (a minimal sketch of such a handler appears after this list).
    • pollWait: This Step Function waits for a specific state to be reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step function uses a Choice state to determine if the job has finished. If it has, it triggers additional steps to be executed.
    • waitForJobToComplete: This Step function employs a Choice state to wait for the job to complete and prevent the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database, and ensures that all job statuses are accurately reflected in the database.
  • A Macie scan job scans the specified S3 bucket for sensitive data. The process of creating the Macie job involves multiple steps, allowing us to choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 bucket path for sensitive data buckets to filter out the scan and avoid unnecessary scanning on other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the Tiger Data Fabric, we have automated the selection of custom identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be seen in the AWS console and filtered by S3 bucket. We employed Glue jobs to parse the results and route the data to the manual review bucket and raw buckets. The Macie job execution time is around 4-5 minutes. After scanning, if there are fewer than 1,000 sensitive records, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks for severity and the report in the specified path. If sensitive data is detected, it is moved to the quarantine bucket. We format this data into parquet and process it using Spark data frames.

    • Inspecting the resulting Parquet file, sensitive data can be clearly seen in the SSN and phone number columns.
    • Once sensitive data is found, the same file is moved to the quarantine bucket.

      If there are no sensitive records, the data is moved to the raw zone, from where it is sent onward to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or implement custom airflow on EC2 or EKS.
    • GlueJobOperator: Executes all the Glue jobs pre and post-Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to be completed.
    • StepFunctionGetExecutionOutputOperator: Gets the output from the Step Function.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.
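
To ground the triggerMacieJob step referenced above, here is a minimal sketch of what such a Lambda handler can look like with boto3. The bucket name, account ID, environment-variable wiring, and identifier IDs are placeholders, not the exact Tiger Data Fabric implementation.

```python
import os
import time
import boto3

macie = boto3.client("macie2")

def lambda_handler(event, context):
    """Create a one-time Macie classification job for the scan-stage bucket."""
    # Bucket and identifier IDs are illustrative; in practice they come from the pipeline config.
    bucket = event.get("scan_bucket", os.environ.get("SCAN_BUCKET", "example-macie-scan-bucket"))
    account_id = os.environ["ACCOUNT_ID"]

    response = macie.create_classification_job(
        jobType="ONE_TIME",
        name=f"sensitive-data-scan-{int(time.time())}",
        s3JobDefinition={
            "bucketDefinitions": [{"accountId": account_id, "buckets": [bucket]}],
        },
        # IDs of custom data identifiers registered beforehand (see the later sketch).
        customDataIdentifierIds=event.get("custom_identifier_ids", []),
    )
    # The job id is passed downstream so checkJobStatus can poll describe_classification_job.
    return {"jobId": response["jobId"], "jobArn": response["jobArn"]}
```

Managed identifiers (such as the credit card or SSN detectors) can be included or excluded in the same call through the managedDataIdentifierIds and managedDataIdentifierSelector parameters.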

Things to Keep in Mind for Effective Implementation

  • Based on our experience integrating AWS Macie with the Tiger Data Fabric, here are some points to keep in mind. Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning the S3 buckets/objects and generates reports that can be consumed by various users, who can then take action accordingly. But if the requirement is to tie it into a pipeline and automate the action based on the reports, then a custom process must be created.
  • Macie stops reporting the location of sensitive data after 1000 occurrences of the same detection type. However, this quota can be increased by requesting AWS. It is important to keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset. If the sensitive data occurrences per detection type exceed 1000, we move the entire file to the quarantine zone.
  • For certain data elements that Macie doesn’t consider sensitive, custom data identifiers help a lot. They can be defined via regular expressions, and their sensitivity can also be customized. Organizations whose data governance teams deem certain internal data sensitive can use this feature (see the sketch after this list).
  • Macie also provides an allow list—this helps in ignoring some of the data elements that Macie tags as sensitive by default.
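
As an illustration of the custom data identifier point above, the snippet below shows how such an identifier might be registered with boto3. The name, regex, keywords, and match distance are placeholders for whatever an organization’s data governance team deems sensitive.

```python
import boto3

macie = boto3.client("macie2")

# Example: an internal employee ID of the form EMP-123456 that Macie's
# managed identifiers would not flag on their own (the pattern is hypothetical).
response = macie.create_custom_data_identifier(
    name="internal-employee-id",
    description="Internal employee identifiers flagged by the data governance team",
    regex=r"EMP-\d{6}",
    keywords=["employee id", "emp id"],   # nearby keywords required for a match
    maximumMatchDistance=50,              # how close a keyword must be to the matched text
)

custom_identifier_id = response["customDataIdentifierId"]
# This id can then be passed to create_classification_job via customDataIdentifierIds.
```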

The AWS Macie – Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as employing regular expressions for data sensitivity and establishing suppression rules within the data fabrics they are working on, data engineers gain enhanced control and capabilities over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

Data Science Strategies for Effective Process System Maintenance
https://www.tigeranalytics.com/perspectives/blog/harness-power-data-science-maintenance-process-systems/ | Mon, 20 Dec 2021

Industry understanding of managing planned maintenance is fairly mature. This article focuses on how Data Science can impact unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and subsystems.

Data Science applications are gaining significant traction in the preventive and predictive maintenance of process systems across industries. A clear mindset shift has made it possible to steer maintenance from a reactive, run-to-failure approach to one that is proactive and preventive in nature.

Planned or scheduled maintenance uses data and experiential knowledge to determine the periodicity of servicing required to maintain the health of plant components. It is typically driven by plant maintenance teams or OEMs through maintenance rosters and AMCs. Unplanned maintenance, on the other hand, occurs at random and impacts downtime/production, safety, inventory, and customer sentiment, besides adding to the cost of maintenance (including labor and material).

Interestingly, statistics reveal that almost 50% of the scheduled maintenance projects are unnecessary and almost a third of them are improperly carried out. Poor maintenance strategies are known to cost organizations as much as 20% of their production capacity – shaving off the benefits that a move from reactive to preventive maintenance approach would provide. Despite years of expertise available in managing maintenance activities, unplanned downtime impacts almost 82% of businesses at least once every three years. Given the significant impact on production capacity, aggregated annual downtime costs for the manufacturing sector are upwards of $50 billion (WSJ) with average hourly costs of unplanned maintenance in the range of $250K.

It is against this backdrop that data-driven solutions need to be developed and deployed. Can Data Science solutions bring about significant improvement in the maintenance domain and prevent any or all of the above costs? Are the solutions scalable? Do they provide an understanding of what went wrong? Can they provide insights into alternative and improved ways to manage planned maintenance activities? Does Data Science help reduce all types of unplanned events or just a select few? These are questions that manufacturers need answered, and it is for experts from both the maintenance and data science domains to address them.

Industry understanding of managing planned maintenance is fairly mature. This article therefore focuses on unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and its subsystems.

Data Science solutions are accelerating the industry’s move towards ‘on-demand’ maintenance wherein interventions are made only if and when required. Rather than follow a fixed maintenance schedule, data science tools can now aid plants to increase run lengths between maintenance cycles in addition to improving plant safety and reliability. Besides the direct benefits that result in reduced unplanned downtime and cost of maintenance, operating equipment at higher levels of efficiency improves the overall economics of operation.

The success of this approach was demonstrated in refinery CDU preheat trains that use soft sensing triggers to decide when to process ‘clean crude’ (to mitigate the fouling impact) or schedule maintenance of fouled exchangers. Other successes were in the deployment of plant-wide maintenance of control valves, multiple-effect evaporators in plugging service, compressors in petrochemical service, and a geo-wide network of HVAC systems.

Instead of using a fixed roster for maintenance of PID control valves, plants can now detect and diagnose control valves that are malfunctioning. Additionally, in combination with domain and operations information, it can be used to suggest prescriptive actions such as auto-tuning of the valves, which improve maintenance and operations metrics.

Reducing unplanned, unavoidable events

It is important to bear in mind that not all unplanned events are avoidable. The inability to avoid events could be either because they are not detectable enough or because they are not actionable. The latter could occur either because the response time available is too low or because the knowledge to revert a system to its normal state does not exist. A large number of unplanned events however are avoidable, and the use of data science tools improves their detection and prevention with greater accuracy.

The focus of the experts working in this domain is to reduce unplanned events and transition events from unavoidable to avoidable. Using advanced tools for detection, diagnosis, and enabling timely actions to be taken, companies have managed to reduce their downtime costs significantly. The diversity of solutions that are available in the maintenance area covers both plant and process subsystems.

Some of the data science techniques deployed in the maintenance domain are briefly described below:

Condition Monitoring
This has been used to monitor and analyze process systems over time, and predict the occurrence of an anomaly. These events or anomalies could have short or long propagation times, such as those seen in exchanger fouling or pump cavitation. The spectrum of solutions in this area includes real-time/offline modes of analysis, edge/IoT devices, open/closed loop prescriptions, and more. In some cases, monitoring also involves the use of soft sensors to detect fouling, surface roughness, or hardness – these parameters cannot be measured directly using a sensor and therefore need surrogate measuring techniques.

Perhaps one of the most unique challenges working in the manufacturing domain is in the use of data reconciliation. Sensor data tend to be spurious and prone to operational fluctuations, drift, biases, and other errors. Using raw sensor information is unlikely to satisfy the material and energy balance for process units. Data reconciliation uses a first-principles understanding of the process systems and assigns a ‘true value’ to each sensor. These revised sensor values allow a more rigorous approach to condition monitoring, which would otherwise expose process systems to greater risk when using raw sensor information. Sensor validation, a technique to analyze individual sensors in tandem with data reconciliation, is critical to setting a strong foundation for any analytics models to be deployed. These elaborate areas of work ensure a greater degree of success when deploying any solution that involves the use of sensor data.
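
To make the reconciliation idea concrete, here is a small, self-contained sketch of the classical weighted least-squares formulation: raw readings y are adjusted to reconciled values x that satisfy a linear mass balance Ax = 0, with adjustments weighted by each sensor’s variance. The toy flow network and variances are illustrative assumptions.

```python
import numpy as np

def reconcile(y, A, variances):
    """Weighted least-squares data reconciliation.

    Minimizes (x - y)' V^-1 (x - y) subject to A @ x = 0, with V = diag(variances).
    Closed-form solution: x = y - V A' (A V A')^-1 A y.
    """
    V = np.diag(variances)
    correction = V @ A.T @ np.linalg.solve(A @ V @ A.T, A @ y)
    return y - correction

# Toy flowsheet: stream 1 splits into streams 2 and 3, so F1 - F2 - F3 = 0.
A = np.array([[1.0, -1.0, -1.0]])
raw_flows = np.array([100.4, 64.5, 36.7])        # raw readings do not balance (100.4 vs 101.2)
sensor_variance = np.array([4.0, 1.0, 1.0])      # less-trusted sensors get larger variance

reconciled = reconcile(raw_flows, A, sensor_variance)
print(reconciled, A @ reconciled)                # reconciled flows now close the mass balance (~0)
```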

Fault Detection
This is a mature area of work and uses solutions ranging from those that are driven entirely by domain knowledge, such as pump curves and detection of anomalies thereof, to those that rely only on historical sensor/maintenance/operations data for analysis. An anomaly or fault is defined as a deviation from ‘acceptable’ operation but the context and definitions need to be clearly understood when working with different clients. Faults may be related to equipment, quality, plant systems, or operability. A good business context and understanding of client requirements are necessary for the design and deployment of the right techniques. From basic tools that use sensor thresholds, run charts, and more advanced techniques such as classification, pattern analysis, regression, a wide range of solutions can be successfully deployed.

Early Warning Systems
The detection of process anomalies in advance helps in the proactive management of abnormal events. Improving actionability or response time allows faults to be addressed before setpoints/interlocks are triggered. The methodology varies across projects and there is no ‘one-size-fits-all’ approach. Problem complexity could range from using single sensor information as lead indicators (such as using sustained pressure loss in a vessel to identify a faulty gasket that might rupture) to far more complex methods of analysis.

A typical challenge in developing early warning systems is achieving 100% detectability of anomalies; an even larger challenge is filtering out false indications. Detecting all anomalies while robustly suppressing false alarms is critical to successful deployment.
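
One simple way to balance detectability against false alarms is a rolling z-score combined with a persistence filter, so a warning fires only when a deviation is sustained rather than a one-off spike. The sketch below is illustrative; window sizes and thresholds are assumptions that would be tuned per system.

```python
import numpy as np
import pandas as pd

def early_warnings(signal, window=60, z_threshold=3.0, min_persistence=5):
    """Flag sustained deviations from a rolling baseline.

    signal: pandas Series of sensor readings indexed by time.
    A warning is raised only when |z| exceeds z_threshold for
    min_persistence consecutive samples, filtering one-off spikes.
    """
    baseline = signal.rolling(window, min_periods=window).mean()
    spread = signal.rolling(window, min_periods=window).std()
    z = (signal - baseline) / spread

    exceed = (z.abs() > z_threshold).astype(int)
    # Count consecutive exceedances; the counter resets whenever we drop below the threshold.
    run_length = exceed.groupby((exceed == 0).cumsum()).cumsum()
    return run_length >= min_persistence

# Synthetic example: stable operation with a slow drift starting at sample 800.
idx = pd.date_range("2024-01-01", periods=1000, freq="min")
values = np.random.normal(50, 1, 1000)
values[800:] += np.linspace(0, 8, 200)            # slow fault propagation
alerts = early_warnings(pd.Series(values, index=idx))
print(alerts.idxmax() if alerts.any() else "no warning")  # first timestamp flagged
```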

Enhanced Insights for Fault Identification
The importance of detection and response time in the prevention of an event cannot be overstated. But what if an incident is not easy to detect, or the propagation of the fault is too rapid to allow any time for action? The first level involves using machine-driven solutions for detection, such as computer vision models, which are rapidly changing the landscape. Using these models, it is now possible to improve prediction accuracies for processes that were either not monitored or monitored manually. The second is to integrate the combined expertise of personnel from various job functions such as technologists, operators, maintenance engineers, and supervisors. At this level of maturity, the solution is able to baseline with the best that current operations aim to achieve. The third, and by far the most complex, is to move more faults into the ‘detectable’ and actionable realm. One such case was witnessed in a complex process in the metal smelting industry. Advanced data science techniques using a digital twin amplified signal responses and analyzed multiple process parameters to predict the occurrence of an incident ahead of time. By gaining an order-of-magnitude improvement in response time, it was possible to move the process fault from an unavoidable to an avoidable and actionable category.

With the context provided above, it is possible to choose a modeling approach and customize the solutions to suit the problem landscape:

Figure: Data analytics in process system maintenance

Different approaches to Data Analytics

Domain-driven solution
First-principles and the rule-based approach is an example of a domain-driven solution. Traditional ways of delivering solutions for manufacturing often involve computationally intensive solutions (such as process simulation, modeling, and optimization). In one of the difficult-to-model plants, deployment was done using rule engines that allow domain knowledge and experience to determine patterns and cause-effect relationships. Alarms were triggered and advisories/recommendations were sent to the concerned stakeholders regarding what specific actions to undertake each time the model identified an impending event.

Domain-driven approaches also come in handy in the case of ‘cold start’ where solutions need to be deployed with little or no data availability. In some deployments in the mechanical domain, the first-principles approach helped identify >85% of the process faults even at the start of operations.

Pure data-driven solutions
A recent trend in the process industry is the move away from domain-driven solutions due to challenges in finding the right skills to deploy them, computation infrastructure requirements, the need for customized maintenance solutions, and the requirement to provide real-time recommendations. Complex systems such as naphtha cracking and alumina smelting, which are hard to model, have harnessed the power of data science to not just diagnose process faults but also enhance response time and bring more finesse to the solutions.

In some cases, domain-driven tools have provided high levels of accuracy in analyzing faults. One such case was related to compressor faults where domain data was used to classify them based on a loose bearing, defective blade, or polymer deposit in the turbine subsystems. Each of these faults was identified using sensor signatures and patterns associated with it. Besides getting to the root cause, this also helped prescribe action to move the compressor system away from anomalous operation.

These solutions need to ensure that the operating envelope and data availability cover all possible scenarios. The poor success of deployments using this approach is largely due to insufficient data covering plant operations and maintenance. However, the number of players offering purely data-driven solutions is large and is quickly replacing what was traditionally part of a domain engineer’s playbook.

Blended solutions
Blended solutions for the maintenance of process systems combine the understanding of both data science and domain. One such project was in the real-time monitoring and preventive maintenance of >1200 HVAC units across a large geographic area. The domain rules were used to detect and diagnose faults and also identify operating scenarios to improve the reliability of the solutions. A good understanding of the domain helps in isolating multiple anomalies, reducing false positives, suggesting the right prescriptions, and more importantly, in the interpretability of the data-driven solutions.

The differentiation comes from combining the intelligence of AI/ML models with domain knowledge and deployment experience, all integrated into the model framework.

Customizing the toolkit and determining the appropriate modeling approach are critical to delivery. The uniqueness of each plant and problem, and the requirement for a high degree of customization, make deploying solutions in a manufacturing environment fairly challenging. This fact is validated by the limited number of solution providers serving this space. However, the complexity and nature of the landscape need to be well understood by both the client and the service provider. It is important to note that not all problems in the maintenance space are ‘big data’ problems requiring analysis in real time using high-frequency data. Some faults with long propagation times can use values averaged over a period of time, while other systems with short response time requirements may require real-time data. Where maintenance logs and annotations related to each event (and corrective action) are recorded, one could go with a supervised learning approach, but this is not always possible. In cases where data on faults and anomalies is not available, a one-class approach to classify the operation into normal/abnormal modes has also been used. Solution maturity improves with more data and failure modes identified over time.
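
As a rough sketch of the one-class approach mentioned above, a one-class SVM can be trained on data from known-normal operation and then used to flag abnormal operating modes. The features, synthetic data, and nu setting below are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Train only on periods labelled (or assumed) normal: rows = time windows, columns = sensor features.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=[60.0, 5.2, 1450.0], scale=[1.5, 0.1, 20.0], size=(2000, 3))  # temp, pressure, RPM

scaler = StandardScaler().fit(X_normal)
model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(scaler.transform(X_normal))

# Score new operating data: +1 = normal mode, -1 = abnormal mode.
X_new = np.array([
    [60.5, 5.25, 1448.0],   # typical operation
    [71.0, 6.40, 1300.0],   # drifted operation
])
labels = model.predict(scaler.transform(X_new)))
print(labels)  # e.g., [ 1 -1 ]

# As failure modes get labelled over time, the same features can feed supervised classifiers.
```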

A staged solution approach helps in bringing in the right level of complexity to deliver solutions that evolve over time. Needless to say, it takes a lot of experience and prowess to marry the generalized understanding with the customization that each solution demands.

Edge/IoT

A fair amount of investment needs to be made at the beginning of the project to understand the hardware and solution architecture required for successful deployment. While the security of data is a primary consideration, other factors such as computational power, cost, time, response time, open/closed-loop architecture are added considerations in determining the solution framework. Experience and knowledge help understand additional sensing requirements and sensor placement, performance enhancement through edge/cloud-based solutions, data privacy, synchronicity with other process systems, and much more.

By far, the largest challenge is on the data front (data that is sparse, scattered, unclean, disorganized, unstructured, not digitized, and so on), which prevents businesses from seeing quick success. Digitization and creating data repositories, which set the foundation for model development, take a lot of time.

There is also a multitude of control systems, specialized infrastructure, legacy systems within the same manufacturing complex that one may need to work through. End-to-end delivery with the front-end complexity in data management creates a significant entry barrier for service providers in the maintenance space.

Maintenance cuts across multiple layers of a process system. Maintenance solutions vary as one moves from a sensor to a control loop, to equipment with multiple control valves, all the way to the flowsheet/enterprise layer. Maintenance across these layers requires a deep understanding of both the hardware and process aspects, a combination that is often hard to put together. Sensors and control valves are typically maintained by those with an instrumentation background, while equipment maintenance could fall in a mechanical or chemical engineer’s domain. On the other hand, process anomalies that could have a plant-level impact are often in the domain of operations/technology experts or process engineers.

Data Science facilitates the development of insights and generalizations required to build understanding around a complex topic like maintenance. It helps in the generalization and translation of learnings across layers within the process systems from sensors all the way to enterprise and other industry domains as well. It is a matter of time before analytics-driven solutions that help maintain safe and reliable operations become an integral part of plant operations and maintenance systems. We need to aim towards the successes that we witness in the medical diagnostics domain where intelligent machines are capable of detecting and diagnosing anomalies. We hope that similar analytics solutions will go a long way to keep plants safe, reduce downtime and provide the best of operations efficiencies that a sustainable world demands.

Today, the barriers to success lie in the ability to develop a clear understanding of the problem landscape, plan end-to-end, and deliver customized solutions that take business priorities and ROI into account. Achieving success at a large scale will demand reducing the level of customization required in each deployment – a constraint that few subject matter experts in the area can overcome today.

Defining Financial Ethics: Transparency and Fairness in Financial Institutions’ use of AI and ML
https://www.tigeranalytics.com/perspectives/blog/transparency-financial-institutions-use-artificial-intelligence-machine-learning/ | Fri, 10 Dec 2021

While time, cost, and efficiency have seen drastic improvement thanks to AI/ML, concerns over transparency, accountability, and inclusivity prevail. This article provides important insight into how financial institutions can maintain a sense of clarity and inclusiveness.

The last few years have seen a rapid acceleration in the use of disruptive technologies such as Machine Learning and Artificial Intelligence in financial institutions (FIs). Improved software and hardware, coupled with a digital-first outlook, have led to a steep rise in the use of such applications to advance outcomes for consumers and businesses alike.

By embracing AI/ML, the early adopters in the industry have been able to streamline decision processes involving large amounts of data, avoid bias, and reduce the chances of error and fraud. Even the more traditional banks are investing in AI systems and using state-of-the-art ML and deep learning algorithms that pave the way for quicker and better reactions to changing consumer needs and market dynamics.

The Covid-19 pandemic has only aided in making the use of AI/ML-based tools more widespread and easily scalable across sectors. At Tiger Analytics, we have been at the heart of the action and have assisted several clients to reap the benefits of AI/ML across the value chain.
Pilot use cases where FIs have seen success using AI/ML-based solutions:

  • Smarter risk management
  • Real-time investment advice
  • Enhanced access to credit
  • Automated underwriting
  • Intelligent customer service and chatbots

The challenges

While time, cost, and efficiency have seen drastic improvement thanks to AI/ML, concerns over transparency, accountability, and inclusivity prevail. Given how highly regulated and impactful the industry is, it becomes pertinent to maintain a sense of clarity and inclusiveness.
Problems in governance of AI/ML:

  • Transparency
  • Fairness
  • Bias
  • Reliability/soundness
  • Accountability

How can we achieve this? By, first and foremost, finding and evaluating safe and responsible ways to integrate AI/ML into everyday processes to better suit the needs of clients and customers.

By making certain guidelines uniform and standardized, we can set the tone for successful AI/ML implementation. This involves robust internal governance processes and frameworks, as well as timely interventions and checks, as outlined in Tiger’s response document and comments to the regulatory agencies in the US.

These checks become even more relevant where regulatory standards or guidance on the use of AI in FIs are inadequate. However, efforts are being made to hold FIs to some kind of standard.

The table below illustrates the issuance of AI guidelines across different countries:

Table: Issuance of AI guidelines across different countries

Source: FSI Insights on Policy Implementation No. 35, By Jeremy Prenio & Jeffrey Yong, August 2021

Supervisory guidelines and regulations must be understood and customized to suit the needs of the various sectors.

To overcome these challenges, this step of creating uniform guidance by the regulatory agencies is essential — it opens up a dialogue on the usage of AI/ML-based solutions, and also brings in different and diverse voices from the industry to share their triumphs and concerns.

Putting it out there

As a global analytics firm that specializes in creating bespoke AI and ML-based solutions for a host of clients, at Tiger, we recognize the relevance of a framework of guidelines that enable feelings of trust and responsibility.

It was this intention of bringing in more transparency that led us to put forward our response to the Request for Information and Comment on Financial Institutions’ Use of Artificial Intelligence, including Machine Learning (RFI) by the following agencies:

  • Board of Governors of the Federal Reserve System (FRB)
  • Bureau of Consumer Financial Protection (CFPB)
  • Federal Deposit Insurance Corporation (FDIC)
  • National Credit Union Administration (NCUA) and,
  • Office of the Comptroller of the Currency (OCC)

Our response to the RFI is structured in such a way that it is easily accessible to even those without the academic and technical knowledge of AI/ML. We have kept the conversation generic, steering away from deep technical jargon in our views.

Ultimately, we recognize that the role of regulations around models involving AI and ML is to create fairness and transparency for everyone involved.

Transparency and accountability are foundation stones at Tiger too, which we apply while developing powerful AI and ML-based solutions for our clients — be they large or community banks, credit unions, fintechs, or other financial services providers.

We are eager to see the outcome of this exercise and hope that it will result in consensus and uniformity of definitions, help in distinguishing facts from myth, and allow for a gradation of actual and perceived risks arising from the use of AI and ML models.

We hope that our response not only highlights our commitment to creating global standards in AI/ML regulation, but also echoes Tiger’s own work culture and belief system of fairness, inclusivity, and equality.

Want to learn more about our response? Refer to our recent interagency submission.

You can download Tiger’s full response here.

Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins
https://www.tigeranalytics.com/perspectives/blog/ml-powered-digital-twin-predictive-maintenance/ | Thu, 24 Dec 2020

Tiger Analytics leverages ML-powered digital twins for predictive maintenance in manufacturing. By integrating sensor data and other inputs, we enable anomaly detection, forecasting, and operational insights. Our modular approach ensures scalability and self-sustainability, yielding cost-effective and efficient solutions.

Historically, manufacturing equipment maintenance has been done during scheduled service downtime. This involves periodically stopping production to carry out routine inspections, maintenance, and repairs. Unexpected equipment breakdowns disrupt the production schedule, require expensive part replacements, and delay the resumption of operations due to long procurement lead times.

Sensors that measure and record operational parameters (temperature, pressure, vibration, RPM, etc.) have been affixed on machinery at manufacturing plants for several years. Traditionally, the data generated by these sensors was compiled, cleaned, and analyzed manually to determine failure rates and create maintenance schedules. But every equipment downtime for maintenance, whether planned or unplanned, is a source of lost revenue and increased cost. The manual process was time-consuming, tedious, and hard to handle as the volume of data rose.

The ability to predict the likelihood of a breakdown can help manufacturers take pre-emptive action to minimize downtime, keep production on track, and control maintenance spending. Recognizing this, companies are increasingly building both reactive and predictive computer-based models based on sensor data. The challenge these models face is the lack of a standard framework for creating and selecting the right one. Model effectiveness largely depends on the skill of the data scientist. Each model must be built separately; model selection is constrained by time and resources; and models must be updated regularly with fresh data to sustain their predictive value.

As more equipment types come under the analytical ambit, this approach becomes prohibitively expensive. Further, the sensor data is not always leveraged to its full potential to detect anomalies or provide early warnings about impending breakdowns.

In the last decade, the Industrial Internet of Things (IIoT) has revolutionized predictive maintenance. Sensors record operational data in real-time and transmit it to a cloud database. This dataset feeds a digital twin, a computer-generated model that mirrors the physical operation of each machine. The concept of the digital twin has enabled manufacturing companies not only to plan maintenance but to get early warnings of the likelihood of a breakdown, pinpoint the cause, and run scenario analyses in which operational parameters can be varied at will to understand their impact on equipment performance.

Several eminent ‘brand’ products exist to create these digital twins, but the software is often challenging to customize, cannot always accommodate the specific needs of each and every manufacturing environment, and significantly increases the total cost of ownership.

ML-powered digital twins can address these issues when they are purpose-built to suit each company’s specific situation. They are affordable, scalable, self-sustaining, and, with the right user interface, are extremely useful in telling machine operators the exact condition of the equipment under their care. Before embarking on the journey of leveraging ML-powered digital twins, certain critical steps must be taken:

1. Creation of an inventory of the available equipment, associated sensors and data.

2. Analysis of the inventory in consultation with plant operations teams to identify the gaps. Typical issues may include missing or insufficient data from the sensors; machinery that lacks sensors; and sensors that do not correctly or regularly send data to the database.

3. Coordination between the manufacturing operations and analytics/technology teams to address some gaps: installing sensors if lacking (‘sensorization’); ensuring that sensor readings can be and are being sent to the cloud database; and developing contingency approaches for situations in which no data is generated (e.g., equipment idle time).

4. A second readiness assessment, followed by a data quality assessment, must be performed to ensure that a strong foundation of data exists for solution development.

This creates the basis for a cloud-based, ML-powered digital twin solution for predictive maintenance. To deliver the most value, such a solution should:

  • Use sensor data in combination with other data as necessary
  • Perform root cause analyses of past breakdowns to inform predictions and risk assessments
  • Alert operators of operational anomalies
  • Provide early warnings of impending failures
  • Generate forecasts of the likely operational situation
  • Be demonstrably effective to encourage its adoption and extensive utilization
  • Be simple for operators to use, navigate and understand
  • Be flexible to fit the specific needs of the machines being managed

Figure: Predictive maintenance cycle

When model-building begins, the first step is to account for the input data frequency. As sensors take readings at short intervals, timestamps must be regularized and the data resampled for all connected parameters where required. At this stage, data with very low variance or too few observations may be excised. Model datasets containing sensor readings (the predictors) and event data such as failures and stoppages (the outcomes) are then created for each machine using both dependent and independent variable formats.

To select the right model for anomaly detection, multiple models are tested and scored on the full data set and validated against history. To generate a short-term forecast, gaps related to machine testing or idle time must be accounted for, and a range of models evaluated to determine which one performs best.

Tiger Analytics used a similar approach when building these predictive maintenance systems for an Indian multinational steel manufacturer. Here, we found that regression was the best approach to flag anomalies. For forecasting, the accuracy of Random Forest models was higher compared to ARIMA, ARIMAX, and exponential smoothing.

Figure: Predictive maintenance analysis flow
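
For illustration, the sketch below shows the general shape of such a pipeline: irregular sensor timestamps are regularized by resampling, lag features are built, and a Random Forest produces a short-term forecast with a simple chronological validation split. Column names, resampling frequency, and lag counts are assumptions, not the client’s actual configuration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def make_forecast_frame(raw: pd.DataFrame, target: str = "temperature", n_lags: int = 6) -> pd.DataFrame:
    """Resample irregular sensor readings to an hourly grid and build lag features."""
    hourly = raw.resample("1h").mean().interpolate(limit=3)    # regularize timestamps, bridge short gaps
    frame = pd.DataFrame({"y": hourly[target]})
    for lag in range(1, n_lags + 1):
        frame[f"lag_{lag}"] = hourly[target].shift(lag)         # past values as predictors
    return frame.dropna()

# `sensor_df` is assumed to be a DataFrame indexed by timestamp with raw sensor columns.
def fit_forecaster(sensor_df: pd.DataFrame):
    data = make_forecast_frame(sensor_df)
    X, y = data.drop(columns="y"), data["y"]
    split = int(len(data) * 0.8)                                # chronological train/validation split
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(X.iloc[:split], y.iloc[:split])
    validation_score = model.score(X.iloc[split:], y.iloc[split:])
    return model, validation_score
```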

Using a modular paradigm to build an ML-powered digital twin makes it straightforward to implement and deploy. It does not require frequent manual recalibration to remain self-sustaining, and it is scalable, so it can be implemented across a wide range of equipment with minimal additional effort and time.

Careful execution of the preparatory actions is as important as strong model-building to the success of this approach and its long-term viability. To address the challenge of low-cost, high-efficiency predictive maintenance in the manufacturing sector, employ this sustainable solution: a combination of technology, business intelligence, data science, user-centric design, and the operational expertise of the manufacturing employees.

This article was first published in Analytics India Magazine.

Building Data Engineering Solutions: A Step-by-Step Guide with AWS
https://www.tigeranalytics.com/perspectives/blog/data-engineering-implementation-using-aws/ | Thu, 14 Feb 2019

In this article, delve into the intricacies of an AWS-based Analytics pipeline. Learn to apply this design thinking to tackle similar challenges you might encounter and to streamline data workflows.

Introduction:

Lots of small to midsize companies use Analytics to understand business activity, lower their costs, and increase their reach. Some of these companies may intend to build and maintain an Analytics pipeline but change their mind when they see how much money and tech know-how it takes. For any enterprise, data is an asset, and they are unwilling to share this asset with external players lest they risk their market advantage. To extract maximum value from intelligence harvesting, enterprises need to build and maintain their own data warehouses and surrounding infrastructure.

The Analytics field is buzzing with talks on applications related to Machine Learning, which have complex requirements like storing and processing unstructured streaming data. Instead of pushing themselves towards advanced analytics, companies can extract a lot of value simply by using good reporting infrastructure. This is because currently a lot of SME activity is still at the batch data level. From an infrastructure POV, cloud players like Amazon Web Services (AWS) and Microsoft Azure have taken away a lot of complexity. This has enabled companies to implement an accurate, robust reporting infrastructure (more or less) independently and economically. This article is about a specific lightweight implementation of Data Engineering using AWS, which would be perfect for an SME. By the time you finish reading this, you will:

1) Understand the basics of a simple Data Engineering pipeline
2) Know the details of a specific kind of AWS-based Analytics pipeline
3) Apply this design thinking to a similar problem you may come across

Analytics Data Pipeline:

SMEs have their business activity data stored in different places. Getting it all together so that a broad picture of the business’s health emerges is one of the big challenges in analytics. Gathering data from sources, storing it in a structured and accurate manner, then using that data to create reports and visualizations can give SMEs relatively large gains. From a process standpoint, this is what it might look like:

Figure 1: Simple Data Pipeline

But from a business activity effort standpoint, it’s more like:

Figure 2: Business Activity involved in a Data Pipeline

Here’s what’s interesting: although the first two components of the process consume most time and effort, when you look at it from a value chain standpoint, value is realized in the Analyze component.

Figure 3: Analytics Value Chain

The curiously inverse relationship between effort and value keeps SMEs wondering if they will realize the returns they expect on their investment and minimize costs. Analytics today might seem to be all about Machine Learning and cutting-edge technology, but SMEs can realize a lot of value by using relatively simple analytics like:

1) Time series graph on business activity for leadership
2) Bar graph visualization for sales growth over the years
3) For the Sales team: a refreshed, filterable dashboard showing the top ten clients over a chosen time period
4) For the Operations team: an email blast every morning at eight depicting business activity expense over a chosen time period

Many strategic challenges that SMEs face, like business reorganization, controlling operating costs, crisis management, require accurate data to solve. Having an Analytics data pipeline in the cloud allows enterprises to take cost-optimized, data-driven decisions. These can include both strategic decision-making for C-Suite and business-as-usual metrics for the Operations and Sales teams, allowing executives to track their progress. In a nutshell, an Analytics data pipeline makes company information accessible to executives. This is valuable in itself because it enables metrics monitoring (including the derived benefits like forecasting predictions). There you have it, folks: a convincing case for SMEs to experiment with building an in-house Analytics pipeline.

Mechanics of the pipeline:

Before we get into vendors and the value they bring, here’s something for you to think about: there are as many ways to build an Analytics pipeline as there are stars in the sky. The challenge here is to create a data pipeline that is hosted on a secure cloud infrastructure. It’s important to use cloud-native compute and storage components so that the infrastructure is easy to build and operate for an SME.

Usually, source data for SMEs are in the following formats:

1) Payment information stored in Excel
2) Business activity information coming in as API
3) Third-party interaction exported as a .CSV to a location like S3

Using AWS as a platform enables SMEs to leverage the serverless compute feature of AWS Lambda when ingesting the source data into an Aurora Postgres RDBMS. Lambda supports many programming runtimes, including Python, a widely used language. Back in 2016-17, the total runtime for Lambda was capped at five minutes, which was not nearly enough for ETL. Two years later, the limit was increased to 15 minutes. This is still too little time to execute most ETL jobs, but enough for the batch data ingestion requirements of SMEs.

Lambda is usually hosted within a private subnet in the enterprise Virtual Private Cloud (VPC), but it can communicate with third-party source systems through a Network Address Translator (NAT) and Internet Gateways (IG). Python’s libraries (like Pandas) make tabular data quick and easy to process. Once processed, the output dataframe from Lambda is stored in a table in the Aurora Postgres database (the Aurora prefix denotes the AWS flavor of the Postgres database offering). It makes sense to choose a vanilla relational database because most data is in Excel-type rows-and-columns format anyway, and reporting engines like Tableau and other BI tools work well with RDBMS engines.
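
A minimal sketch of such an ingestion Lambda is shown below. The bucket, key, table, and connection details are placeholders, and pandas, SQLAlchemy, and psycopg2 are assumed to be packaged with the function (for example, as a Lambda layer).

```python
import io
import os
import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Read a CSV landed in S3, apply light transformations, and load it into Aurora Postgres."""
    bucket = event["bucket"]                       # e.g., passed in by the CloudWatch rule or S3 event
    key = event["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # Light, Excel-style cleanup before loading.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df["ingested_at"] = pd.Timestamp.utcnow()      # metadata column used later for accuracy checks

    engine = create_engine(
        "postgresql+psycopg2://{user}:{pwd}@{host}:5432/{db}".format(
            user=os.environ["DB_USER"], pwd=os.environ["DB_PASSWORD"],
            host=os.environ["DB_HOST"], db=os.environ["DB_NAME"],
        )
    )
    df.to_sql("payments_raw", engine, schema="staging", if_exists="append", index=False)
    return {"rows_loaded": len(df)}
```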

Mapping the components to the process outlined in Figure 1, we get:

Figure 4: Revisiting Analytics pipeline

AWS Architecture:

Let’s take a deeper look into AWS architecture.

Figure 5: AWS-based batch data processing architecture using Serverless Lambda function and RDS database

Figure 5 adds more details to the AWS aspects of a Data Engineering pipeline. Operating on AWS requires companies to share security responsibilities such as:

1) Hosting AWS components with a VPC
2) Identifying public and private subnets
3) Ensuring IG and NAT Gateways can allow components hosted within private subnets to communicate with the internet
4) Provisioning the Database as publicly not accessible
5) Setting aside a dedicated EC2 to route web traffic to this publicly inaccessible database
6) Provisioning security groups for EC2’s public subnet (Lambda in private subnet and Database in DB subnet)
7) Provisioning subnets for app and DB tier in two different Availability Zones (AZ) to ensure (a) DB tier provisioning requirements are met, and (b) Lambda doesn’t run out of IPs when triggered

Running the pipeline:

New data is ingested by timed invocation of Lambda using CloudWatch rules. CloudWatch monitors AWS resources and invokes services at set times using cron expressions. CloudWatch can also be used, much like a SQL Server Job Agent, to trigger Lambda events. This accommodates activities with different frequencies like:

1) Refreshing sales activity (daily)
2) Operating Costs information (weekly)
3) Payment activity (biweekly)
4) Tax information (monthly)

CloudWatch can trigger a specific Python script deployed to Lambda (one that takes data from the source, performs the necessary transformations, and loads it into a table with a known structure) once the respective source file or data refresh frequency is known.

Moving on to Postgres, its Materialized View and SQL stored procedure features (which allow further processing) can also be invoked using a combination of Lambda and CloudWatch. This workflow is helpful for propagating base data after a refresh into denormalized, wide tables that can store company-wide sales and operations information.

Figure 6: An example of data flow for building aggregate metrics
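
As a rough sketch of that refresh step, a CloudWatch rule with a cron schedule can invoke a small Lambda that runs REFRESH MATERIALIZED VIEW against Aurora Postgres. The view names, credentials, and the psycopg2 dependency (packaged as a layer) are assumptions.

```python
import os
import psycopg2

def lambda_handler(event, context):
    """Refresh reporting views after the base tables have been reloaded.

    Triggered by a CloudWatch rule such as cron(0 8 * * ? *) -- every day at 08:00 UTC.
    """
    views = event.get("views", ["reporting.sales_wide", "reporting.operations_wide"])
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        with conn, conn.cursor() as cur:
            for view in views:
                cur.execute(f"REFRESH MATERIALIZED VIEW {view};")   # view names come from trusted config only
        return {"refreshed": views}
    finally:
        conn.close()
```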

Once respective views are refreshed with the latest data, we can connect to the Database using a BI tool for reporting and analysis. It’s important to remember that because we are operating on the AWS ecosystem, the Database must be provisioned as publicly inaccessible and be hosted within a private subnet. Users should only be able to reach it through a web proxy, like nginx or httpd, that is set up on an EC2 on the public subnet to route traffic within the VPC.

Figure 7: BI Connection flow to DB

Access to data can be controlled at the Database level (by granting or denying access to a specific schema) and at the connection level (by whitelisting specific IPs to allow connections and denying connect access by default).

Accuracy is the name of the game:

So you have a really secure and robust AWS architecture, well-tested Python code for Lambda executions, and a not-so-cheap BI tool subscription. Are you all set? Not really. You might just miss the bus if inaccuracy creeps into the tables during data refresh. A dashboard is only as good as the accuracy of the numbers it displays. Take extra care to ensure that the schema tables you have designed include the metadata columns required to identify inaccurate and duplicate data.

Conclusion:

In this article, we took a narrow-angle approach to a specific Data Engineering example. We saw the Effort vs Return spectrum in the Analytics value chain and the value that can be harvested by taking advantage of the available Cloud options. We noted the value in empowering C-suite leaders and company executives with descriptive interactive dashboards.

We looked at building a specific AWS cloud-based Data Engineering pipeline that is relatively uncomplicated and can be implemented by SMEs. We went over the architecture and its different components and briefly touched on the elements of running a pipeline and finally, on the importance of accuracy in reporting and analysis.

Although we saw one specific implementation in this article, the attempt here is to convey the idea that getting value out of an in-house Analytics pipeline is easier than what it used to be say a decade ago. With open source and cloud tools here to make the journey easy, it doesn’t take long to explore and exploit the value hidden in data.
