Machine Learning Archives - Tiger Analytics

Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security
https://www.tigeranalytics.com/perspectives/blog/invisible-threats-visible-solutions-integrating-aws-macie-and-tiger-data-fabric-for-ultimate-security/ | Thu, 07 Mar 2024

Data defenses are now fortified against potential breaches with the Tiger Data Fabric-AWS Macie integration, automating sensitive data discovery, evaluation, and protection in the data pipeline for enhanced security. Explore how to integrate AWS Macie into a data fabric.

Discovering and handling sensitive data in the data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and dealing with the associated costs of resources and computing.  Identifying sensitive information at the entry point of the data pipeline, probably during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. At Tiger Analytics, we’ve integrated these features into the pipelines of our proprietary Data Fabric solution, Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has been enhanced recently to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information and results are captured and stored at another path in the S3 bucket.

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate “sensitive data management” into their data pipelines, here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data at various stages of processing: a raw data bucket for uploading objects into the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket that holds objects in which sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage business logic and execute the steps mentioned below:
    • triggerMacieJob: This Lambda function creates a Macie sensitive data discovery job on the designated S3 bucket during the scan stage (see the sketch after this list).
    • pollWait: This Step Function waits for a specific state to be reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step Functions Choice state determines whether the job has finished; if it has, the subsequent steps are triggered.
    • waitForJobToComplete: This Step Functions state waits for the job to complete, preventing the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database, ensuring that all job statuses are accurately reflected.
  • A Macie scan job scans the specified S3 bucket for sensitive data. The process of creating the Macie job involves multiple steps, allowing us to choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 bucket path for sensitive data buckets to filter out the scan and avoid unnecessary scanning on other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the AWS Data Fabric, we have automated the selection of custom identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be viewed on the AWS console and filtered by S3 bucket. We employed Glue jobs to parse the results and route the data to the manual review and raw buckets. The Macie job execution time is around 4-5 minutes. After scanning, if fewer than 1,000 sensitive records are found, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks the severity and the report in the specified path. If sensitive data is detected, the file is moved to the quarantine bucket. We format this data as Parquet and process it using Spark DataFrames.

    • On inspecting the Parquet output, the sensitive data is clearly visible in the SSN and phone number columns.
    • Once sensitive data is found, the file is moved to the quarantine bucket.

      If no sensitive records are found, the data is moved to the raw zone, from where it is sent on to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or implement custom airflow on EC2 or EKS.
    • GlueJobOperator: Executes all the Glue jobs pre and post-Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to be completed.
    • StepFunctionGetExecutionOutputOperator: Gets the output from the Step Function execution.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.
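As a concrete reference, here is a minimal sketch of what a triggerMacieJob-style Lambda handler could look like using boto3’s macie2 client. The event fields, bucket names, and prefix layout are hypothetical, and the managed data identifier IDs should be confirmed against list_managed_data_identifiers(); this illustrates the pattern rather than the exact Tiger Data Fabric implementation.

```python
# Illustrative sketch of a "triggerMacieJob"-style Lambda handler.
# Event fields, bucket names, and prefixes are hypothetical assumptions.
import uuid
import boto3

macie = boto3.client("macie2")

def lambda_handler(event, context):
    account_id = event["account_id"]
    scan_bucket = event["scan_bucket"]          # e.g. the Macie scanning zone bucket
    scan_prefix = event.get("scan_prefix", "")  # e.g. data/src_sys_id=.../data_asset_id=...

    response = macie.create_classification_job(
        jobType="ONE_TIME",
        name=f"sensitive-data-scan-{uuid.uuid4()}",
        clientToken=str(uuid.uuid4()),
        # Restrict the scan to selected identifiers (per the list above);
        # use macie.list_managed_data_identifiers() to confirm the exact IDs.
        managedDataIdentifierSelector="INCLUDE",
        managedDataIdentifierIds=[
            "CREDIT_CARD_NUMBER", "DRIVERS_LICENSE", "PHONE_NUMBER",
            "USA_PASSPORT_NUMBER", "USA_SOCIAL_SECURITY_NUMBER",
        ],
        s3JobDefinition={
            "bucketDefinitions": [{"accountId": account_id, "buckets": [scan_bucket]}],
            "scoping": {
                "includes": {
                    "and": [{
                        "simpleScopeTerm": {
                            "comparator": "STARTS_WITH",
                            "key": "OBJECT_KEY",
                            "values": [scan_prefix],
                        }
                    }]
                }
            },
        },
    )
    # The job id is passed along so a later state (checkJobStatus) can poll it.
    return {"jobId": response["jobId"]}
```

In a flow like the one described above, the checkJobStatus step would then poll describe_classification_job with the returned jobId until the job status reaches COMPLETE.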

Things to Keep in Mind for Effective Implementation

Based on our experience integrating AWS Macie with the Tiger Data Fabric, here are some points to keep in mind:

  • Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning the S3 buckets/objects and generates reports that various users can consume and act on. But if the requirement is to string it into a pipeline and automate actions based on those reports, then a custom process must be created.
  • Macie stops reporting the location of sensitive data after 1,000 occurrences of the same detection type, although this quota can be increased by raising a request with AWS. Keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset; if the sensitive data occurrences per detection type exceed 1,000, we move the entire file to the quarantine zone.
  • For data elements that Macie doesn’t consider sensitive by default, custom data identifiers help a lot. They can be defined via regular expressions, and their sensitivity can also be customized. Organizations whose data governance teams deem certain internal data sensitive can use this feature (a minimal sketch follows this list).
  • Macie also provides an allow list, which helps in ignoring data elements that Macie would by default tag as sensitive.
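To illustrate the last two points, here is a minimal sketch of registering a custom data identifier and an allow list with boto3. The names, regexes, and the EMP-###### pattern are hypothetical examples, not Tiger Data Fabric defaults.

```python
# Illustrative sketch (hypothetical names and regexes): register an internal
# "employee ID" pattern as a custom data identifier, plus an allow list for
# values that should never be flagged as sensitive.
import uuid
import boto3

macie = boto3.client("macie2")

# Custom data identifier for a pattern the governance team deems sensitive.
custom_id = macie.create_custom_data_identifier(
    name="internal-employee-id",
    description="Internal employee IDs, e.g. EMP-123456",
    regex=r"EMP-\d{6}",
    clientToken=str(uuid.uuid4()),
)["customDataIdentifierId"]

# Allow list for known non-sensitive test values that match broad patterns.
allow_list_id = macie.create_allow_list(
    name="test-fixtures",
    description="Synthetic test values to ignore during scans",
    criteria={"regex": r"999-99-999\d"},  # e.g. dummy SSN-like test data
    clientToken=str(uuid.uuid4()),
)["id"]

# Both IDs can then be passed to create_classification_job via
# customDataIdentifierIds=[custom_id] and allowListIds=[allow_list_id].
print(custom_id, allow_list_id)
```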

The AWS Macie – Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as regular expressions for data sensitivity and suppression rules within the data fabrics they work on, data engineers gain greater control over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

Data Science Strategies for Effective Process System Maintenance
https://www.tigeranalytics.com/perspectives/blog/harness-power-data-science-maintenance-process-systems/ | Mon, 20 Dec 2021

Industry understanding of managing planned maintenance is fairly mature. This article focuses on how Data Science can impact unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and subsystems.

Data Science applications are gaining significant traction in the preventive and predictive maintenance of process systems across industries. A clear mindset shift has made it possible to steer maintenance from a ‘reactive’, run-to-failure approach to one that is proactive and preventive in nature.

Planned or scheduled maintenance uses data and experiential knowledge to determine the periodicity of servicing required to maintain the plant components’ good health. These are typically driven by plant maintenance teams or OEMs through maintenance rosters and AMCs. Unplanned maintenance, on the other hand, occurs at random and impacts downtime/production, safety, inventory, and customer sentiment, besides adding to the cost of maintenance (including labor and material).

Interestingly, statistics reveal that almost 50% of scheduled maintenance projects are unnecessary and almost a third of them are improperly carried out. Poor maintenance strategies are known to cost organizations as much as 20% of their production capacity, shaving off the benefits that a move from a reactive to a preventive maintenance approach would provide. Despite years of expertise in managing maintenance activities, unplanned downtime impacts almost 82% of businesses at least once every three years. Given the significant impact on production capacity, aggregated annual downtime costs for the manufacturing sector are upwards of $50 billion (WSJ), with average hourly costs of unplanned maintenance in the range of $250K.

It is against this backdrop that data-driven solutions need to be developed and deployed. Can Data Science solutions bring about significant improvement in the maintenance domain and prevent any or all of the above costs? Are the solutions scalable? Do they provide an understanding of what went wrong? Can they provide insights into alternative and improved ways to manage planned maintenance activities? Does Data Science help reduce all types of unplanned events or just a select few? These are questions that manufacturers need answered, and it is for experts from both the maintenance and data science domains to address them.

Industry understanding of managing planned maintenance is fairly mature. This article therefore focuses on unplanned maintenance, which demands a differentiated approach to build insight and understanding around the process and subsystems.

Data Science solutions are accelerating the industry’s move towards ‘on-demand’ maintenance, wherein interventions are made only if and when required. Rather than follow a fixed maintenance schedule, data science tools can now help plants increase run lengths between maintenance cycles, in addition to improving plant safety and reliability. Besides the direct benefits of reduced unplanned downtime and maintenance cost, operating equipment at higher levels of efficiency improves the overall economics of operation.

The success of this approach was demonstrated in refinery CDU preheat trains that use soft sensing triggers to decide when to process ‘clean crude’ (to mitigate the fouling impact) or schedule maintenance of fouled exchangers. Other successes were in the deployment of plant-wide maintenance of control valves, multiple-effect evaporators in plugging service, compressors in petrochemical service, and a geo-wide network of HVAC systems.

Instead of using a fixed roster for maintenance of PID control valves, plants can now detect and diagnose control valves that are malfunctioning. Additionally, in combination with domain and operations information, it can be used to suggest prescriptive actions such as auto-tuning of the valves, which improve maintenance and operations metrics.

Reducing unplanned, unavoidable events

It is important to bear in mind that not all unplanned events are avoidable. The inability to avoid events could be either because they are not detectable enough or because they are not actionable. The latter could occur either because the response time available is too low or because the knowledge to revert a system to its normal state does not exist. A large number of unplanned events however are avoidable, and the use of data science tools improves their detection and prevention with greater accuracy.

The focus of the experts working in this domain is to reduce unplanned events and transition events from unavoidable to avoidable. Using advanced tools for detection and diagnosis, and enabling timely actions, companies have managed to reduce their downtime costs significantly. The diversity of solutions available in the maintenance area covers both plant and process subsystems.

Some of the data science techniques deployed in the maintenance domain are briefly described below:

Condition Monitoring
This has been used to monitor and analyze process systems over time and predict the occurrence of an anomaly. These events or anomalies could have short or long propagation times, such as those seen in exchanger fouling or pump cavitation. The spectrum of solutions in this area includes real-time/offline modes of analysis, edge/IoT devices, open/closed-loop prescriptions, and more. In some cases, monitoring also involves the use of soft sensors to detect fouling, surface roughness, or hardness; these parameters cannot be measured directly using a sensor and therefore need surrogate measuring techniques.

Perhaps the most distinctive challenge of working in the manufacturing domain is the need for data reconciliation. Sensor data tend to be spurious and prone to operational fluctuations, drift, biases, and other errors. Using raw sensor information is unlikely to satisfy the material and energy balances for process units. Data reconciliation uses a first-principles understanding of the process systems and assigns a ‘true value’ to each sensor. These revised sensor values allow a more rigorous approach to condition monitoring, which would otherwise expose process systems to greater risk when using raw sensor information. Sensor validation, a technique to analyze individual sensors in tandem with data reconciliation, is critical to setting a strong foundation for any analytics models to be deployed. These elaborate areas of work ensure a greater degree of success when deploying any solution that involves the use of sensor data.
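To make the idea concrete, below is a minimal, illustrative sketch of linear data reconciliation with made-up flow readings and a single mass balance; production deployments use full flowsheet balances and sensor-specific error models.

```python
# Illustrative sketch of linear data reconciliation (synthetic numbers):
# adjust raw flow-meter readings so they satisfy a mass balance exactly,
# weighting each adjustment by the sensor's assumed variance.
import numpy as np

# Measured flows (t/h) for a unit where stream 1 splits into streams 2 and 3.
y = np.array([100.0, 61.0, 42.0])          # raw sensor readings
sigma = np.diag([2.0**2, 1.0**2, 1.5**2])  # assumed measurement variances

# Balance constraint A @ x = 0  ->  x1 - x2 - x3 = 0
A = np.array([[1.0, -1.0, -1.0]])

# Weighted least-squares reconciliation:
#   x_hat = y - Sigma A^T (A Sigma A^T)^-1 A y
correction = sigma @ A.T @ np.linalg.solve(A @ sigma @ A.T, A @ y)
x_hat = y - correction

print("reconciled flows:", x_hat)        # satisfies the balance exactly
print("balance residual:", A @ x_hat)    # ~0
```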

Fault Detection
This is a mature area of work, with solutions ranging from those driven entirely by domain knowledge, such as pump curves and the detection of anomalies against them, to those that rely only on historical sensor/maintenance/operations data for analysis. An anomaly or fault is defined as a deviation from ‘acceptable’ operation, but the context and definitions need to be clearly understood when working with different clients. Faults may be related to equipment, quality, plant systems, or operability. A good business context and understanding of client requirements are necessary for the design and deployment of the right techniques. From basic tools that use sensor thresholds and run charts to more advanced techniques such as classification, pattern analysis, and regression, a wide range of solutions can be successfully deployed.
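As a simple illustration of the data-driven end of that spectrum, the sketch below trains a fault classifier on synthetic sensor features; in practice the features and labels come from plant historians and maintenance logs.

```python
# Illustrative sketch: a supervised fault classifier on synthetic sensor
# features labelled with fault records (real data comes from the plant).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(70, 5, n),    # bearing temperature
    rng.normal(3, 0.5, n),   # vibration RMS
    rng.normal(180, 10, n),  # discharge pressure
])
# Synthetic label: "fault" when temperature and vibration are jointly high.
y = ((X[:, 0] > 75) & (X[:, 1] > 3.4)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```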

Early Warning Systems
The detection of process anomalies in advance helps in the proactive management of abnormal events. Improving actionability or response time allows faults to be addressed before setpoints/interlocks are triggered. The methodology varies across projects and there is no ‘one-size-fits-all’ approach. Problem complexity could range from using single sensor information as lead indicators (such as using sustained pressure loss in a vessel to identify a faulty gasket that might rupture) to far more complex methods of analysis.

A typical challenge in developing early warning systems is achieving 100% detectability of anomalies; an even larger challenge is filtering out false indications. Detecting all anomalies while robustly suppressing false alarms is critical for successful deployment.
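A minimal sketch of this idea is shown below: an EWMA drift statistic on a single sensor combined with a persistence filter so that isolated spikes do not raise alarms. The thresholds, window, and synthetic data are assumptions for illustration only.

```python
# Illustrative sketch: early-warning rule on one sensor using an EWMA drift
# statistic plus a persistence filter to suppress one-off spikes.
import numpy as np

def ewma_alarms(x, lam=0.1, z_thresh=3.0, persistence=5):
    mu, var = x[:200].mean(), x[:200].var()      # baseline from "healthy" data
    ewma, count, alarms = mu, 0, []
    sigma_ewma = np.sqrt(var * lam / (2 - lam))  # asymptotic EWMA std dev
    for t, xi in enumerate(x):
        ewma = lam * xi + (1 - lam) * ewma
        drifted = abs(ewma - mu) > z_thresh * sigma_ewma
        count = count + 1 if drifted else 0
        if count >= persistence:                 # require sustained deviation
            alarms.append(t)
    return alarms

rng = np.random.default_rng(1)
healthy = rng.normal(50, 1, 600)
drifting = 50 + np.linspace(0, 4, 400) + rng.normal(0, 1, 400)  # slow fouling-like drift
signal = np.concatenate([healthy, drifting])
print("first early-warning index:", ewma_alarms(signal)[:1])
```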

Enhanced Insights for Fault Identification
The importance of detection and response time in the prevention of an event cannot be overstated. But what if an incident is not easy to detect, or the propagation of the fault is too rapid to allow any time for action? The first level involves using machine-driven solutions for detection, such as computer vision models, which are rapidly changing the landscape. Using these models, it is now possible to improve prediction accuracies for processes that were previously unmonitored or monitored manually. The second is to integrate the combined expertise of personnel from various job functions such as technologists, operators, maintenance engineers, and supervisors. At this level of maturity, the solution is able to baseline against the best that current operations aim to achieve. The third, and by far the most complex, is to move more faults into the ‘detectable’ and actionable realm. One such case was witnessed in a complex process in the metal smelting industry. Advanced data science techniques using a digital twin amplified signal responses and analyzed multiple process parameters to predict the occurrence of an incident ahead of time. By gaining an order-of-magnitude improvement in response time, it was possible to move the process fault from an unavoidable to an avoidable and actionable category.

With the context provided above, it is possible to choose a modeling approach and customize the solutions to suit the problem landscape:

[Image: Data analytics in process system maintenance]

Different approaches to Data Analytics

Domain-driven solution
First-principles and rule-based approaches are examples of domain-driven solutions. Traditional ways of delivering solutions for manufacturing often involve computationally intensive methods (such as process simulation, modeling, and optimization). In one difficult-to-model plant, deployment was done using rule engines that allow domain knowledge and experience to determine patterns and cause-effect relationships. Alarms were triggered and advisories/recommendations were sent to the concerned stakeholders on what specific actions to take each time the model identified an impending event.

Domain-driven approaches also come in handy in the case of ‘cold start’ where solutions need to be deployed with little or no data availability. In some deployments in the mechanical domain, the first-principles approach helped identify >85% of the process faults even at the start of operations.

Pure data-driven solutions
A recent trend in the process industry is the move away from domain-driven solutions, due to challenges in finding the right skills to deploy them, computation infrastructure requirements, the need for customized maintenance solutions, and the requirement to provide real-time recommendations. Complex systems such as naphtha cracking and alumina smelting, which are hard to model, have harnessed the power of data science not just to diagnose process faults but also to enhance response time and bring more finesse to the solutions.

In some cases, these data-driven tools have provided high levels of accuracy in analyzing faults. One such case was related to compressor faults, where historical data was used to classify them as arising from a loose bearing, a defective blade, or polymer deposits in the turbine subsystems. Each of these faults was identified using the sensor signatures and patterns associated with it. Besides getting to the root cause, this also helped prescribe action to move the compressor system away from anomalous operation.

These solutions need to ensure that the operating envelope and data availability cover all possible scenarios. The poor success of deployments using this approach is largely due to insufficient data covering plant operations and maintenance. However, the number of players offering purely data-driven solutions is large and is fast replacing what was traditionally part of a domain engineer’s playbook.

Blended solutions
Blended solutions for the maintenance of process systems combine the understanding of both data science and the domain. One such project was the real-time monitoring and preventive maintenance of >1200 HVAC units across a large geographic area. Domain rules were used to detect and diagnose faults and to identify operating scenarios, improving the reliability of the solutions. A good understanding of the domain helps in isolating multiple anomalies, reducing false positives, suggesting the right prescriptions, and, more importantly, in the interpretability of the data-driven solutions.

The differentiation comes from combining the intelligence of AI/ML models with domain knowledge and deployment experience, all integrated into the model framework.

Customizing the toolkit and determining the appropriate modeling approach are critical to delivery. The uniqueness of each plant and problem, and the requirement for a high degree of customization, make the deployment of solutions in a manufacturing environment fairly challenging. This fact is validated by the limited number of solution providers serving this space. However, the complexity and nature of the landscape need to be well understood by both the client and the service provider. It is important to note that not all problems in the maintenance space are ‘big data’ problems requiring analysis in real time using high-frequency data. Some faults with long propagation times can use values averaged over a period of time, while other systems with short response-time requirements may need real-time data. Where maintenance logs and annotations related to each event (and corrective action) are recorded, one could go with a supervised learning approach, but this is not always possible. In cases where data on faults and anomalies is not available, a one-class approach that classifies the operation into normal/abnormal modes has also been used. Solution maturity improves as more data and failure modes are identified over time.
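For illustration, here is a minimal sketch of the one-class approach mentioned above, trained only on data from normal operation; the model choice, parameters, and synthetic data are assumptions.

```python
# Illustrative sketch of a one-class model: train on "normal" operation only
# and flag deviations, useful when labelled failure data is scarce.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
normal_ops = rng.normal([70, 3.0], [2, 0.2], size=(1000, 2))     # temp, vibration
new_data = np.vstack([rng.normal([70, 3.0], [2, 0.2], size=(50, 2)),
                      rng.normal([80, 4.0], [2, 0.2], size=(5, 2))])  # a few anomalies

scaler = StandardScaler().fit(normal_ops)
model = OneClassSVM(nu=0.01, gamma="scale").fit(scaler.transform(normal_ops))

labels = model.predict(scaler.transform(new_data))   # +1 = normal, -1 = abnormal
print("flagged abnormal points:", int((labels == -1).sum()))
```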

A staged solution approach helps in bringing in the right level of complexity to deliver solutions that evolve over time. Needless to say, it takes a lot of experience and prowess to marry the generalized understanding with the customization that each solution demands.

Edge/IoT

A fair amount of investment needs to be made at the beginning of the project to understand the hardware and solution architecture required for successful deployment. While data security is a primary consideration, factors such as computational power, cost, time, response time, and open/closed-loop architecture are additional considerations in determining the solution framework. Experience and knowledge help in understanding additional sensing requirements and sensor placement, performance enhancement through edge/cloud-based solutions, data privacy, synchronicity with other process systems, and much more.

By far the largest challenge is on the data front (sparse, scattered, unclean, disorganized, unstructured, not digitized, and so on), which prevents businesses from seeing quick success. Digitization and creating the data repositories that set the foundation for model development take a lot of time.

There is also a multitude of control systems, specialized infrastructure, and legacy systems within the same manufacturing complex that one may need to work through. End-to-end delivery, with its front-end complexity in data management, creates a significant entry barrier for service providers in the maintenance space.

Maintenance cuts across multiple layers of a process system. Maintenance solutions vary as one moves from a sensor to a control loop, to equipment with multiple control valves, all the way to the flowsheet/enterprise layer. Maintenance across these layers requires a deep understanding of both hardware and process aspects, a combination that is often hard to put together. Sensors and control valves are typically maintained by those with an instrumentation background, while equipment maintenance could fall in a mechanical or chemical engineer’s domain. On the other hand, process anomalies that could have a plant-level impact are often in the domain of operations/technology experts or process engineers.

Data Science facilitates the development of the insights and generalizations required to build understanding around a complex topic like maintenance. It helps in generalizing and translating learnings across layers within process systems, from sensors all the way to the enterprise, and across industry domains as well. It is a matter of time before analytics-driven solutions that help maintain safe and reliable operations become an integral part of plant operations and maintenance systems. We should aim for the successes we witness in the medical diagnostics domain, where intelligent machines are capable of detecting and diagnosing anomalies. We hope that similar analytics solutions will go a long way in keeping plants safe, reducing downtime, and providing the operational efficiencies that a sustainable world demands.

Today, the barriers to success lie in the ability to develop a clear understanding of the problem landscape, plan end-to-end, and deliver customized solutions that take into account business priorities and ROI. Achieving success at a large scale will demand reducing the level of customization required in each deployment, a constraint that few subject matter experts in the area can overcome today.

Defining Financial Ethics: Transparency and Fairness in Financial Institutions’ use of AI and ML
https://www.tigeranalytics.com/perspectives/blog/transparency-financial-institutions-use-artificial-intelligence-machine-learning/ | Fri, 10 Dec 2021

While time, cost, and efficiency have seen drastic improvement thanks to AI/ML, concerns over transparency, accountability, and inclusivity prevail. This article provides important insight into how financial institutions can maintain a sense of clarity and inclusiveness.

The last few years have seen a rapid acceleration in the use of disruptive technologies such as Machine Learning and Artificial Intelligence in financial institutions (FIs). Improved software and hardware, coupled with a digital-first outlook, have led to a steep rise in the use of such applications to advance outcomes for consumers and businesses alike.

By embracing AI/ML, the early adopters in the industry have been able to streamline decision processes involving large amounts of data, avoid bias, and reduce the chances of error and fraud. Even the more traditional banks are investing in AI systems that use state-of-the-art ML and deep learning algorithms, paving the way for quicker and better reactions to changing consumer needs and market dynamics.

The Covid-19 pandemic has only made the use of AI/ML-based tools more widespread and easily scalable across sectors. At Tiger Analytics, we have been at the heart of the action and have assisted several clients in reaping the benefits of AI/ML across the value chain.
Pilot use cases where FIs have seen success using AI/ML-based solutions include:

  • Smarter risk management
  • Real-time investment advice
  • Enhanced access to credit
  • Automated underwriting
  • Intelligent customer service and chatbots

The challenges

While time, cost, and efficiency have seen drastic improvement thanks to AI/ML, concerns over transparency, accountability, and inclusivity prevail. Given how highly regulated and impactful the industry is, it becomes pertinent to maintain a sense of clarity and inclusiveness.
Problems in governance of AI/ML:

  • Transparency
  • Fairness
  • Bias
  • Reliability/soundness
  • Accountability

How can we achieve this? By, first and foremost, finding and evaluating safe and responsible ways to integrate AI/ML into everyday processes to better suit the needs of clients and customers.

By making certain guidelines uniform and standardized, we can set the tone for successful AI/ML implementation. This involves robust internal governance processes and frameworks, as well as timely interventions and checks, as outlined in Tiger’s response document and comments to the regulatory agencies in the US.

These checks become even more relevant where regulatory standards or guidance on the use of AI in FIs are inadequate. However, efforts are being made to hold FIs to some kind of standard.

The table below illustrates the issuance of AI guidelines across different countries:

[Table: Issuance of AI guidelines across different countries]

Source: FSI Insights on Policy Implementation No. 35, By Jeremy Prenio & Jeffrey Yong, August 2021

Supervisory guidelines and regulations must be understood and customized to suit the needs of the various sectors.

To overcome these challenges, this step of creating uniform guidance by the regulatory agencies is essential: it opens up a dialogue on the usage of AI/ML-based solutions and brings in different and diverse voices from the industry to share their triumphs and concerns.

Putting it out there

As a global analytics firm that specializes in creating bespoke AI and ML-based solutions for a host of clients, we at Tiger recognize the relevance of a framework of guidelines that builds trust and responsibility.

It was this intention of bringing in more transparency that led us to put forward our response to the Request for Information and Comment on Financial Institutions’ Use of Artificial Intelligence, including Machine Learning (RFI) by the following agencies:

  • Board of Governors of the Federal Reserve System (FRB)
  • Bureau of Consumer Financial Protection (CFPB)
  • Federal Deposit Insurance Corporation (FDIC)
  • National Credit Union Administration (NCUA) and,
  • Office of the Comptroller of the Currency (OCC)

Our response to the RFI is structured in such a way that it is easily accessible even to those without academic or technical knowledge of AI/ML. We have kept the conversation generic, steering away from deep technical jargon in our views.

Ultimately, we recognize that the role of regulations around models involving AI and ML is to create fairness and transparency for everyone involved.

Transparency and accountability are foundation stones at Tiger too, which we apply while developing powerful AI and ML-based solutions for our clients, be they large or community banks, credit unions, fintechs, or other financial services firms.

We are eager to see the outcome of this exercise and hope that it will result in consensus and uniformity of definitions, help in distinguishing facts from myth, and allow for a gradation of actual and perceived risks arising from the use of AI and ML models.

We hope that our response not only highlights our commitment to creating global standards in AI/ML regulation, but also echoes Tiger’s own work culture and belief system of fairness, inclusivity, and equality.

Want to learn more about our response? Refer to our recent interagency submission.

You can download Tiger’s full response here.

Suez Canal Crisis & Building Resilient Supply Chains https://www.tigeranalytics.com/perspectives/blog/suez-canal-crisis-building-resilient-supply-chains/ https://www.tigeranalytics.com/perspectives/blog/suez-canal-crisis-building-resilient-supply-chains/#comments Thu, 01 Apr 2021 17:51:24 +0000 https://www.tigeranalytics.com/?p=5073 The Suez Canal crisis was a catalyst for change in supply chain management. In this piece, we explore how leading companies are using AI, analytics, and digital twins to build more resilient, agile supply chains. Discover how proactive planning and smart technology can turn disruption into a competitive advantage.

The Suez Canal crisis has brought the discourse on supply chain resilience back into focus. The incident comes at a time when global supply chains are inching back to normalcy in the hope that Covid-19 vaccinations will help the economy bounce back. Considering that the canal carries about 10% to 12% of global trade, logistics will take time to recover even though the crisis is now resolved.

The Cascading Impact

Even though the Suez Canal blockage may not be as significant as the Covid-19 disruptions, it will take months to remove the pressure points in the global supply chain. In a world of interconnected global supply chains, the choking of a significant artery such as the Suez Canal has a cascading effect: delayed deliveries to consumers, rising prices due to shortages, loss of efficiency at factories due to short supply, and increased pressure on intermodal/road transportation when traffic ramps up.

In the US market, the east coast ports will bear the brunt of the fallout. Data shows that nearly a third of imports into the east coast are routed via the Suez Canal. In the near term, there will be a lull period followed by an inbound rush when the backlog of delayed shipments arrives, stressing the logistics network.

 

 

This is not the first accident of its kind; it’s likely not the last either. Given this reality, companies would do well to build resilience in the supply chains proactively.

Strategies for Supply Chain Resilience

Companies have used several different strategies to mitigate the risk to supply chains.

– Multi-Geo Manufacturing – Developments such as the straining of the US/China relationship and the disruption caused by Covid-19 have led many firms to look at alternate manufacturing locations outside China, such as India.

– Multi-Sourcing – Dual or more diversified supplier bases for critical raw materials or components.

– In-Sourcing / Near Shoring – Companies have started to build regional sourcing within the Americas or even in-house to mitigate the risk. One of our clients is exploring this option for critical products with much closer/tighter integration across the value chain.

– Inventory and Capacity Buffers – Moving away from the lean supply chain’s traditional mindset, customers are increasing the inventory and capacity buffers. One of our manufacturing clients had doubled down on stocks early last year to mitigate any supply risk due to Covid-19.

– Flexible Distribution – Companies are adopting multiple modes of transportation such as air and rail so that they have a backup in case of disruption of one of the modes of transportation. They are also moving warehouses closer to the customer.

 How Analytics can enable the resilience journey

The strategies elaborated in the previous section imply that there will be an additional cost of building the necessary redundancy rather than going with a lean principles approach. Most companies have accepted this additional expenditure since the risk of not doing it far outweighs the cost of redundancy. When supply chains become complex with multiple paths for product flow, analytics can help keep the operations nimble and make the right decision to balance cost and service levels. Analytics can enhance two types of capabilities:

– Operational Capabilities are primarily focused on risk containment. When the risk event is expected to occur or has occurred, machine learning models can generate real-time alerts and insights for the supply chain operations teams to take the next best actions. For example:

–  Freight Pricing Impact: One of our logistics clients uses a pricing model for truckload equipment. We designed this model to look at the demand/supply imbalance at the origin/destination and predict prices accordingly. US East Coast ports are expected to see a surge in inbound containers once the Suez Canal blockage eases, and transportation prices will increase when demand is higher than supply. Visibility into pricing helps our client secure capacity upfront at non-peak pricing and ensure timely delivery to its customers.

–  On-Time in Full (OTIF) Risk Prediction: One of our manufacturing clients uses an ML tool that predicts ‘OTIF miss risk’ at each order level. We built in recommendations on which levers can be used to meet the SLA or reduce the penalty, e.g., pick/pack/load priority in the warehouse or air freight (a simple illustrative sketch appears after this list).

–  Risk Event Prediction: Risk data related to natural disasters, political strikes, labor disputes, financial events, environmental events, etc., can be tied to the enterprise supply chain. One of our clients uses risk models to simulate the impact of various risks on their supply chains and better plan responses.

–  Strategic Capabilities are focused on avoiding risk impact and enabling faster recovery. A component of this capability is a Digital Twin Supply Chain, which mirrors the physical network. Some of our clients use digital twins to do both mid-term and long-term risk planning involving some of the below activities:

–  Assessing current network and identifying potential risk areas.

–  Scenario planning and risk & cost analysis to provide inputs into Sales & Operation Planning.

–  Planning and building long-term approaches such as multiple sourcing or multi manufacturing.

–  Revamping the supplier and distribution networks – Integrating supplier/carrier scorecard, cost, etc., into the network data to visualize multiple options’ tradeoffs.

–  Pressure testing design choices at various levels, e.g., the impact on missed orders, delays, and inventory levels if a particular site went down, or how long it would take to initiate the contingency plan and what the interim impact would be.
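For illustration, the sketch below shows the general shape of an order-level OTIF-miss risk model; the feature names, synthetic data, and model choice are assumptions and not the client's actual implementation.

```python
# Illustrative sketch of an order-level OTIF-miss risk model (synthetic data;
# feature names are hypothetical).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
orders = pd.DataFrame({
    "order_qty":        rng.integers(10, 500, n),
    "lead_time_days":   rng.integers(1, 15, n),
    "warehouse_load":   rng.uniform(0.2, 1.0, n),
    "carrier_otp_rate": rng.uniform(0.75, 0.99, n),
})
# Synthetic outcome: misses are more likely with high warehouse load,
# weaker carriers, and shorter lead times.
logit = (-3.0 + 3.5 * orders["warehouse_load"]
         + 8.0 * (0.9 - orders["carrier_otp_rate"])
         - 0.1 * orders["lead_time_days"])
orders["otif_miss"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X, y = orders.drop(columns="otif_miss"), orders["otif_miss"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Order-level risk scores can then drive recommendations
# (pick/pack/load priority, air freight, proactive customer communication).
print("holdout AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```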

 Conclusion

Recent developments have acted as catalysts for an already growing affinity for AI and analytics. Gartner states that by 2023, at least 50% of large global companies will be using AI, advanced analytics, and IoT in supply chain operations, gearing towards a digital twin model.

The companies which are agile and can respond to rapidly changing conditions are the ones that will survive increasingly frequent disruptions and add real value to customers and communities. AI & Analytics will be key enablers in building resilient supply chains that are proactive, agile, and maintain a balance between various tradeoffs.

Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins
https://www.tigeranalytics.com/perspectives/blog/ml-powered-digital-twin-predictive-maintenance/ | Thu, 24 Dec 2020

Tiger Analytics leverages ML-powered digital twins for predictive maintenance in manufacturing. By integrating sensor data and other inputs, we enable anomaly detection, forecasting, and operational insights. Our modular approach ensures scalability and self-sustainability, yielding cost-effective and efficient solutions.

Historically, manufacturing equipment maintenance has been done during scheduled service downtime. This involves periodically stopping production to carry out routine inspections, maintenance, and repairs. Unexpected equipment breakdowns disrupt the production schedule, require expensive part replacements, and delay the resumption of operations due to long procurement lead times.

Sensors that measure and record operational parameters (temperature, pressure, vibration, RPM, etc.) have been affixed on machinery at manufacturing plants for several years. Traditionally, the data generated by these sensors was compiled, cleaned, and analyzed manually to determine failure rates and create maintenance schedules. But every equipment downtime for maintenance, whether planned or unplanned, is a source of lost revenue and increased cost. The manual process was time-consuming, tedious, and hard to handle as the volume of data rose.

The ability to predict the likelihood of a breakdown can help manufacturers take pre-emptive action to minimize downtime, keep production on track, and control maintenance spending. Recognizing this, companies are increasingly building both reactive and predictive computer-based models on top of sensor data. The challenge these models face is the lack of a standard framework for creating and selecting the right one. Model effectiveness largely depends on the skill of the data scientist. Each model must be built separately, model selection is constrained by time and resources, and models must be updated regularly with fresh data to sustain their predictive value.

As more equipment types come under the analytical ambit, this approach becomes prohibitively expensive. Further, the sensor data is not always leveraged to its full potential to detect anomalies or provide early warnings about impending breakdowns.

In the last decade, the Industrial Internet of Things (IIoT) has revolutionized predictive maintenance. Sensors record operational data in real-time and transmit it to a cloud database. This dataset feeds a digital twin, a computer-generated model that mirrors the physical operation of each machine. The concept of the digital twin has enabled manufacturing companies not only to plan maintenance but to get early warnings of the likelihood of a breakdown, pinpoint the cause, and run scenario analyses in which operational parameters can be varied at will to understand their impact on equipment performance.

Several eminent ‘brand’ products exist to create these digital twins, but the software is often challenging to customize, cannot always accommodate the specific needs of each and every manufacturing environment, and significantly increases the total cost of ownership.

ML-powered digital twins can address these issues when they are purpose-built to suit each company’s specific situation. They are affordable, scalable, self-sustaining, and, with the right user interface, are extremely useful in telling machine operators the exact condition of the equipment under their care. Before embarking on the journey of leveraging ML-powered digital twins, certain critical steps must be taken:

1. Creation of an inventory of the available equipment, associated sensors and data.

2. Analysis of the inventory in consultation with plant operations teams to identify the gaps. Typical issues may include missing or insufficient data from the sensors; machinery that lacks sensors; and sensors that do not correctly or regularly send data to the database.

3. Coordination between the manufacturing operations and analytics/technology teams to address some gaps: installing sensors if lacking (‘sensorization’); ensuring that sensor readings can be and are being sent to the cloud database; and developing contingency approaches for situations in which no data is generated (e.g., equipment idle time).

4. A second readiness assessment, followed by a data quality assessment, must be performed to ensure that a strong foundation of data exists for solution development.

This creates the basis for a cloud-based, ML-powered digital twin solution for predictive maintenance. To deliver the most value, such a solution should:

  • Use sensor data in combination with other data as necessary
  • Perform root cause analyses of past breakdowns to inform predictions and risk assessments
  • Alert operators of operational anomalies
  • Provide early warnings of impending failures
  • Generate forecasts of the likely operational situation
  • Be demonstrably effective to encourage its adoption and extensive utilization
  • Be simple for operators to use, navigate and understand
  • Be flexible to fit the specific needs of the machines being managed

[Image: Predictive maintenance cycle]

When model-building begins, the first step is to account for the input data frequency. As sensors take readings at short intervals, timestamps must be regularized and the data resampled for all connected parameters where required. At this stage, data with very low variance or too few observations may be excised. Model datasets containing sensor readings (the predictors) and event data such as failures and stoppages (the outcomes) are then created for each machine using both dependent and independent variable formats.
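A minimal sketch of the timestamp regularization and resampling step, using pandas, is shown below; the column names and the one-minute grid are assumptions.

```python
# Illustrative sketch: regularize irregular sensor timestamps onto a fixed
# grid and bridge short gaps (column names and grid size are assumptions).
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 00:00:07", "2020-01-01 00:00:58",
        "2020-01-01 00:02:03", "2020-01-01 00:05:10",
    ]),
    "temperature": [71.2, 71.5, 72.1, 74.0],
    "vibration":   [3.1, 3.0, 3.3, 3.9],
}).set_index("timestamp")

# Average readings within each 1-minute interval, then fill short gaps
# (e.g. brief comms dropouts) by interpolation.
regular = (
    raw.resample("1min").mean()
       .interpolate(limit=3)   # only bridge short gaps
)
print(regular)
```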

To select the right model for anomaly detection, multiple models are tested and scored on the full data set and validated against history. To generate a short-term forecast, gaps related to machine testing or idle time must be accounted for, and a range of models evaluated to determine which one performs best.

Tiger Analytics used a similar approach when building these predictive maintenance systems for an Indian multinational steel manufacturer. Here, we found that regression was the best approach to flag anomalies. For forecasting, the accuracy of Random Forest models was higher than that of ARIMA, ARIMAX, and exponential smoothing.
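The sketch below illustrates that kind of comparison on a synthetic sensor series: a Random Forest on lagged values (predicting one step ahead) versus a simple ARIMA baseline. It is illustrative only; the actual client models and features were more elaborate.

```python
# Illustrative sketch: Random Forest on lag features vs. an ARIMA baseline
# on a synthetic sensor series (one-step-ahead RF vs. multi-step ARIMA).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
t = np.arange(1200)
series = 50 + 0.01 * t + 2 * np.sin(t / 24) + rng.normal(0, 0.5, len(t))

df = pd.DataFrame({"y": series})
for lag in (1, 2, 3, 24):                       # simple lag features
    df[f"lag_{lag}"] = df["y"].shift(lag)
df = df.dropna()

train, test = df.iloc[:-100], df.iloc[-100:]
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(train.drop(columns="y"), train["y"])
rf_mae = mean_absolute_error(test["y"], rf.predict(test.drop(columns="y")))

arima = ARIMA(train["y"], order=(2, 1, 2)).fit()
arima_mae = mean_absolute_error(test["y"], arima.forecast(steps=len(test)))

print(f"Random Forest MAE: {rf_mae:.3f}  |  ARIMA MAE: {arima_mae:.3f}")
```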

[Image: Predictive maintenance analysis flow]

Using a modular paradigm to build an ML-powered digital twin makes it straightforward to implement and deploy. It does not require frequent manual recalibration, making it self-sustaining, and it is scalable, so it can be implemented across a wide range of equipment with minimal additional effort and time.

Careful execution of the preparatory actions is as important as strong model-building to the success of this approach and its long-term viability. To address the challenge of low-cost, high-efficiency predictive maintenance in the manufacturing sector, employ this sustainable solution: a combination of technology, business intelligence, data science, user-centric design, and the operational expertise of the manufacturing employees.

This article was first published in Analytics India Magazine.

Enhancing Mental Healthcare: Machine Learning’s Role in Clinical Trials
https://www.tigeranalytics.com/perspectives/blog/enhancing-mental-healthcare-machine-learnings-role-in-clinical-trials/ | Thu, 29 Oct 2020

Unveil the pivotal role of machine learning in revolutionizing mental health care through advanced clinical trials. Discover how innovative AI solutions, like speech analytics, enhance the evaluation of mental health treatments, contributing to more accurate and efficient healthcare outcomes.

World Mental Health Day on 10th October casts a long-overdue spotlight on one of the most neglected areas of public health. Nearly a billion people have a mental disorder, and a suicide occurs every 40 seconds. In developing countries, under 25% of people with mental, substance use, or neurological disorders receive treatment [1]. COVID-19 has worsened the crisis; with healthcare services disrupted, the hidden pandemic of mental ill-health remains largely unaddressed.

In this article, we share some perspectives on the role ML can play and an example of a real-life AI solution we built at Tiger Analytics to address a specific mental-health-related problem.

ML is already a Part of Physical Healthcare

Algorithms process Magnetic Resonance Imaging (MRI) scans. Clinical notes are parsed to pinpoint the onset of illnesses earlier than physicians can discern them. Cardiovascular disease and diabetes, two of the leading causes of death worldwide, are diagnosed using neural networks, decision trees, and support vector machines. Clinical trials are monitored and assessed remotely to maintain physical distancing protocols.

These are ‘invasive’ approaches, with the objective of automating what can be, and usually is, done by humans, but at speed and scale. In the field of mental health, ML can be applied in non-invasive, more humanistic ways that nudge physicians towards better treatment strategies.

Clinical Trials of Mental Health Drugs

In clinical trials of mental health drugs, physicians and patients engage in detailed discussions of the patients’ mental state at each treatment stage. The efficacy of these drugs is determined using a combination of certain biomarkers, body vitals, and mental state as determined by the patient’s interaction with the physician.

The problem with the above approach is that an important input in determining drug efficacy is the responses of a person who has been going through mental health issues. To avoid errors, these interviews/interactions are recorded, and multiple experts listen to the long recordings to evaluate the quality of the interview and the conclusions made.

Two concerns arise: first, time and budget allow only a sample of interviews to be evaluated, which means there is an increased risk of fallacious conclusions regarding drug efficacy; and second, patients may not express all they are feeling in words. A multitude of emotions may be missed or misinterpreted, generating incorrect evaluation scores.

The Problem that Tiger Team Tackled

Working with a pharmaceutical company, Tiger Analytics used speech analytics to identify ‘good’ interviews, i.e., ones that meet quality standards for inclusion in clinical trials, minimizing the number of interviews that were excluded after evaluation, and saving time and expense.

As a data scientist, the typical challenges you face when working on a problem such as this are: What types of signal processing can you use to extract audio features? What non-audio features would be useful? How do you remove background noise from the interviews? How do you look for patterns in language? How do you account for reviewers’ biases, inevitable in subjective events like interviews?

Below we walk you through the process the Tiger Analytics team used to develop the solution.

[Image: Mental health and machine learning]

Step 1: Pre-processing

We removed background noise from the digital audio files and split them into alternating sections of speech and silence. We grouped the speech sections into clusters, each cluster representing one speaker. We created a full transcript of the interview to enable language processing.

Step 2: Feature extraction

We extracted several hundred features of the audio, from direct aspects like interview duration and voice amplitude to the more abstract speech rates, frequency-wise energy content, and Mel-frequency cepstral coefficients (MFCCs). We used NLP to extract several features from the interview transcript. These captured the unique personal characteristics of individual speakers.

Beyond this, we captured features such as interview length, tone of the interviewer, any gender-related patterns, interview load on the physician, time of the day, and many more features.
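For illustration, a minimal sketch of audio feature extraction with librosa is shown below; the file path is hypothetical, and the real pipeline extracted far more audio, transcript, and metadata features.

```python
# Illustrative sketch of audio feature extraction (hypothetical file path).
import numpy as np
import librosa

y, sr = librosa.load("interview_001.wav", sr=16000)

features = {
    "duration_s": librosa.get_duration(y=y, sr=sr),
    "rms_mean": float(np.mean(librosa.feature.rms(y=y))),
    "zcr_mean": float(np.mean(librosa.feature.zero_crossing_rate(y))),
}

# 13 MFCCs summarized by their mean and standard deviation over time.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
for i, (m, s) in enumerate(zip(mfcc.mean(axis=1), mfcc.std(axis=1))):
    features[f"mfcc{i}_mean"], features[f"mfcc{i}_std"] = float(m), float(s)

print(features)
```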

Step 3: Prediction

We constructed an Interview Quality Score (IQS) representing the combination of several qualitative and quantitative aspects of each interview. We ensembled boosted trees, support vector machines, and random forests to segregate high-quality interviews from those with issues.
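A minimal sketch of such an ensemble, using soft voting over boosted trees, an SVM, and a random forest on a synthetic feature matrix, is shown below; the real IQS model and features differ.

```python
# Illustrative sketch: ensemble of boosted trees, SVM, and random forest
# with soft voting to score interview quality (synthetic features).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 30))                       # audio + transcript features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 400) > 0).astype(int)  # 1 = good interview

ensemble = VotingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier(random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ],
    voting="soft",
)
print("CV AUC:", cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc").mean())
```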

[Image: Mental health and machine learning (2)]

This model was able to effectively pre-screen about 75% of the interviews as good or bad and was unsure about the remainder. Reviewers could now work faster and more productively, focusing only on the interviews where the model was not too confident. Overall prediction accuracy improved 2.5x, with some segments returning over 90% accuracy.

ML Models ‘Hear’ What’s Left Unsaid

The analyses provided independent insights regarding pauses, paralinguistics (tone of voice, loudness, inflection, pitch), speech disfluency (fillers like ‘er’, ‘um’), and physician performance during such interviews.

These models have wider applicability beyond clinical trials. Physicians can use model insights to guide treatment and therapy, leading to better mental health outcomes for their patients, whether in clinical trials or practice, addressing one of the critical public health challenges of our time.

References

[1] World Health Organization, United for Global Mental Health and the World Federation for Mental Health joint news release, 27 August 2020.
This article was first published in Analytics India Magazine – https://analyticsindiamag.com/machine-learning-mental-health-notes-from-tiger-analytics/

Credit Monitoring for SMEs: ML-Driven Early Warning Solutions
https://www.tigeranalytics.com/perspectives/blog/credit-monitoring-for-smes-ml-driven-early-warning-solutions/ | Thu, 15 Oct 2020

Explore how machine learning elevates credit monitoring for SMEs and corporations. Delve into the use of ML models for early warning solutions, enhancing risk assessment, default prediction, and financial stability in the banking sector.

Companies, from small enterprises to giant corporations, represent a significant opportunity for banks and financial service providers to expand their credit-lending business. A robust and dynamic risk management strategy for credit monitoring empowers banks to take advantage of this opportunity regardless of whether economies are thriving or in turmoil. Banks must continually revise their prediction of whether their corporate customers are likely to face financial distress, and if so, when. Warned in advance, banks can take mitigative action to minimize or possibly avoid loss in the event of customer default.

Time to try something new

The established risk rating models employ company data such as financial ratios, industry classification, workforce, etc. alongside conventional credit payment behavioral variables. In our work with a major European bank and in reviewing existing research, we found that traditional statistical models were less efficient in providing early warnings for SMEs and start-ups where data from credit bureaus and public tracking agencies were unavailable.

There is an urgent need for more agile, more sensitive credit risk models that can leverage the wealth of internal transactional and behavioral data and depend less on the external sources that traditional models require. ML-based models efficiently capture complex non-linear relationships among a diverse set of variables.

As the start-up culture grows, financial institutions, wishing to make the most of credit lending opportunities in this uncharted market, are willing to experiment with new approaches that go beyond the legacy frameworks mandating ‘white box’ standard statistical approaches. ML-driven models are the right choice.

Developing ML models for early warning

On the face of it, predicting whether a company is likely to default on credit seems to be a standard classification problem, with a set of factors pointing towards the occurrence of a default. In practice, the primary challenge is to train models to recognize the risk as significant early enough for mitigative action. Working with what are usually considered ‘weak signals,’ the models are trained on behavioral data from at least three months prior to the actual default event.

Once the model design parameters (target event definition, gap between prediction and event period, etc.) are fixed, feature engineering comes into play. This involves defining both simple and complex variables that reflect the potential signals preceding a default event. Typical transformations include velocity variables to capture trends; standard deviations and z-scores to normalize client behavior within micro-clusters of similar clients by industry, size, and credit exposure; and other meaningful ratios.
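A small sketch of such transformations in pandas is shown below; column names, windows, and the micro-cluster definition are illustrative.

# A hedged sketch of velocity and z-score features; the data is illustrative.
import pandas as pd

# One row per client per month with a monthly credit utilization figure
df = pd.DataFrame({
    "client_id": ["A", "A", "A", "B", "B", "B"],
    "month": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"] * 2),
    "utilization": [0.40, 0.55, 0.70, 0.20, 0.22, 0.18],
    "industry": ["retail"] * 3 + ["logistics"] * 3,
})
df = df.sort_values(["client_id", "month"])

# Velocity: month-over-month change in utilization per client
df["utilization_velocity"] = df.groupby("client_id")["utilization"].pct_change()

# Z-score of utilization within an industry micro-cluster
grp = df.groupby("industry")["utilization"]
df["utilization_z"] = (df["utilization"] - grp.transform("mean")) / grp.transform("std")

print(df)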

The underlying data consist of transactions from current accounts and cards across instruments and channels; credit utilization and payment patterns within the bank; credit utilization data from the central bank; and ownership and features of other products and services availed within the bank or from other banks. Credit monitoring and quality analysts, with their expertise in customer behavior, provide many of the inputs used to identify these features.

Next, we must segment customers whose operations are alike and may have similar predictors of default. This step is important: a one-size-fits-all model may not call attention to specific clusters of customers who are underrepresented in the overall population of corporate borrowers.

One such segmentation criterion is data availability. For example, central banks typically provide credit utilization data only for companies above a certain exposure threshold. Segment-level models ensure that we do not have to deal with low fill-rates for variables that largely do not apply to a given segment. Other segmentation considerations are current credit exposure and the type of credit line.

With robust feature selection techniques, over 2,000 features can be reduced to under a hundred key variables that contribute significantly to default prediction.
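As an illustration, the sketch below shows tree-based feature selection with scikit-learn on synthetic data, reducing a wide feature set to a fixed number of top variables; it is not the pipeline used in the engagement.

# A minimal sketch of ranking features by importance and keeping the top 100.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=30, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=300, random_state=0),
    max_features=100,          # keep at most 100 variables
    threshold=-float("inf"),   # rank purely by importance
)
selector.fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (2000, 100)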

ML models perform better versus traditional techniques

In our work with a leading European bank, we evaluated several classification models, from the basic Logistic Regression to the more complex Random Forest (RF) and highly advanced techniques such as XGBoost (XG).

While Logistic Regression delivered accuracy comparable to RF and XG, XG had a larger AUC (that is, better power to distinguish defaulters from non-defaulters) and a consistently good K-S score of over 65% across segments. Further, the ML-based models performed about twice as well as the bank’s internal rule-based early warning system, both by defaulter count and by overall exposure value. XG was also superior in handling variables with sparse observations.
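For illustration, the sketch below shows how AUC and the K-S statistic can be computed from model scores on synthetic data; it is not the bank’s evaluation pipeline.

# A hedged sketch of computing AUC and the K-S statistic from model scores.
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, scores)
# K-S: maximum separation between the score distributions of defaulters and non-defaulters
ks = ks_2samp(scores[y_test == 1], scores[y_test == 0]).statistic
print(f"AUC = {auc:.3f}, K-S = {ks:.2%}")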

ML models are often criticized as being a black box, obscuring the role of predictors in determining the outcome. Packages like SHAP in Python enable non-practitioners to see the exact order of the top predictors at a customer level, giving them more confidence in the underlying signals and analysis that drive model results.
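A minimal sketch of generating such customer-level explanations with SHAP is shown below; the model and data are synthetic stand-ins.

# A hedged sketch of per-customer SHAP explanations for a tree-based model.
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = xgb.XGBClassifier(n_estimators=200).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank the top drivers for one customer (row 0) by absolute SHAP value
top = np.argsort(-np.abs(shap_values[0]))[:5]
print("Top feature indices for customer 0:", top)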

The top predictors vary by customer segments

We observed that for SMEs, corporate clients with lower exposure and overdraft account-based lines of credit, the top predictors of default risk are the current account balance and transaction-related variables. Next come credit utilization and overdraft trends over the preceding six months.

For the low exposure segment with term loans, top predictors include fund transfer behavior and delays in the six most recent payments, followed by trends in overdraft accounts.

In high exposure segments, credit utilized-to-granted ratios from central banks and other agencies are more influential together with the bank’s internal ratings, which reflect company-related information.

For businesses with a factoring line of credit with the bank, the typical expiry dates and credit utilized/days to expiry ratios were useful early predictors of the risk of default.

Changing business thinking for changing times

Developing and deploying models is only the first step in credit monitoring: the key challenge is getting financial institutions to adopt them with confidence.

It is important to ensure that the model is easy to interpret, especially in the context of early warnings, when even the top predictors look like ‘weak signals’ three to six months out from the actual event. Another challenge is having models that are dynamic and can adjust themselves to the new normal of accelerated change. Financial institutions should evaluate early warning models that can learn to normalize customer behavioral variables by changing macroeconomic and industry-specific indicators.

Coming soon: We share our perspectives from working with audit and compliance teams to unlock the potential of AI and ML to fight money laundering, cybersecurity attacks, and investment advisory fraud.

REST API with AWS SageMaker: Deploying Custom Machine Learning Models https://www.tigeranalytics.com/perspectives/blog/rest-api-with-aws-sagemaker-deploying-custom-machine-learning-models/ Thu, 17 Sep 2020 11:22:56 +0000 https://www.tigeranalytics.com/blog/rest-api-with-aws-sagemaker-deploying-custom-machine-learning-models/ Learn how to deploy custom Machine Learning (ML) models using AWS SageMaker and REST API. Understand the steps involved, including setting up the environment, training models, and creating endpoints for real-time predictions, as well as why to integrate ML models with REST APIs for scalable deployment.

Introduction

AWS SageMaker is a fully managed machine learning service. It supports building models with built-in algorithms, and it natively supports bring-your-own algorithms and ML frameworks such as Apache MXNet, PyTorch, SparkML, TensorFlow, and Scikit-Learn.

Why AWS SageMaker?

  • Developers and data scientists need not worry about infrastructure management or cluster utilization and can focus on experimentation and model development.
  • Supports the end-to-end machine learning workflow with integrated Jupyter notebooks, data labeling, hyperparameter optimization, and hosting of scalable inference endpoints with autoscaling to handle millions of requests.
  • Provides standard machine learning models, which are optimized to run against extremely large data in a distributed environment.
  • Supports multi-model training across multiple GPUs and leverages spot instances to lower the training cost.

Note: SageMaker’s built-in algorithms do not cover every use case; custom algorithms require building custom containers, as described below.

This post walks you through the process of deploying a custom machine learning model (bring your own algorithm), trained locally, as a REST API using SageMaker, Lambda, and Docker.


The process consists of five steps-

  • Step 1: Building the model and saving the artifacts.
  • Step 2: Defining the server and inference code.
  • Step 3: Building a SageMaker Container.
  • Step 4: Creating Model, Endpoint Configuration, and Endpoint.
  • Step 5: Invoking the model using Lambda with API Gateway trigger.

Step 1: Building the Model and Saving the Artifacts

First, we build the model and serialize the object, which is then used for prediction. In this post, we use simple linear regression (i.e., one independent variable). Once you serialize the Python object to a pickle file, you have to package that artifact (the pickle file) in tar.gz format and upload it to an S3 bucket.

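A minimal sketch of this step, assuming scikit-learn for the regression and boto3 for the upload (the bucket and key names are illustrative):

# Train a one-variable linear regression locally, pickle it, package it as
# model.tar.gz, and upload it to S3. Bucket and key names are placeholders.
import pickle
import tarfile

import boto3
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data for the sketch
X = np.arange(10).reshape(-1, 1)
y = 3 * X.ravel() + 2
model = LinearRegression().fit(X, y)

# Serialize the model object to a pickle file
with open("linear_model.pkl", "wb") as f:
    pickle.dump(model, f)

# SageMaker expects model artifacts packaged as a tar.gz archive
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("linear_model.pkl")

# Upload the artifact to S3
boto3.client("s3").upload_file("model.tar.gz", "my-model-bucket",
                               "linear-regression/model.tar.gz")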

Step 2: Defining the Server and Inference Code

When an endpoint is invoked, SageMaker interacts with the Docker container, which runs the inference code for hosting services, processes the request, and returns the response. Containers have to implement a web server that responds to /invocations and /ping on port 8080.

For the health check, the inference code in the container receives GET requests on /ping from the SageMaker infrastructure and should respond with an HTTP 200 status code and an empty body, indicating that the container is ready to accept inference requests at the /invocations endpoint.

Code: https://gist.github.com/NareshReddyy/9f1f9ab7f6031c103a0392d52b5531ad

To make the model available as a REST API, you need Flask, a WSGI (Web Server Gateway Interface) application framework; Gunicorn, the WSGI server; and nginx, the reverse proxy and load balancer.

Code: https://github.com/NareshReddyy/Sagemaker_deploy_own_model.git
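For reference, a stripped-down sketch of what the predictor can look like is shown below; the linked repository is the complete version, and the model path and input format here are illustrative.

# A minimal Flask predictor implementing /ping and /invocations.
import pickle

import flask

# SageMaker extracts model.tar.gz into /opt/ml/model inside the container
MODEL_PATH = "/opt/ml/model/linear_model.pkl"
with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)

app = flask.Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: an empty 200 response tells SageMaker the container is ready
    return flask.Response(response="", status=200, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    # Expect a single numeric value in the request body and return the prediction
    value = float(flask.request.data.decode("utf-8"))
    prediction = model.predict([[value]])[0]
    return flask.Response(response=str(prediction), status=200,
                          mimetype="text/plain")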

Step 3: Building a SageMaker Container

SageMaker uses Docker containers extensively. You can package your scripts, algorithms, and inference code in these containers, together with the runtime, system tools, libraries, and any other code needed to deploy your models, which gives you flexibility in how the model runs. The Docker images are built from scripted instructions provided in a Dockerfile.


The Dockerfile describes the image that you want to build, with a complete installation of the system you want to run. You can use a standard Ubuntu installation as a base image and run the usual tools to install whatever your inference code needs. You then copy the folder (Linear_regx) containing nginx.conf, predictor.py, serve, and wsgi.py to /opt/code and make it the working directory.

The Amazon SageMaker Containers library places the scripts that the container will run in the /opt/ml/code/ directory.

Code: https://gist.github.com/NareshReddyy/2aec71abf8aca6bcdfb82052f62fbc23

To build a local image, use the following command-

docker build -t <image-name> .

Create a repository in AWS ECR and tag the local image to that repository.

The repository has the following structure:

<account number>.dkr.ecr.<region>.amazonaws.com/<image name>:<tag>

docker tag <image-name> <repository-name>:<image-tag>

Before pushing the image, configure the AWS CLI and authenticate with ECR. The ECR login command (for example, aws ecr get-login) returns something like docker login -u AWS -p xxxxx; execute that output to log Docker in to ECR.

docker push <repository-name>:<image-tag>

Step 4: Creating Model, Endpoint Configuration, and Endpoint

Models can be created using the API or the AWS Management Console. Provide the model name and an IAM role.

Under the Container definition, choose Provide model artifacts and inference image location and provide the S3 location of the artifacts and Image URI.

After creating the model, create Endpoint Configuration and add the created model.

When you have multiple models to host, instead of creating numerous endpoints, you can choose Use multiple models to host them under a single endpoint (this is also a cost-effective method of hosting).

You can change the instance type and instance count and enable Elastic Inference (EI) based on your requirements. You can also enable data capture, which saves prediction requests and responses in an S3 bucket, providing the option to set alerts for deviations in model quality, such as data drift.

Create Endpoint using the existing configuration
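The same three resources can also be created programmatically with boto3, as in the hedged sketch below; the role ARN, image URI, and S3 path are placeholders that must be replaced.

# Create the model, endpoint configuration, and endpoint with boto3.
import boto3

sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="linear-regression-byoc",
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder
    PrimaryContainer={
        "Image": "<account-id>.dkr.ecr.<region>.amazonaws.com/<image-name>:<tag>",
        "ModelDataUrl": "s3://my-model-bucket/linear-regression/model.tar.gz",
    },
)

sm.create_endpoint_config(
    EndpointConfigName="linear-regression-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "linear-regression-byoc",
        "InstanceType": "ml.t2.medium",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="linear-regression-endpoint",
    EndpointConfigName="linear-regression-config",
)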

Step 5: Invoking the Model Using Lambda with API Gateway Trigger

Create Lambda with API Gateway trigger.

In the API Gateway trigger configuration, add a REST API to your Lambda function to create an HTTP endpoint that invokes the SageMaker endpoint.

In the function code, read the request received from the API Gateway, pass the input to invoke_endpoint, and capture and return the response to the API Gateway.
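A minimal sketch of such a Lambda handler, assuming a JSON request body and the endpoint created earlier (names and the payload shape are illustrative):

# Lambda handler that forwards the request to the SageMaker endpoint.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # API Gateway passes the HTTP request body as a string
    body = json.loads(event["body"])
    payload = str(body["value"])

    response = runtime.invoke_endpoint(
        EndpointName="linear-regression-endpoint",
        ContentType="text/plain",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")

    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction}),
    }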

When you open the API Gateway, you can see the API created by the Lambda function. Now you can create the required method (POST), integrate it with the Lambda function, and test it by providing input in the request body and checking the output.

You can test your endpoint either by using SageMaker notebooks or Lambda.

Conclusion

SageMaker enables you to build complex ML models with a wide variety of options to build, train, and deploy in an easy, highly scalable, and cost-effective way. Following the above illustration, you can deploy a machine learning model as a serverless API using SageMaker.

A Beginner’s Guide: Enter the World of Bayesian Belief Networks https://www.tigeranalytics.com/perspectives/blog/a-beginners-guide-enter-the-world-of-bayesian-belief-networks/ Thu, 09 Jul 2020 11:13:56 +0000 https://www.tigeranalytics.com/blog/a-beginners-guide-enter-the-world-of-bayesian-belief-networks/ Get to know what makes Bayesian Belief Networks (BBNs) a powerful tool in data science with an introduction to the concepts behind BBNs, their structure, applications, and benefits. Learn how BBNs use probabilistic reasoning to model complex systems and improve decision-making processes.

Introduction

In the world of machine learning and advanced analytics, every day data scientists solve tons of problems with the help of newly developed and sophisticated AI techniques. The main focus while solving these problems is to deliver highly accurate and error-free results. However, while implementing these techniques in a business context, it is essential to provide a list of actionable levers/drivers of the model output that the end-users can use to make business decisions. This requirement applies to solutions developed across industries. One such machine learning technique that focuses on providing such actionable insights is the Bayesian Belief Network, which is the focus of this blog. The assumption here is that the reader has some understanding of machine learning and some of the associated terminologies.

Several approaches are currently being used to understand these drivers/levers. However, most of them follow a simple approach to understand the direct cause and effect relationship between the predictors and the target. The main challenges with such an approach are that:

1. The focus remains on the relationship between predictors and target, and not on the inter-relationship between the predictor attributes. A simple example is a categorical variable with various states.

2. It is assumed that each of the predictors has a direct relationship with the target variables, while, in reality, the variables could be correlated and connected. Also, the influence of one predictor on another is ignored while calculating the overall impact of the predictor on the target. For example, in almost every problem, we try to handle multicollinearity by choosing a correlation cut-off to disregard near-collinear variables. Nevertheless, complete removal of multicollinearity from the model is rare.

The Bayesian Network can be utilized to address this challenge. It helps in understanding the drivers without ignoring the relationship among variables. It also provides a framework for the prior assessment of the impact of any actions that have to be taken to improve the outcome. A unique feature of this approach is that it allows for the propagation of evidence through the network.

Before getting into the details of driver analysis using Bayesian Network, let us discuss the following:

1. The Bayesian Belief Network

2. Basic concepts behind the BBN

3. Belief Propagation

4. Constructing a discrete Bayesian Belief Network

1. The Bayesian Belief Network

A Bayesian Belief Network (BBN) is a computational model that is based on graph probability theory. The structure of BBN is represented by a Directed Acyclic Graph (DAG). Formally, a DAG is a pair (N, A), where N is the node-set, and A is the arc-set. If there are two nodes u and v belonging to N, and there is an arc going from u to v, then u is termed as the parent of v and v is called the child of u. In terms of the cause-effect relationship, u is the cause, and v is the effect. A node can be a parent of another node while also being the child to a different node. An example is illustrated in the image below-

[Figure: a simple DAG in which nodes a and b each point to u, and u points to v]

a and b are parents of u, and u is the child of a and b. At the same time, u is also the single parent of v. With respect to the cause-effect relationship, a and b are direct causes of u, and u directly causes v, implying that a and b are indirectly responsible for the occurrence of v.

2. Basic concepts behind the BBN

The Bayesian Belief Network is based on the Bayes Theorem. A brief overview of it is provided below-

Bayes Theorem

For two random variables X and Y, the following equation holds-

P(X|Y) = P(Y|X) * P(X) / P(Y)

If X and Y are independent of each other, then

P(X|Y) = P(X)

Joint Probability

Given that X1, X2, …, Xn are the features (nodes) in a BBN, the joint probability is defined as:

P(X1, X2, …, Xn) = Π over i of P(Xi | Parents(Xi)), where Parents(Xi) denotes the parent nodes of Xi in the network

Marginal Probability

Given the joint probability, the marginal probability of X1 = x0 is calculated as:

P(X1 = x0) = Σ over x2, …, xn of P(X1 = x0, X2 = x2, …, Xn = xn)

where x2, x3, …, xn range over the sets of values corresponding to X2, X3, …, Xn.

3. Belief Propagation

Now let us try to understand belief propagation with the help of an example. We will consider an elementary network.

[Figure: a simple network in which a Train Strike node (T) points to two child nodes, Allen late (A) and Kelvin late (K)]

The network above says that a train strike influences Allen’s and Kelvin’s work timings. The probabilities are distributed as below:

P(T=Y) = 0.1, P(T=N) = 0.9
P(A=Y|T=Y) = 0.7, P(A=Y|T=N) = 0.6
P(K=Y|T=Y) = 0.6, P(K=Y|T=N) = 0.1

Given the train strike probability P(T) and the conditional probabilities P(K|T) and P(A|T), we can calculate P(A) and P(K).

P(A=Y) = Σ over T,K of P(A=Y, T, K) = P(A=Y,K=Y,T=Y) + P(A=Y,K=N,T=Y) + P(A=Y,K=Y,T=N) + P(A=Y,K=N,T=N)

= P(T=Y)*P(A=Y|T=Y)*P(K=Y|T=Y) + P(T=Y)*P(A=Y|T=Y)*P(K=N|T=Y)

+ P(T=N)*P(A=Y|T=N)*P(K=Y|T=N) + P(T=N)*P(A=Y|T=N)*P(K=N|T=N)

= P(A=Y|T=Y)*P(T=Y) + P(A=Y|T=N)*P(T=N)    (since P(K=Y|T=Y) + P(K=N|T=Y) = 1)

= 0.7*0.1 + 0.6*0.9 = 0.61

Similarly, using the shorter form:

P(K= Y) = P(K=Y|T=Y)*P(T=Y) + P(K=Y|T=N)*P(T=N)

= 0.6*0.1 + 0.1*0.9 = 0.15

Now, let us say we come to know that Allen is late, but we do not know whether there is a train strike. Can we estimate the probability that Kelvin will be late, given that we know Allen is late? Let us see how the evidence that Allen is late propagates through the network.

Let us estimate the probability of the train strike given we already know Allen is late.

P(T=Y|A=Y) = P(A=Y|T=Y) * P(T=Y) / P(A=Y) = 0.7*0.1/0.61 ≈ 0.115

The above calculation tells us that if Allen is late, then the probability that there is a train strike rises to about 0.115. We can use this updated belief about the train strike to calculate the probability of Kelvin being late.

P(K=Y|A=Y) = P(K=Y|T=Y)*P(T=Y|A=Y) + P(K=Y|T=N)*P(T=N|A=Y)

= 0.6*0.115 + 0.1*0.885 ≈ 0.16

This gives a slight increase in the probability of Kelvin being late (from 0.15 to about 0.16). So, the evidence that Allen is late propagates through the network and changes our belief about both the train strike and Kelvin being late.
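The same calculation can be reproduced in a few lines of plain Python, using the probabilities from the example above:

# Belief propagation for the Train Strike example, done by hand.
p_t = {"Y": 0.1, "N": 0.9}                 # P(Train Strike)
p_a_given_t = {"Y": 0.7, "N": 0.6}         # P(Allen late = Y | T)
p_k_given_t = {"Y": 0.6, "N": 0.1}         # P(Kelvin late = Y | T)

# Prior beliefs
p_a = sum(p_a_given_t[t] * p_t[t] for t in p_t)   # 0.61
p_k = sum(p_k_given_t[t] * p_t[t] for t in p_t)   # 0.15

# Evidence: Allen is late. Update the belief about the train strike (Bayes rule)
p_t_given_a = {t: p_a_given_t[t] * p_t[t] / p_a for t in p_t}

# Propagate the updated belief to Kelvin
p_k_given_a = sum(p_k_given_t[t] * p_t_given_a[t] for t in p_t)

print(round(p_a, 2), round(p_k, 2), round(p_t_given_a["Y"], 3), round(p_k_given_a, 2))
# 0.61 0.15 0.115 0.16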


4. Constructing a discrete Bayesian Belief Network

BBN can be constructed using only continuous variables, only categorical variables, or a mix of variables. Here, we will discuss discrete BBN, which is built using categorical variables only. There are two major constituents to constructing a BBN: Structure Learning and Parameter Learning.

Structure Learning

Structure learning is the basis of Bayesian Belief Network analysis. The effectiveness of the solution depends on the optimality of the learned structure. We can use the following approaches:

1. Create a structure based on domain knowledge and expertise.

2. Create an optimal local structure using machine learning algorithms. Please note that finding an optimal global structure is an NP-hard problem. There are many algorithms to learn the structure, such as K2, hill climbing, and tabu search. You can learn more about these from the bnlearn package in R; Python aficionados can refer to the following link- https://pgmpy.chrisittner.de/

3. Create a structure using a combination of both the above approaches- use machine learning techniques to build the model, and with the reduced set of explanatory variables, use domain knowledge/expertise to create the structure. Of the three, this is the quickest and most effective way.

Parameter Learning

Another major component of a BBN is the Conditional Probability Table (CPT). Since each node in the structure is a random variable, it can take multiple values/states. Each state will have some probability of occurrence. We call these probabilities Beliefs. Also, each node is connected to other nodes in the network. As per the structure, we learn the conditional probability of each state of a node. The tabular form of all such probabilities is called the CPT.
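As a concrete illustration, a CPT for a discrete node can be estimated from data by conditional frequency counts; the sketch below uses pandas on a small illustrative dataset (packages such as bnlearn or pgmpy perform this parameter learning automatically).

# Estimate P(AllenLate | TrainStrike) from observed data by normalizing counts
# within each parent state. The data below is illustrative only.
import pandas as pd

data = pd.DataFrame({
    "TrainStrike": ["Y", "Y", "N", "N", "N", "N", "Y", "N", "N", "N"],
    "AllenLate":   ["Y", "Y", "N", "Y", "N", "Y", "N", "Y", "N", "N"],
})

cpt_allen = pd.crosstab(data["TrainStrike"], data["AllenLate"], normalize="index")
print(cpt_allen)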

Conclusion

This blog aims to equip you with the bare minimum concepts required to construct a discrete BBN and understand its various components. Structure learning and parameter learning are the two major components necessary to build a BBN. The concepts of the Bayes theorem and joint and marginal probability form the base of the network, while the propagation of evidence explains how the network functions.

A BBN can be used like any other machine learning technique. However, it works best where there are interdependencies among the predictors and the number of predictors is small. The best part of the BBN is its intuitive way of explaining the drivers of evidence.

Stay tuned to this space to learn how this concept can be applied for event prediction, driver analysis, and intervention assessment of a given classification problem.

How to Implement ML Models: Azure and Jupyter for Production https://www.tigeranalytics.com/perspectives/blog/how-to-implement-ml-models-azure-and-jupyter-for-production/ Thu, 28 May 2020 20:46:56 +0000 https://www.tigeranalytics.com/blog/how-to-implement-ml-models-azure-and-jupyter-for-production/ Learn how to implement Machine Learning models using Azure and Jupyter for production environments - from model development to deployment, including environment setup, training, and real-time predictions. Understand the advantages of using Azure's robust infrastructure and Jupyter's flexible interface to streamline the entire process.

Introduction

As Data Scientists, one of the most pressing challenges we have is how to operationalize machine learning models so that they are robust, cost-effective, and scalable enough to handle the traffic demand. With advanced cloud technologies and serverless computing, there are now cost-effective (pay based on usage) and auto-scalable platforms (with scale-in/scale-out architecture depending on the traffic) available. Data scientists can use these to accelerate the machine learning model deployment without having to worry about the infrastructure.

This blog discusses one such methodology: taking machine learning code and a model developed locally in a Jupyter notebook and implementing them in the Azure environment for real-time predictions.

ML Implementation Architecture

[Figure: ML implementation architecture on Azure]

We used Azure Functions to deploy the model scoring and feature store creation code into production. Azure Functions is a FaaS offering (Function as a Service, or FaaS, provides event-based, serverless computing to accelerate development without having to worry about the infrastructure). Azure Functions comes with some interesting functionalities, such as-

1. Choice of Programming Languages

You can work with any language of your choice: C#, Node.js, Java, or Python.

2. Event-driven and Scalable

You can use built-in triggers and bindings such as http trigger, event trigger, timer trigger, and queue trigger to define when a function is invoked. The architecture is scalable, depending on the workload.

ML Implementation process

Once the code is developed, the following steps and best practices make the machine learning code production-ready and deploy it as an Azure Function.

[Figure: ML implementation process]

Azure Function Deployment Steps Walkthrough

The Visual Studio Code editor with the Azure Functions extension is used to create a serverless HTTP endpoint with Python.

1. Sign in to Azure


2. Create a New Project. In the prompt that shows up, select the Language as Python, Trigger as http trigger (based on the requirement)


3. The Azure Function is created with a standard folder structure. Write your logic in __init__.py, or copy in the code if it is already developed (a minimal sketch follows below).

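A minimal sketch of what __init__.py can look like for an HTTP-triggered scoring function is shown below; the request shape and scoring logic are placeholders, and the typical project layout is noted in the comments.

# Typical project layout (illustrative):
#   <function-app>/
#     <function-name>/__init__.py, function.json
#     host.json, local.settings.json, requirements.txt
import json

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    try:
        payload = req.get_json()
    except ValueError:
        return func.HttpResponse("Invalid JSON body", status_code=400)

    # Placeholder scoring logic; later steps replace this with the model
    # loaded from Blob Storage and features read from Azure SQL DB
    score = sum(payload.get("features", []))

    return func.HttpResponse(
        json.dumps({"score": score}),
        mimetype="application/json",
        status_code=200,
    )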

4. function.json defines the bindings; in this case, the function is bound to an HTTP trigger.


5. local.settings.json contains all the environment variables used in the code as key-value pairs.


6. requirements.txt lists all the libraries that need to be pip-installed.


7. As the model is stored in Blob Storage, add code along the following lines to read it from Blob Storage.

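A minimal sketch of such a read, assuming the azure-storage-blob v12 SDK and a connection string kept in application settings (container and blob names are illustrative):

# Load the pickled model from Blob Storage.
import os
import pickle

from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
blob_client = blob_service.get_blob_client(container="models", blob="model.pkl")
model = pickle.loads(blob_client.download_blob().readall())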

8. Read the Feature Store data from Azure SQL DB

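A minimal sketch of reading the feature store with pyodbc and pandas, assuming an ODBC connection string in application settings (the table, column, and parameter names are illustrative):

# Read feature store rows for one customer from Azure SQL DB.
import os

import pandas as pd
import pyodbc

conn = pyodbc.connect(os.environ["AZURE_SQL_CONNECTION_STRING"])
features = pd.read_sql("SELECT * FROM feature_store WHERE customer_id = ?",
                       conn, params=["C123"])
conn.close()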

9. Test locally. Choose Debug -> Start Debugging; it will run locally and give a local API endpoint


10. Publish to Azure Account using the following

func azure functionapp publish <function-app-name> --build remote --additional-packages "python3-dev libevent-dev unixodbc-dev build-essential libssl-dev libffi-dev"


11. Log in to the Azure Portal and go to the Azure Functions resource to get the API endpoint for model scoring.


Conclusion

The resulting API can be integrated with front-end applications for real-time predictions.

Happy Learning!
