Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security https://www.tigeranalytics.com/perspectives/blog/invisible-threats-visible-solutions-integrating-aws-macie-and-tiger-data-fabric-for-ultimate-security/ Thu, 07 Mar 2024 07:03:07 +0000 https://www.tigeranalytics.com/?post_type=blog&p=20726 Data defenses are now fortified against potential breaches with the Tiger Data Fabric-AWS Macie integration, automating sensitive data discovery, evaluation, and protection in the data pipeline for enhanced security. Explore how to integrate AWS Macie into a data fabric.

Discovering and handling sensitive data in a data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and bearing the associated resource and compute costs. Identifying sensitive information at the entry point of the data pipeline, typically during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. At Tiger Analytics, we’ve integrated these capabilities into the pipelines of our proprietary Data Fabric solution, the Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has been enhanced recently to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information and results are captured and stored at another path in the S3 bucket.

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate “sensitive data management” into their data pipelines, here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data in various stages of processing: a raw data bucket for uploading objects into the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket that holds objects in which sensitive data was discovered, and a scanned data bucket from which the next ingestion step of the data pipeline starts.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage the business logic and execute the steps below (a minimal code sketch of the two Lambda steps appears after this walkthrough):
    • triggerMacieJob: This Lambda function creates a Macie sensitive data discovery job on the designated S3 bucket during the scan stage.
    • pollWait: This Step Functions state waits for a specific state to be reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step Functions Choice state determines whether the job has finished. If it has, it triggers the additional steps.
    • waitForJobToComplete: This Step Functions Choice state waits for the job to complete, preventing the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database, ensuring that all job statuses are accurately reflected.
  • A Macie scan job scans the specified S3 bucket for sensitive data. Creating the Macie job involves multiple steps and lets us choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 path of the sensitive data buckets to scope the scan and avoid unnecessary scanning of other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the Tiger Data Fabric, we have automated the selection of identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be viewed in the AWS console and filtered by S3 bucket. We use Glue jobs to parse the results and route the data to the manual review and raw buckets. The Macie job takes around 4-5 minutes to execute. After scanning, if there are fewer than 1000 sensitive records, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks for severity and the report in the specified path. If sensitive data is detected, it is moved to the quarantine bucket. We format this data into parquet and process it using Spark data frames.

    • On inspecting the Parquet file, the sensitive data is clearly visible in the SSN and phone number columns.
    • Once sensitive data is found, the same file is moved to the quarantine bucket.

      If no sensitive records are found, the data is moved to the raw zone, from where it is sent on to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or run a custom Airflow deployment on EC2 or EKS (a sample DAG sketch follows this walkthrough):
    • GlueJobOperator: Executes all the Glue jobs before and after the Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to complete.
    • StepFunctionGetExecutionOutputOperator: Gets the output from the Step Function.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.
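To make the walkthrough above more concrete, here is a minimal sketch of what the triggerMacieJob and checkJobStatus Lambda functions could look like with boto3. This is not our production code: the identifier list, job name pattern, and event fields (src_sys_id, data_asset_id) are illustrative assumptions, and only the scan bucket name is taken from the example above.

    import uuid
    import boto3

    macie = boto3.client("macie2")
    sts = boto3.client("sts")

    SCAN_BUCKET = "zdf-fmwrk-macie-scan-zn-us-east-2"  # scan-stage bucket from the example above
    IDENTIFIER_IDS = [  # verify exact IDs against Macie's managed data identifier list
        "CREDIT_CARD_NUMBER", "USA_PASSPORT_NUMBER", "USA_SOCIAL_SECURITY_NUMBER",
    ]

    def trigger_macie_job(event, context):
        """Create a one-time Macie classification job scoped to the scan bucket."""
        account_id = sts.get_caller_identity()["Account"]
        response = macie.create_classification_job(
            clientToken=str(uuid.uuid4()),
            jobType="ONE_TIME",
            name=f"df-scan-{event['src_sys_id']}-{event['data_asset_id']}",  # assumed event fields
            managedDataIdentifierSelector="INCLUDE",
            managedDataIdentifierIds=IDENTIFIER_IDS,
            s3JobDefinition={
                "bucketDefinitions": [{"accountId": account_id, "buckets": [SCAN_BUCKET]}]
            },
        )
        return {"jobId": response["jobId"]}

    def check_job_status(event, context):
        """Return the job status so the Step Functions Choice state can branch on it."""
        job = macie.describe_classification_job(jobId=event["jobId"])
        return {"jobId": event["jobId"], "jobStatus": job["jobStatus"]}  # e.g. RUNNING, COMPLETE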
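The Airflow side can be sketched in a similar spirit. The DAG ID, Glue job names, and state machine ARN below are placeholders rather than values from our environment; the point is simply how the operators listed above chain together.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.operators.step_function import (
        StepFunctionStartExecutionOperator,
        StepFunctionGetExecutionOutputOperator,
    )
    from airflow.providers.amazon.aws.sensors.step_function import StepFunctionExecutionSensor

    with DAG(
        dag_id="asset_100000_sensitive_scan",  # one DAG per data asset (placeholder ID)
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,                # event-driven; use a cron string for time-driven assets
        catchup=False,
    ) as dag:
        stage_for_scan = GlueJobOperator(
            task_id="stage_data_for_scan",
            job_name="df-pre-macie-stage",     # placeholder Glue job name
        )
        start_scan = StepFunctionStartExecutionOperator(
            task_id="start_macie_state_machine",
            state_machine_arn="arn:aws:states:us-east-2:111122223333:stateMachine:macie-scan",  # placeholder
        )
        wait_for_scan = StepFunctionExecutionSensor(
            task_id="wait_for_macie_state_machine",
            execution_arn=start_scan.output,   # XCom from the start task
        )
        scan_output = StepFunctionGetExecutionOutputOperator(
            task_id="get_macie_state_machine_output",
            execution_arn=start_scan.output,
        )
        parse_findings = GlueJobOperator(
            task_id="parse_macie_findings",
            job_name="df-macie-results-parser",  # placeholder Glue job name
        )

        stage_for_scan >> start_scan >> wait_for_scan >> scan_output >> parse_findings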

Things to Keep in Mind for Effective Implementation

  • Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning S3 buckets/objects and generates reports that various users can consume and act on. If the requirement is to string it into a pipeline and automate actions based on those reports, a custom process must be created around it.
  • Macie stops reporting the location of sensitive data after 1000 occurrences of the same detection type. However, this quota can be increased by requesting AWS. It is important to keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset. If the sensitive data occurrences per detection type exceed 1000, we move the entire file to the quarantine zone.
  • For data elements that Macie doesn’t consider sensitive by default, custom data identifiers help a lot. They can be defined via regular expressions, and their sensitivity can also be customized. Organizations whose data governance teams deem certain data internally sensitive can use this feature (see the sketch below).
  • Macie also provides an allow list, which helps ignore data elements that Macie would otherwise tag as sensitive.
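As an illustration of the custom data identifier point above, an internal identifier deemed sensitive by a governance team could be registered roughly as follows. The name, regex, and keywords describe a made-up employee ID format, not a real identifier from our framework.

    import boto3

    macie = boto3.client("macie2")

    # Hypothetical internal format: employee IDs such as EMP-1234567
    response = macie.create_custom_data_identifier(
        name="internal-employee-id",
        description="Employee IDs classified as sensitive by the data governance team",
        regex=r"EMP-\d{7}",
        keywords=["employee id", "emp id"],  # optional: only report matches near these words
        maximumMatchDistance=50,
    )
    print(response["customDataIdentifierId"])  # reference this ID when creating Macie scan jobs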

The AWS Macie and Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as regular-expression-based identifiers and suppression rules within the data fabric they are working on, data engineers gain greater control over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

How to Design your own Data Lake Framework in AWS https://www.tigeranalytics.com/perspectives/blog/how-to-design-your-own-data-lake-framework-in-aws/ Mon, 29 Aug 2022 17:09:37 +0000 https://www.tigeranalytics.com/?p=9211 Learn how you can efficiently build a Data Lakehouse with Tiger Data Fabric's reusable framework. We leverage AWS's native services and open-source tools in a modular, multi-layered architecture. Explore our insights and core principles to tailor a solution for your unique data challenges.

“Data is a precious thing and will last longer than the systems themselves.”– Tim Berners-Lee, inventor of the World Wide Web

Organizations spend a lot of time and effort building pipelines to consume and publish data coming from disparate sources within their Data Lake. Most of the time and effort in large data initiatives is consumed by data ingestion development.

What’s more, with an increasing number of businesses migrating to the cloud, breaking down data silos and enhancing the discoverability of data environments have become business priorities.

While the Data Lake is the heart of data operations, capabilities like data security, data quality, and a metadata store should be carefully tied into the ecosystem.

Properties of an Enterprise Data Lake solution

In a large-scale organization, the Data Lake should possess these characteristics:

  • Data Ingestion- the ability to consume structured, semi-structured, and unstructured data
  • Supports push (batch and streaming systems) and pull (DBs, APIs, etc.) mechanisms
  • Data security through sensitive data masking, tokenization, or redaction
  • Natively available rules through the Data Quality framework to filter impurities
  • Metadata Store, Data Dictionary for data discoverability and auditing capability
  • Data standardization for common data format

A common reusable framework is needed to reduce the time and effort in collecting and ingesting data. At Tiger Analytics, we are solving these problems by building a scalable platform within AWS using AWS-native services and open-source tools. We’ve adopted a modular design and a loosely coupled, multi-layered architecture. Each layer provides a distinctive capability and communicates with the others via APIs, messages, and events. The platform abstracts complex backend processes and provides a simple, easy-to-use UI for stakeholders.

  • Self-service UI to quickly configure data workflows
  • Configuration-based backend processing
  • AWS cloud native and open-source technologies
  • Data Provenance: data quality, data masking, lineage, recovery and replay audit trail, logging, notification

Before exploring the architecture, let’s understand a few logical components referenced in the blog.

  • Sources are individual entities that are registered with the framework. They align with systems that own one or more data assets. The system could be a database, a vendor, or a social media website. Entities registered within the framework store various system properties. For instance, if it is a database, then DB Type, DB URL, host, port, username, etc.
  • Assets are the entries within the framework. They hold the properties of individual files from various sources. Metadata of source files include column names, data types, security classifications, DQ rules, data obfuscation properties, etc.
  • Targets organize data as per enterprise needs. There are various domains/sub-domains to store the data assets. Based on the subject area of the data, the files can be stored in their specific domains.

The Design Outline

With the demands to manage large volumes of data increasing year on year, our data fabric was designed to be modular, multi-layered, customizable, and flexible enough to suit individual needs and use cases. Whether it is a large banking organization with millions of transactions per day and a strong focus on data security or a start-up that needs clean data to extract business insights, the platform can help everyone.

Following the same modular and multi-layered design principle, we, at Tiger, have put together the architecture with the provision of swapping out components or tools if needed. Keeping in mind that the world of technology is ever-changing and volatile we’ve built flexibility into the system.

  • UI Portal provides a user-friendly self-service interface to set up and configure sources, targets, and data assets. These elements drive the data consumption from the source to Data Lake. These self-service applications allow the federation of data ingestion. Here, data owners manage and support their data assets. Teams can easily onboard data assets without building individual pipelines. The interface is built using ReactJS with Material-UI, for high-quality front-end graphics. The portal is hosted on an AWS EC2 instance, for resizable compute capacity and scalability.
  • API Layer is a set of APIs that invoke various functionalities, including CRUD operations and AWS service setup. These APIs create the source, asset, and target entities. The layer supports both synchronous and asynchronous APIs. API Gateway and Lambda functions form the base of this component. Moreover, DynamoDB captures requested events for audit and support purposes.
  • Config and Metadata DB is the data repository to capture the control and configuration information. It holds the framework together through a complex data model which reduces data redundancy and provides quick query retrieval. The framework uses AWS RDS PostgreSQL which natively implements Multiversion Concurrency Control (MVCC). It provides point-in-time consistent views without read locks, hence avoiding contentions.
  • Orchestration Layer strings together tasks with dependencies and relationships within a data pipeline. These pipelines are built on Apache Airflow. Every data asset has its own pipeline, providing more granularity and control over individual flows. Individual DAGs are created through an automated process called the DAG-Generator, a Python-based program tied to the API that registers data assets. Every time a new asset is registered, the DAG-Generator creates a DAG based on the configurations and uploads it to the Airflow server (a simplified sketch follows this list). These DAGs may be time-driven or event-driven, depending on the source system.
  • Execution Layer is the final layer where the magic happens. It comprises various individual Python-based programs running as AWS Glue jobs. We will see more about this in the following section.
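To give a feel for the DAG-Generator, here is a deliberately simplified sketch of the idea: render a DAG file from the asset configuration held in the metadata store. The template, field names (asset_id, cron, glue_job_name), and the DAGs folder path are assumptions for illustration, not the framework's actual code.

    from pathlib import Path

    DAG_TEMPLATE = '''
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

    with DAG(
        dag_id="{dag_id}",
        start_date=datetime(2024, 1, 1),
        schedule_interval={schedule},
        catchup=False,
    ) as dag:
        ingest = GlueJobOperator(task_id="ingest", job_name="{glue_job}")
    '''

    def generate_dag(asset_config: dict, dags_folder: str = "/opt/airflow/dags") -> Path:
        """Called at asset registration time to create one DAG per data asset."""
        dag_id = f"asset_{asset_config['asset_id']}_pipeline"
        schedule = repr(asset_config.get("cron"))  # e.g. '0 2 * * *' for time-driven, None for event-driven
        dag_code = DAG_TEMPLATE.format(
            dag_id=dag_id,
            schedule=schedule,
            glue_job=asset_config["glue_job_name"],
        )
        out_file = Path(dags_folder) / f"{dag_id}.py"
        out_file.write_text(dag_code)  # in practice, pushed to the Airflow server's DAGs location
        return out_file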

Data Pipeline (Execution Layer)

A data pipeline is a set of tools and processes that automate the data flow from the source to a target repository. The data pipeline moves data from the onboarded source to the target system.

Figure 3: Concept Model – Execution Layer
  • Data Ingestion
  • Several patterns affect the way we consume/ingest data. They vary depending on the source systems and consuming frequency. For instance, ingesting data from a database requires additional capabilities compared to consuming a file dropped by a third-party vendor.

    Figure 4: Data Ingestion Quadrant

    The Data Ingestion Quadrant is our base outline for defining consumption patterns. Depending on the properties of the data asset, the framework has the intelligence to use the appropriate pipeline for processing. To achieve this, we have individual S3 buckets for time-driven and event-driven sources. A driver Lambda function externally triggers event-driven Airflow DAGs, while cron expressions within the DAGs drive time-driven schedules.

    These capabilities consume different file formats like CSV, JSON, XML, parquet, etc. Connector libraries are used to pull data from various databases like MySQL, Postgres, Oracle, and so on.

  • Data Quality
  • Data is the core component of any business operation. Data Quality (DQ) in any enterprise system determines its success. The data platform requires a robust DQ framework that promises quality data in enterprise repositories.

    For this framework, we use Deequ, an open-source library from AWS built on top of Apache Spark; its Python interface is called PyDeequ. Deequ provides data profiling capabilities, suggests DQ rules, and executes several checks. We have divided DQ checks into two categories:

    Default Checks are the DQ rules that automatically apply to the attributes. For instance, Length Check, Datatype Check, and Primary Key Check. These data asset properties are defined while registering in the system.

    Advanced Checks are the additional DQ rules. They are applied to various attributes based on the user’s needs. The user defines these checks and stores them in the metadata.

    The DQ framework pulls these checks from the metadata store and identifies the default checks through data asset properties. It then constructs a bulk check module for execution. DQ results are stored in the backend database, and the logs are stored in an S3 bucket for detailed analysis. The DQ summary available in the UI provides additional transparency to business users. (A minimal PyDeequ sketch of this flow appears at the end of this section.)

  • Data Obfuscation/Masking
  • Data masking is the capability of dealing with sensitive information. While registering a data asset, the framework has a provision to enable tokenization on sensitive columns. The Data Masking task uses an internal algorithm and a key (associated with the framework and stored in AWS Secrets Manager) to tokenize those columns before storing them in the Data Lake. The attributes can be detokenized through user-defined functions, which require additional key access to block attempts by unauthorized users.

    The framework also supports other forms of irreversible data obfuscation, such as Redaction and Data Perturbation.

  • Data Standardization

Data standardization brings data into a common format, allowing data to be accessed using a common set of tools and libraries. The framework executes standardization operations to keep data consistent (a short PySpark sketch follows this list); it can:

  1. Standardize target column names.
  2. Support file conversion to parquet format.
  3. Remove leading zeroes from integer/decimal columns.
  4. Standardize target column datatypes.
  5. Add partitioning column.
  6. Remove leading and trailing white spaces from string columns.
  7. Support date format standardization.
  8. Add control columns to target data sets.
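Here is the PyDeequ sketch referenced in the Data Quality section above. The S3 path and column names are illustrative; in the actual framework the checks are assembled from the metadata store rather than hard-coded.

    import os
    os.environ.setdefault("SPARK_VERSION", "3.3")  # recent PyDeequ versions use this to pick the matching Deequ jar

    import pydeequ
    from pyspark.sql import SparkSession
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

    df = spark.read.parquet("s3://raw-zone/src_sys_id=100/data_asset_id=100000/")  # illustrative path

    # Default checks derived from asset properties (primary key, length), plus one advanced check
    check = (Check(spark, CheckLevel.Error, "default-checks")
             .isComplete("customer_id")                          # primary-key column must not be null
             .isUnique("customer_id")                            # primary-key column must be unique
             .hasMaxLength("customer_name", lambda l: l <= 100)  # length check
             .isNonNegative("order_amount"))                     # example of a user-defined advanced check

    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
    result_df.show(truncate=False)  # in the framework, persisted to the backend DB and S3 logs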
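Likewise, a few of the standardization rules listed above can be illustrated with a small PySpark helper. The column names, control columns, and output path are placeholders.

    from pyspark.sql import DataFrame, functions as F

    def standardize(df: DataFrame, load_date: str) -> DataFrame:
        """Illustrative subset of the rules: rename, trim, and add partition/control columns."""
        # 1. Standardize target column names (lower snake_case)
        for old in df.columns:
            df = df.withColumnRenamed(old, old.strip().lower().replace(" ", "_"))

        # 6. Remove leading and trailing white spaces from string columns
        for field in df.schema.fields:
            if field.dataType.simpleString() == "string":
                df = df.withColumn(field.name, F.trim(F.col(field.name)))

        # 5 and 8. Add the partitioning column and control columns
        return (df.withColumn("load_date", F.lit(load_date))
                  .withColumn("ingest_ts", F.current_timestamp()))

    # 2. Converted to parquet on write, partitioned for the data lake (example usage)
    # standardize(raw_df, "2024-01-01").write.mode("append").partitionBy("load_date").parquet("s3://raw-zone/asset/")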

Through this blog, we’ve shared insights on our generic architecture to build a Data Lake within the AWS ecosystem. While we can keep adding more capabilities to solve real-world problems, this is just a glimpse of data challenges that can be addressed efficiently through layered and modular design. You can use these learnings to put together the outline of a design that works for your use case while following the same core principles.

Revolutionizing SMB Insurance with AI-led Underwriting Data Prefill Solutions https://www.tigeranalytics.com/perspectives/blog/data-prefill-enables-insurers-accelerate-commercial-underwriting/ Wed, 29 Sep 2021 17:10:55 +0000 https://www.tigeranalytics.com/?p=5766 US SMBs often struggle with complex and time-consuming insurance processes, leading to underinsurance. Tiger Analytics’ AWS-powered prefill solution offers a customizable, accurate, and cost-saving approach. With 95% data accuracy, a 90% fill rate, and potential $10M annual savings, insurers can streamline underwriting, boost risk assessment, and gain a competitive edge.

Small and medium-sized businesses often embark on unrewarding insurance journeys. There are about 28 million such businesses in the US that require at least 4-5 types of insurance. Over 70% of them are either underinsured or have no insurance at all. One reason is that their road to insurance coverage can be long, complex, and unpredictable. While filling out commercial insurance applications, SMB owners face several complicated questions for which crucial information is either not readily available or poorly understood. Underwriters, however, need this information promptly to estimate the risks associated with extending coverage. This makes the overall commercial underwriting process extremely iterative, time-consuming, and labor-intensive.

For instance, business owners need to answer over 40 different questions when they apply for worker’s compensation insurance. In addition, it could take many weeks of constant emailing between insurance companies and businesses after submission! Such bottlenecks lead to poor customer experiences while significantly impacting the quote-to-bind ratio for insurers. Furthermore, over 20% of the information captured from businesses and agents is inaccurate – resulting in premium leakage and poor claims experience.

The emergence of data prefill – and the challenges ahead

Today, more insurers are eager to pre-populate their commercial underwriting applications by using public and proprietary data sources. The data captured from external sources help them precisely assess risks across insurance coverages, including Workers Compensation, General Liability, Business Property, and Commercial Auto. For example, insurers can explore company websites and external data sources like Google Maps, OpenCorporates, Yelp, Zomato, Trip Advisor, Instagram, Foursquare, Kompass, etc. These sources provide accurate details, such as year of establishment, industry class, hours of operation, workforce, physical equipment, construction quality, safety standards, and more.

However, despite the availability of several products that claim to have successfully prefilled underwriting data, insurance providers continue to grapple with challenges like evolving business needs and risks, constant changes in public data format, ground truth validation, and legal intricacies. Sources keep evolving over time both in terms of structure and data availability. Some even come with specific legal constraints. For instance, scraping is prohibited by many external websites. Moreover, the data prefill platform needs to fetch data from multiple sources, which requires proper source prioritization and validation.

Insurers have thus started to consider building custom white-box solutions that are configurable, scalable, efficient, and compliant.

Creating accurate, effortless, and fast commercial underwriting journeys

Next-generation data prefill platforms can empower business insurance providers to prefill underwriting information effortlessly and accurately. These custom-made platforms are powered by state-of-the-art data matching and extraction frameworks, a suite of advanced data science techniques, triangulation algorithms, and scalable architecture blueprints. The platform empowers underwriters to extract data directly from external sources with a high fill rate and great speed. Where the data is not directly available, ML classifiers help predict answers to underwriting questions with high accuracy.

Tiger Analytics has custom-built such AI-led underwriting data prefill solutions to support various commercial underwriting decisions for leading US-based worker’s compensation insurance providers. Our data prefill solution uses various AWS services such as AWS Lambda, S3, EC2, Elasticsearch, SageMaker, Glue, CloudWatch, RDS, and API Gateway, which ensures increased speed-to-market and scalability, with improvements gained through the incremental addition of each source. It is a highly customizable white-box solution built in line with Tiger’s philosophy of Open IP. Using AWS services allows the solution to be quickly and cost-effectively tweaked to cater to any changes in external source formats. Delivered as an AWS cloud-hosted solution, it uses an AWS Lambda-based architecture to enable scale and a state-of-the-art application orchestration engine to prefill data for commercial underwriting purposes.

Key benefits

  • Unparalleled accuracy of 95% on all the data provided by the platform
  • Over 90% fill rate
  • Significant cost savings of up to $10 million annually
  • Accelerated value creation by enabling insurers to start realizing value within 3-6 months

Insurers must focus on leveraging external data sources and state-of-the-art AI frameworks, data science models, and data engineering components to prefill applications. And with the right data prefill platform, insurers can improve the overall quote-to-bind ratio, assess risks accurately and stay ahead of the competition. 

REST API with AWS SageMaker: Deploying Custom Machine Learning Models https://www.tigeranalytics.com/perspectives/blog/rest-api-with-aws-sagemaker-deploying-custom-machine-learning-models/ Thu, 17 Sep 2020 11:22:56 +0000 https://www.tigeranalytics.com/blog/rest-api-with-aws-sagemaker-deploying-custom-machine-learning-models/ Learn how to deploy custom Machine Learning (ML) models using AWS SageMaker and REST API. Understand the steps involved, including setting up the environment, training models, and creating endpoints for real-time predictions, as well as why to integrate ML models with REST APIs for scalable deployment.

Introduction

AWS SageMaker is a fully managed machine learning service. It lets you build models using built-in algorithms, with native support for bring-your-own algorithms and ML frameworks such as Apache MXNet, PyTorch, SparkML, TensorFlow, and Scikit-Learn.

Why AWS SageMaker?

  • Developers and data scientists need not worry about infrastructure management or cluster utilization, and can focus on experimentation.
  • Supports the end-to-end machine learning workflow with integrated Jupyter notebooks, data labeling, hyperparameter optimization, and hosting scalable inference endpoints with autoscaling to handle millions of requests.
  • Provides standard machine learning models that are optimized to run against extremely large data in a distributed environment.
  • Supports multi-model training across multiple GPUs and leverages Spot Instances to lower training costs.

Note: You cannot use SageMaker’s built-in algorithms in all cases, especially when you have custom algorithms that require building custom containers.

This post will walk you through the process of deploying a custom machine learning model (bring-your-own-algorithms), which is trained locally, as a REST API using SageMaker, Lambda, and Docker. The steps involved in the process are shown in the image below-

Figure: AWS SageMaker deployment steps

The process consists of five steps-

  • Step 1: Building the model and saving the artifacts.
  • Step 2: Defining the server and inference code.
  • Step 3: Building a SageMaker Container.
  • Step 4: Creating Model, Endpoint Configuration, and Endpoint.
  • Step 5: Invoking the model using Lambda with API Gateway trigger.

Step 1: Building the Model and Saving the Artifacts

First, we build the model and serialize the object, which is then used for prediction. In this post, we are using a simple linear regression (i.e., one independent variable). Once you serialize the Python object to a pickle file, save that artifact (the pickle file) in tar.gz format and upload it to an S3 bucket.

Figure: Building the model and saving the artifacts
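A minimal sketch of this step is shown below. The toy training data, file names, bucket, and prefix are placeholders; the intent is only to show the pickle, tar.gz, and S3 upload flow.

    import pickle
    import tarfile
    import boto3
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy one-variable linear regression standing in for your real training data
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.0, 4.0, 6.0, 8.0])
    model = LinearRegression().fit(X, y)

    # Serialize the fitted model object
    with open("linear_regx_model.pkl", "wb") as f:
        pickle.dump(model, f)

    # SageMaker expects model artifacts packaged as a tar.gz archive in S3
    with tarfile.open("model.tar.gz", "w:gz") as tar:
        tar.add("linear_regx_model.pkl")

    # Upload to the artifacts bucket (bucket name and prefix are placeholders)
    boto3.client("s3").upload_file("model.tar.gz", "my-sagemaker-artifacts", "linear-regx/model.tar.gz")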

Step 2: Defining the Server and Inference Code

When an endpoint is invoked, SageMaker interacts with the Docker container, which runs the inference code for hosting services, processes the request, and returns the response. Containers have to implement a web server that responds to /invocations and /ping on port 8080.

The inference code in the container receives GET requests on /ping from the SageMaker infrastructure and should respond with an HTTP 200 status code and an empty body, which indicates that the container is ready to accept inference requests at the /invocations endpoint.

Code: https://gist.github.com/NareshReddyy/9f1f9ab7f6031c103a0392d52b5531ad

To make the model REST API enabled, you need Flask, a WSGI (Web Server Gateway Interface) application framework; Gunicorn, the WSGI server; and nginx, the reverse proxy and load balancer.

Code: https://github.com/NareshReddyy/Sagemaker_deploy_own_model.git
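As a rough illustration of the /ping and /invocations contract described above (not the exact code in the linked repository), a minimal predictor could look like the following. The model path follows SageMaker's convention of extracting model.tar.gz into /opt/ml/model, and the single-column CSV payload is an assumption.

    import io
    import pickle

    import flask
    import pandas as pd

    app = flask.Flask(__name__)

    with open("/opt/ml/model/linear_regx_model.pkl", "rb") as f:  # SageMaker extracts model.tar.gz here
        model = pickle.load(f)

    @app.route("/ping", methods=["GET"])
    def ping():
        # Health check: HTTP 200 with an empty body tells SageMaker the container is ready
        return flask.Response(response="", status=200, mimetype="application/json")

    @app.route("/invocations", methods=["POST"])
    def invocations():
        # Assume a header-less, single-column CSV payload; adapt parsing to your own contract
        payload = flask.request.data.decode("utf-8")
        data = pd.read_csv(io.StringIO(payload), header=None)
        predictions = model.predict(data.values)
        return flask.Response(response=",".join(str(p) for p in predictions),
                              status=200, mimetype="text/csv")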

Step 3: Building a SageMaker Container

SageMaker uses Docker containers extensively. You can put the scripts, algorithms, and inference code for your models in these containers, along with the runtime, system tools, libraries, and other code needed to deploy your models, which provides the flexibility to run your model your way. The Docker images are built from scripted instructions provided in a Dockerfile.

Figure: Building a SageMaker container

The Dockerfile describes the image that you want to build, with a complete installation of the system that you want to run. You can use a standard Ubuntu installation as a base image and run the normal tools to install the things needed by your inference code. You will have to copy the folder (Linear_regx) containing nginx.conf, predictor.py, serve, and wsgi.py to /opt/code and make it your working directory.

The Amazon SageMaker Containers library places the scripts that the container will run in the /opt/ml/code/ directory.

Code: https://gist.github.com/NareshReddyy/2aec71abf8aca6bcdfb82052f62fbc23
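For reference, a hedged sketch of such a Dockerfile is shown below. The base image, package list, and paths mirror the description above but are assumptions rather than the exact contents of the linked gist.

    FROM ubuntu:18.04

    # Python, the web stack and the libraries the inference code needs
    RUN apt-get update && apt-get install -y --no-install-recommends \
            python3 python3-pip nginx ca-certificates && \
        rm -rf /var/lib/apt/lists/*
    RUN pip3 install --no-cache-dir flask gunicorn pandas scikit-learn

    # Copy the folder containing nginx.conf, predictor.py, serve and wsgi.py
    COPY Linear_regx /opt/code
    RUN chmod +x /opt/code/serve

    # SageMaker starts the hosting container as "docker run <image> serve",
    # so the serve script must be executable and on the PATH
    ENV PATH="/opt/code:${PATH}"
    WORKDIR /opt/code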

To build a local image, use the following command:

docker build -t <image-name> .

Create a repository in AWS ECR and tag the local image to that repository.

The repository has the following structure:

<account number>.dkr.ecr.<region>.amazonaws.com/<image name>:<tag>

docker tag <image-name> <repository-name>:<image-tag>

Before pushing the image, you have to configure your AWS CLI and authenticate with ECR, for example by running aws ecr get-login --no-include-email (AWS CLI v1).

Once you execute that command, you will see an output like docker login -u AWS -p xxxxx. Run this command to log in to ECR.

docker push <repository name>:<image tag>

Step 4: Creating Model, Endpoint Configuration, and Endpoint

Models can be created using the API or the AWS Management Console. Provide the model name and an IAM role.

Under the Container definition, choose Provide model artifacts and inference image location and provide the S3 location of the artifacts and Image URI.

After creating the model, create Endpoint Configuration and add the created model.

When you have multiple models to host, instead of creating numerous endpoints, you can choose Use multiple models to host them under a single endpoint (this is also a cost-effective method of hosting).

You can change the instance type and instance count and enable Elastic Inference (EI) based on your requirements. You can also enable data capture, which saves prediction requests and responses in an S3 bucket, providing options to set alerts for when there are deviations in model quality, such as data drift.

Finally, create the endpoint using the existing endpoint configuration.
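The same console steps can also be scripted. A rough boto3 equivalent is sketched below; the model name, role ARN, ECR image URI, S3 path, and instance type are placeholders.

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_model(
        ModelName="linear-regx",
        ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",        # placeholder role
        PrimaryContainer={
            "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/linear-regx:latest",  # ECR image URI
            "ModelDataUrl": "s3://my-sagemaker-artifacts/linear-regx/model.tar.gz",      # artifact from Step 1
        },
    )

    sm.create_endpoint_config(
        EndpointConfigName="linear-regx-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "linear-regx",
            "InstanceType": "ml.t2.medium",
            "InitialInstanceCount": 1,
        }],
    )

    sm.create_endpoint(
        EndpointName="linear-regx-endpoint",
        EndpointConfigName="linear-regx-config",
    )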

Step 5: Invoking the Model Using Lambda with API Gateway Trigger

Create Lambda with API Gateway trigger.

In the API Gateway trigger configuration, add a REST API to your Lambda function to create an HTTP endpoint that invokes the SageMaker endpoint.

In the function code, read the request received from API Gateway, pass the input to invoke_endpoint, and capture and return the response to API Gateway.
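A minimal sketch of such a handler is shown below, assuming an API Gateway proxy integration and a single-value CSV payload; the endpoint name and request shape are placeholders.

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    ENDPOINT_NAME = "linear-regx-endpoint"  # placeholder

    def lambda_handler(event, context):
        # With a proxy integration, API Gateway puts the request payload in event["body"]
        body = json.loads(event["body"])
        payload = str(body["input"])        # e.g. {"input": 5} becomes the CSV string "5"

        response = runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="text/csv",
            Body=payload,
        )
        prediction = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}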

When you open API Gateway, you can see the API created by the Lambda function. You can now create the required method (POST), integrate it with the Lambda function, and test it by providing input in the request body and checking the output.

You can test your endpoint either by using SageMaker notebooks or Lambda.

Conclusion

SageMaker gives you a wide variety of options to build, train, and deploy complex ML models in an easy, highly scalable, and cost-effective way. Following the above illustration, you can deploy a custom machine learning model as a serverless REST API using SageMaker, Lambda, and API Gateway.

Building Data Engineering Solutions: A Step-by-Step Guide with AWS https://www.tigeranalytics.com/perspectives/blog/data-engineering-implementation-using-aws/ Thu, 14 Feb 2019 18:10:43 +0000 https://www.tigeranalytics.com/?p=7076 In this article, delve into the intricacies of an AWS-based Analytics pipeline. Learn to apply this design thinking to tackle similar challenges you might encounter and in order to streamline data workflows.

Introduction:

Lots of small to midsize companies use Analytics to understand business activity, lower their costs, and increase their reach. Some of these companies may intend to build and maintain an Analytics pipeline but change their minds when they see how much money and technical know-how it takes. For any enterprise, data is an asset, and they are unwilling to share this asset with external players lest they risk their market advantage. To extract maximum value from intelligence harvesting, enterprises need to build and maintain their own data warehouses and surrounding infrastructure.

The Analytics field is buzzing with talks on applications related to Machine Learning, which have complex requirements like storing and processing unstructured streaming data. Instead of pushing themselves towards advanced analytics, companies can extract a lot of value simply by using good reporting infrastructure. This is because currently a lot of SME activity is still at the batch data level. From an infrastructure POV, cloud players like Amazon Web Services (AWS) and Microsoft Azure have taken away a lot of complexity. This has enabled companies to implement an accurate, robust reporting infrastructure (more or less) independently and economically. This article is about a specific lightweight implementation of Data Engineering using AWS, which would be perfect for an SME. By the time you finish reading this, you will:

1) Understand the basics of a simple Data Engineering pipeline
2) Know the details of a specific kind of AWS-based Analytics pipeline
3) Apply this design thinking to a similar problem you may come across

Analytics Data Pipeline:

SMEs have their business activity data stored in different places. Getting it all together so that a broad picture of the business’s health emerges is one of the big challenges in analytics. Gathering data from sources, storing it in a structured and accurate manner, then using that data to create reports and visualizations can give SMEs relatively large gains. From a process standpoint, this is what it might look like:

Figure 1: Simple Data Pipeline

But from a business activity effort standpoint, it’s more like:

Figure 2: Business Activity involved in a Data Pipeline

Here’s what’s interesting: although the first two components of the process consume most of the time and effort, from a value chain standpoint, value is realized in the Analyze component.

Figure 3: Analytics Value Chain

The curiously inverse relationship between effort and value keeps SMEs wondering if they will realize the returns they expect on their investment and minimize costs. Analytics today might seem to be all about Machine Learning and cutting-edge technology, but SMEs can realize a lot of value by using relatively simple analytics like:

1) Time series graph on business activity for leadership
2) Bar graph visualization for sales growth over the years
3) For the Sales team: a refreshed, filterable dashboard showing the top ten clients over a chosen time period
4) For the Operations team: an email blast every morning at eight depicting business activity expense over a chosen time period

Many strategic challenges that SMEs face, like business reorganization, controlling operating costs, and crisis management, require accurate data to solve. Having an Analytics data pipeline in the cloud allows enterprises to take cost-optimized, data-driven decisions. These can include both strategic decision-making for the C-Suite and business-as-usual metrics for the Operations and Sales teams, allowing executives to track their progress. In a nutshell, an Analytics data pipeline makes company information accessible to executives. This is valuable in itself because it enables metrics monitoring (including derived benefits like forecasting predictions). There you have it, folks: a convincing case for SMEs to experiment with building an in-house Analytics pipeline.

Mechanics of the pipeline:

Before we get into vendors and the value they bring, here’s something for you to think about: there are as many ways to build an Analytics pipeline as there are stars in the sky. The challenge here is to create a data pipeline that is hosted on a secure cloud infrastructure. It’s important to use cloud-native compute and storage components so that the infrastructure is easy to build and operate for an SME.

Usually, source data for SMEs are in the following formats:

1) Payment information stored in Excel
2) Business activity information coming in as API
3) Third-party interaction exported as a .CSV to a location like S3

Using AWS as a platform enables SMEs to leverage the serverless compute feature of AWS Lambda when ingesting source data into an Aurora Postgres RDBMS. Lambda supports many programming interfaces, including Python, a widely used language. Back in 2016-17, the maximum runtime for Lambda was capped at five minutes, which was not nearly enough for ETL. Two years later, the limit was increased to 15 minutes. This is still too little time to execute most ETL jobs, but enough for the batch data ingestion requirements of SMEs.

Lambda is usually hosted within a private subnet in the enterprise Virtual Private Cloud (VPC), but it can communicate with third-party source systems through a Network Address Translator (NAT) and Internet Gateways (IG). Python’s libraries (like Pandas) make tabular data quick and easy to process. Once processed, the output dataframe from Lambda is stored in a table in the Aurora Postgres database. The Aurora prefix denotes the AWS flavor of the Postgres database offering. It makes sense to choose a vanilla relational database because most data is in Excel-type rows-and-columns format anyway, and reporting engines like Tableau and other BI tools work well with RDBMS engines.
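To illustrate this ingestion pattern, here is a hedged sketch of a Lambda handler that loads a batch CSV from S3 into Aurora Postgres with Pandas. The bucket, key, table, and connection string are placeholders; in practice the dependencies (pandas, SQLAlchemy, psycopg2) would be packaged with the function and the credentials read from environment variables or Secrets Manager.

    import io

    import boto3
    import pandas as pd
    from sqlalchemy import create_engine

    DB_URI = "postgresql+psycopg2://app_user:*****@aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com:5432/analytics"

    def lambda_handler(event, context):
        bucket = event["bucket"]  # e.g. "sme-landing-zone"
        key = event["key"]        # e.g. "payments/2019-02-01.csv"

        # Pull the batch file from S3 into a dataframe
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(io.BytesIO(obj["Body"].read()))

        # Light transformations before loading (illustrative)
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        df["load_ts"] = pd.Timestamp.utcnow()

        # Append into the Aurora Postgres staging table
        engine = create_engine(DB_URI)
        df.to_sql("payments_stg", engine, schema="staging", if_exists="append", index=False)
        return {"rows_loaded": len(df)}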

Mapping the components to the process outlined in Figure 1, we get:

Figure 4: Revisiting Analytics pipeline

AWS Architecture:

Let’s take a deeper look into AWS architecture.

Figure 5: AWS-based batch data processing architecture using Serverless Lambda function and RDS database

Figure 5 adds more details to the AWS aspects of a Data Engineering pipeline. Operating on AWS requires companies to share security responsibilities such as:

1) Hosting AWS components with a VPC
2) Identifying public and private subnets
3) Ensuring IG and NAT Gateways can allow components hosted within private subnets to communicate with the internet
4) Provisioning the Database as publicly not accessible
5) Setting aside a dedicated EC2 to route web traffic to this publicly inaccessible database
6) Provisioning security groups for EC2’s public subnet (Lambda in private subnet and Database in DB subnet)
7) Provisioning subnets for app and DB tier in two different Availability Zones (AZ) to ensure (a) DB tier provisioning requirements are met, and (b) Lambda doesn’t run out of IPs when triggered

Running the pipeline:

New data is ingested by timed invocation of Lambda using CloudWatch rules. CloudWatch monitors AWS resources and invokes services at set times using cron expressions. CloudWatch can also be used, much like a SQL Server job agent, to trigger Lambda events. This accommodates activities with different frequencies like:

1) Refreshing sales activity (daily)
2) Operating Costs information (weekly)
3) Payment activity (biweekly)
4) Tax information (monthly)

CloudWatch can deploy a specific Python script (that takes data from the source, does necessary transformations, and loads it onto a table with known structure) to Lambda once the respective source file or data refresh frequency is known.

Moving on to Postgres, its materialized view and SQL stored procedure features (which allow further processing) can also be invoked using a combination of Lambda and CloudWatch. This workflow is helpful for propagating base data, after a refresh, into denormalized, wide tables that store company-wide sales and operations information.
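A minimal sketch of that refresh step: a Lambda function, invoked by a CloudWatch rule such as cron(0 8 * * ? *), that refreshes a materialized view. The view name, connection details, and the use of psycopg2 are illustrative assumptions.

    import psycopg2

    # Placeholders: in practice these come from environment variables or Secrets Manager
    CONN_PARAMS = dict(
        host="aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
        port=5432,
        dbname="analytics",
        user="app_user",
        password="*****",
    )

    def lambda_handler(event, context):
        view_name = event.get("view", "reporting.sales_summary_mv")  # illustrative view name from trusted config
        conn = psycopg2.connect(**CONN_PARAMS)
        try:
            with conn, conn.cursor() as cur:  # the outer context commits the transaction
                cur.execute(f"REFRESH MATERIALIZED VIEW {view_name};")
        finally:
            conn.close()
        return {"refreshed": view_name}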

Figure 6: An example of data flow for building aggregate metrics

Once respective views are refreshed with the latest data, we can connect to the Database using a BI tool for reporting and analysis. It’s important to remember that because we are operating on the AWS ecosystem, the Database must be provisioned as publicly inaccessible and be hosted within a private subnet. Users should only be able to reach it through a web proxy, like nginx or httpd, that is set up on an EC2 on the public subnet to route traffic within the VPC.

Figure 7: BI Connection flow to DB

Access to data can be controlled at the Database level (by granting or denying access to a specific schema) and at the connection level (by whitelisting specific IPs to allow connections and denying connect access by default).

Accuracy is the name of the game:

So you have a really secure and robust AWS architecture, a well-tested Python code for Lambda executions, and a not-so-cheap BI tool subscription. Are you all set? Not really. You might just miss the bus if inaccuracy creeps into the tables during data refresh. A dashboard is only as good as the accuracy of the numbers it displays. Take extra care to ensure that the schema tables you have designed include metadata columns required to identify inaccurate and duplicate data.

Conclusion:

In this article, we took a narrow-angle approach to a specific Data Engineering example. We saw the Effort vs Return spectrum in the Analytics value chain and the value that can be harvested by taking advantage of the available Cloud options. We noted the value in empowering C-suite leaders and company executives with descriptive interactive dashboards.

We looked at building a specific AWS cloud-based Data Engineering pipeline that is relatively uncomplicated and can be implemented by SMEs. We went over the architecture and its different components and briefly touched on the elements of running a pipeline and finally, on the importance of accuracy in reporting and analysis.

Although we saw one specific implementation in this article, the attempt here is to convey the idea that getting value out of an in-house Analytics pipeline is easier than it used to be, say, a decade ago. With open-source and cloud tools here to make the journey easy, it doesn’t take long to explore and exploit the value hidden in data.

