Merge branch 'feature/reInvent2021_data_gen' into develop

aws-samples · Nov 29, 2021 · 0d176f7 · 0d176f7
2 parents 68013da + fee146f
commit 0d176f7

Showing 24 changed files with 9,863 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -3,8 +3,15 @@ This project is a simple ISO20022 message generator that generates pacs.008 xml
 project is called [rapide](https://dictionary.cambridge.org/dictionary/french-english/rapide), a French word. It translates to 
 [swift or fast or quick](https://dictionary.cambridge.org/dictionary/english-french/swift) or swift, a swallow like bird. 
 
-The Java package names start with rapide at the root. You will notice references to rapide in different places such as 
-name of the wrapper bash shell script for the CLI tool.
+The Java package names start with `rapide` at the root. You will notice references to `rapide` in different places, such as 
+the name of the wrapper bash shell script for the CLI tool.  
+
+This repository contains source code for the ISO20022 message generator tool (`rapide`) and machine learning prototype models. This 
+page has instructions for building and using `rapide`. The code for the machine learning prototype models, in the form of Python notebooks, 
+is in the [ml-models](ml-models) directory. The [readme](ml-models/README.md) in that directory includes instructions for using 
+the Python notebooks to build, train, deploy and use the trained models for realtime and batch inference. 
+These models help predict whether an ISO20022 pacs.008 XML message will be successfully processed (Success) or fail 
+processing (Failure), leading to exception processing.
 
 ## Table of Contents
 

diff --git a/ml-models/README.md b/ml-models/README.md
@@ -0,0 +1,123 @@
+# ML Prototypes - ML Models for ISO20022 PACS008 Message Processing Prediction
+
+This directory contains machine learning model prototypes that help in predicting whether an ISO20022 pacs.008 
+XML message will be successfully processed (`Success`) or fail processing (`Failure`) leading to exception processing. 
+Amazon SageMaker built-in machine learning algorithms XGBoost and Linear Learner are used to train two different models.
+Amazon SageMaker Autopilot is used to demonstrate automated machine learning (AutoML) that reduces the effort required to 
+build, train, tune and deploy a model.  
+
+The prototype models' prediction takes the form of a tuple [(1=Success, 0=Failure), Probability Score], where the probability score is the probability of the predicted 
+outcome.
+
+## Directory Breakdown
+The project consists of Python notebooks in the following directories:
+1. [pacs008/synthetic-data](pacs008/synthetic-data): This directory contains two notebooks for generating synthetic data for the ML prototype. 
+   * iso20022_lei_bic_datasets.ipynb: This notebook generates a fake BIC database and a fake LEI database, each as a
+     csv file. Read the notebook for details. The two generated csv files can be used for the ML prototype. These csv 
+     files were used by the ISO20022 message generator tool to generate the pacs.008 XML messages for this prototype.
+   * gen_pacs008_synthetic_dataset.ipynb: This notebook creates synthetic raw and labeled datasets using ISO20022 pacs.008 
+     XML messages generated by the [ISO20022 Message Generator tool](../../iso20022-message-generator).
+2. [pacs008/automl](pacs008/automl): This directory contains notebooks that use the Amazon SageMaker Autopilot service to train ML models:
+   * pacs008_automl_model_training.ipynb: Prototype model using the Amazon SageMaker Autopilot service. It uses labeled 
+     synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb) 
+     to train several models with SageMaker Autopilot and then selects the best-performing model for deployment.
+   * pacs008_automl_model_deployment.ipynb: This notebook deploys the best-performing ML model from the SageMaker Autopilot 
+     training job.
+   * automl_batch_transform_example.ipynb: Demonstrates batch inference using the Amazon SageMaker Batch Transform service.
+3. [pacs008/xgboost](pacs008/xgboost): This directory contains notebooks that use the Amazon SageMaker XGBoost built-in algorithm to train an ML model.
+   * pacs008_xgboost_inference_pipeline.ipynb: A notebook that trains an ML model using the Amazon SageMaker XGBoost built-in algorithm.
+     After training, the model is deployed to an Amazon SageMaker Inference Endpoint. It uses an Amazon SageMaker Inference Pipeline 
+     to deploy a scikit-learn container for data transformation and the XGBoost model for inference.
+   * pacs008_xgboost_local.ipynb: This notebook demonstrates data analysis and feature engineering of a text feature 
+     (`InstrForNxtAgt`). The text feature is transformed into a numeric representation using text preprocessing techniques 
+     such as word frequency count, term frequency-inverse document frequency (TF-IDF) and a Multinomial Naive Bayes model to 
+     understand how text can help in predictions. The approach was used to write custom scikit-learn transformers that transform
+     text into numeric features (feature engineering).
+   * xgb_batch_transform_example.ipynb: A notebook that demonstrates the use of Amazon SageMaker Batch Transform for batch inference 
+     using the XGBoost model.
+4. [pacs008/linear-learner](pacs008/linear-learner): This directory contains notebooks that use the Amazon SageMaker Linear Learner built-in algorithm to train ML models.
+   * pacs008_linear_learner_inference_pipeline.ipynb: A notebook that trains an ML model using the Amazon SageMaker Linear Learner built-in algorithm.
+     After training, the model is deployed to an Amazon SageMaker Inference Endpoint. It uses an Amazon SageMaker Inference Pipeline 
+     to deploy a scikit-learn container for data transformation and the Linear Learner model for inference.
+   * ll_batch_transform_example.ipynb: A notebook that demonstrates the use of Amazon SageMaker Batch Transform for batch inference 
+     using a Linear Learner model.
+5. [pacs008/sklearn-transformers](pacs008/sklearn-transformers): This directory contains custom scikit-learn transformers that are used in data preprocessing and 
+   feature engineering tasks. These transformers are deployed in scikit-learn containers in the SageMaker Inference Pipeline
+   that is used to deploy the trained XGBoost and Linear Learner models.
+   * pacs008_sklearn_featurizer.py: Implements data preprocessing and feature engineering using a scikit-learn pipeline and 
+     ColumnTransformer. It transforms and prepares data before training jobs and before the features of a pacs.008 XML 
+     message are used for inference, either realtime inference via an Inference Endpoint or batch inference via Batch Transform.
+   * pacs008_sklearn_transformer.py: Implements scikit-learn custom transformers.
+
+These notebooks should be run using a Python3 Jupyter notebook kernel.  This can be the `Python 3 (Data Science)` kernel in 
+SageMaker Studio or the `conda_python3` kernel in a SageMaker notebook instance.
+
+## Machine Learning Prototype development approach
+The diagram below captures the ML lifecycle used to develop the ML prototype:
+
+![ISO20022 CBPR+ Message Processing Predictor](docs/images/iso20022-cbpr-ml-lifecycle.png)
+
+To get started with the ML prototype model, follow the steps below:   
+
+**Note**: The GitHub repository includes [iso20022-raw-messages.tar.gz](pacs008/synthetic-data/iso20022-data/iso20022-raw-messages.tar.gz) 
+to get you started quickly. If you want to use that set of raw ISO20022 pacs.008 XML messages, you can skip steps 1 and 2.   
+
+1. Generate ISO20022 pacs.008 XML messages using the [ISO20022 Message Generator tool](../../iso20022-message-generator).
+   ```bash
+   rapide-iso20022 -n 50000 -d messages
+   ```
+
+2. Gzip the messages directory and upload it to an Amazon SageMaker notebook (either an Amazon SageMaker Studio notebook or an Amazon 
+   SageMaker Notebook instance) in the `iso20022-message-generator/ml-models/pacs008/synthetic-data/iso20022-data` directory, 
+   i.e. under the directory where you cloned this GitHub repository. 
+
+3. Now you have raw ISO20022 pacs.008 XML messages. The next step is to generate synthetic raw and labeled datasets for 
+   use in ML model training. To do this, use the [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb) notebook. 
+   These raw datasets are further split into training and test datasets in each of
+   the model training notebooks.
+
+4. You can use the Amazon SageMaker Autopilot, Amazon SageMaker XGBoost or Amazon SageMaker Linear Learner notebooks, or all of them, 
+   to train and deploy an ML model. After deployment, you can test it by sending a test message to the Inference Endpoint. 
+   Training all models will allow you to compare and evaluate model performance using each of the approaches.
+
+5. You can also use the SageMaker Batch Transform notebooks to perform batch inference using the trained model and 
+   evaluate the model's performance.
+
+### Train, Deploy and Test an ML model using Amazon SageMaker Autopilot 
+Use the included notebooks in the following order:  
+
+1. Train ML models using the `pacs008_automl_model_training.ipynb` notebook. It prototypes a model using the Amazon SageMaker Autopilot
+   service. It uses labeled synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb) 
+   to train several models with SageMaker Autopilot and then selects the best-performing model for deployment.
+2. Deploy and test the trained model using the `pacs008_automl_model_deployment.ipynb` notebook. 
+3. Perform batch inference with the trained model using `automl_batch_transform_example.ipynb`. This notebook evaluates 
+   the model by computing a confusion matrix. You can compare the Autopilot-generated model to the other two models, which are feature engineered by hand and then trained.
+
+
+### Train, Deploy and Test an ML model using Amazon SageMaker XGBoost 
+Use the included notebooks in the following order:  
+
+1. Train an ML model using the `pacs008_xgboost_inference_pipeline.ipynb` notebook, which prototypes a model using the Amazon SageMaker XGBoost
+   built-in algorithm. It uses labeled synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb) 
+   to train the model.
+2. Deploy and test the trained model using the same `pacs008_xgboost_inference_pipeline.ipynb` notebook. The training notebook 
+   also deploys the trained model using SageMaker Inference Pipeline.
+3. Perform batch inference with the trained model using `xgb_batch_transform_example.ipynb`. This notebook evaluates the
+   model by computing a confusion matrix. You can compare the XGBoost trained model to the Autopilot and Linear Learner models.
+   The Linear Learner model is feature engineered by hand and then trained using the same feature engineering code.
+4. You can read and execute the `pacs008_xgboost_local.ipynb` notebook. It runs locally (does not use SageMaker XGBoost training). 
+   The notebook demonstrates feature engineering for text features using scikit-learn transformers and text preprocessing, 
+   and checks along the way whether new features derived from the text feature help improve 
+   predictions.
+
+### Train, Deploy and Test an ML model using Amazon SageMaker Linear Learner 
+Use the included notebooks in the following order:  
+
+1. Train an ML model using the `pacs008_linear_learner_inference_pipeline.ipynb` notebook, which prototypes a model using the Amazon SageMaker Linear Learner
+   built-in algorithm. It uses labeled synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb) 
+   to train the model.
+2. Deploy and test the trained model using the same `pacs008_linear_learner_inference_pipeline.ipynb` notebook. The training notebook 
+   also deploys the trained model using SageMaker Inference Pipeline.
+3. Perform batch inference with the trained model using `ll_batch_transform_example.ipynb`. This notebook evaluates the
+   model by computing a confusion matrix. You can compare the Linear Learner trained model to the Autopilot and XGBoost models.
+   As mentioned, the XGBoost and Linear Learner models are feature engineered by hand and then trained using the same feature engineering code.
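
The batch transform notebooks described in `ml-models/README.md` evaluate each model by computing a confusion matrix from predictions of the form [(1=Success, 0=Failure), Probability Score]. The following is a minimal sketch of that evaluation step, not the notebooks' actual code: the function name `confusion_matrix`, the sample labels and the sample scores are all hypothetical, and it assumes the predictions have already been parsed into Python tuples.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count outcomes for binary labels where 1 = Success and 0 = Failure."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # Success correctly predicted as Success
        "fn": counts[(1, 0)],  # Success wrongly predicted as Failure
        "fp": counts[(0, 1)],  # Failure wrongly predicted as Success
        "tn": counts[(0, 0)],  # Failure correctly predicted as Failure
    }


# Hypothetical ground-truth labels and model outputs (label, probability score).
actual = [1, 0, 1, 1, 0, 0]
predictions = [(1, 0.93), (0, 0.81), (0, 0.56), (1, 0.77), (1, 0.62), (0, 0.88)]

# Keep only the predicted label; the score is the probability of that label.
predicted = [label for label, _score in predictions]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
print(cm, round(accuracy, 3))
```

Computing the same four counts for the Autopilot, XGBoost and Linear Learner models on one shared test set gives a like-for-like basis for the comparison the README suggests.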
diff --git a/ml-models/docs/images/iso20022-cbpr-ml-lifecycle.png b/ml-models/docs/images/iso20022-cbpr-ml-lifecycle.png