Merge branch 'feature/reInvent2021_data_gen' into develop
sandwi committed Nov 29, 2021
2 parents 68013da + fee146f commit 0d176f7
Showing 24 changed files with 9,863 additions and 15 deletions.
11 changes: 9 additions & 2 deletions README.md
@@ -3,8 +3,15 @@ This project is a simple ISO20022 message generator that generates pacs.008 xml
project is called [rapide](https://dictionary.cambridge.org/dictionary/french-english/rapide), a French word. It translates to
[swift, fast or quick](https://dictionary.cambridge.org/dictionary/english-french/swift); a swift is also a swallow-like bird.

The Java package names start with `rapide` at the root. You will notice references to `rapide` in different places, such as
the name of the wrapper Bash shell script for the CLI tool.

This repository contains the source code for the ISO20022 message generator tool (`rapide`) and for machine learning prototype models. This
page has instructions for building and using `rapide`. The code for the machine learning prototype models, in the form of Python notebooks,
is in the [ml-models](ml-models) directory. The [readme](ml-models/README.md) in that directory includes instructions for using the
Python notebooks to build, train, deploy and use a trained model for realtime and batch inference.
These models help predict whether an ISO20022 pacs.008 XML message will be processed successfully (Success) or will fail
processing (Failure), leading to exception processing.

## Table of Contents

123 changes: 123 additions & 0 deletions ml-models/README.md
@@ -0,0 +1,123 @@
# ML Prototypes - ML Models for ISO20022 PACS008 Message Processing Prediction

This directory contains machine learning model prototypes that help in predicting whether an ISO20022 pacs.008
XML message will be successfully processed (`Success`) or fail processing (`Failure`) leading to exception processing.
The Amazon SageMaker built-in machine learning algorithms XGBoost and Linear Learner are used to train two different models.
Amazon SageMaker Autopilot is used to demonstrate automated machine learning (AutoML), which reduces the effort required to
build, train, tune and deploy a model.

Each prototype model returns its prediction as a tuple [(1=Success, 0=Failure), Probability Score], where the first element is the
predicted label and the probability score is the probability of the predicted outcome.
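
A minimal sketch of how a caller might interpret such a prediction, assuming the model response has already been parsed into a label and a score. The `Prediction` class and the `interpret` helper are hypothetical names used for illustration; they are not part of the notebooks.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """Hypothetical container for one model prediction."""

    label: int          # 1 = Success, 0 = Failure
    probability: float  # probability of the predicted outcome


def interpret(prediction: Prediction) -> str:
    """Turn a (label, probability) pair into a readable string."""
    if prediction.label not in (0, 1):
        raise ValueError(f"unexpected label: {prediction.label}")
    if not 0.0 <= prediction.probability <= 1.0:
        raise ValueError(f"probability out of range: {prediction.probability}")
    outcome = "Success" if prediction.label == 1 else "Failure"
    return f"{outcome} (probability {prediction.probability:.2f})"
```

For example, `interpret(Prediction(0, 0.8))` describes a message that is predicted to fail processing, with the score expressing the probability of that failure rather than of success.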

## Directory Breakdown
The project consists of Python notebooks in the following directories:
1. [pacs008/synthetic-data](pacs008/synthetic-data): This directory contains two notebooks for generating synthetic data for the ML prototype.
    * iso20022_lei_bic_datasets.ipynb: This notebook generates a fake BIC database as a CSV file and an LEI database, also as a
      CSV file. Read the notebook for details. The two generated CSV files can be used for the ML prototype. This prototype
      used them when generating pacs.008 XML messages with the ISO20022 message generator tool.
    * gen_pacs008_synthetic_dataset.ipynb: This notebook creates synthetic raw and labeled datasets using ISO20022 pacs.008
      XML messages generated by the [ISO20022 Message Generator tool](../../iso20022-message-generator).
2. [pacs008/automl](pacs008/automl): This directory contains notebooks that use the Amazon SageMaker Autopilot service to train ML models:
    * pacs008_automl_model_training.ipynb: Prototypes a model using the Amazon SageMaker Autopilot service. It uses labeled
      synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb)
      to train several models with SageMaker Autopilot and then selects the best-performing model for deployment.
    * pacs008_automl_model_deployment.ipynb: This notebook deploys the best-performing ML model from the SageMaker Autopilot
      training job.
    * automl_batch_transform_example.ipynb: Demonstrates batch inference using the Amazon SageMaker Batch Transform service.
3. [pacs008/xgboost](pacs008/xgboost): This directory contains notebooks that use the Amazon SageMaker XGBoost built-in algorithm to train an ML model.
    * pacs008_xgboost_inference_pipeline.ipynb: A notebook that trains an ML model using the Amazon SageMaker XGBoost built-in algorithm.
      After training, the model is deployed to an Amazon SageMaker Inference Endpoint. It uses an Amazon SageMaker Inference Pipeline
      to deploy a scikit-learn container for data transformation and the XGBoost model for inference.
    * pacs008_xgboost_local.ipynb: This notebook demonstrates data analysis and feature engineering of a text feature
      (`InstrForNxtAgt`). The text feature is transformed into a numeric representation using text preprocessing techniques
      such as word frequency counts, term frequency-inverse document frequency (TF-IDF) and a Multinomial Naive Bayes model to
      understand how text can help in predictions. The approach was used to write custom scikit-learn transformers that transform
      text into numeric features (feature engineering).
    * xgb_batch_transform_example.ipynb: A notebook that demonstrates the use of Amazon SageMaker Batch Transform for batch inference
      using the XGBoost model.
4. [pacs008/linear-learner](pacs008/linear-learner): This directory contains notebooks that use the Amazon SageMaker Linear Learner built-in algorithm to train ML models.
    * pacs008_linear_learner_inference_pipeline.ipynb: A notebook that trains an ML model using the Amazon SageMaker Linear Learner built-in algorithm.
      After training, the model is deployed to an Amazon SageMaker Inference Endpoint. It uses an Amazon SageMaker Inference Pipeline
      to deploy a scikit-learn container for data transformation and the Linear Learner model for inference.
    * ll_batch_transform_example.ipynb: A notebook that demonstrates the use of Amazon SageMaker Batch Transform for batch inference
      using a Linear Learner model.
5. [pacs008/sklearn-transformers](pacs008/sklearn-transformers): This directory contains custom scikit-learn transformers that are used in data preprocessing and
   feature engineering tasks. These transformers are deployed in scikit-learn containers in the SageMaker Inference Pipeline
   that is used to deploy the trained XGBoost and Linear Learner models.
    * pacs008_sklearn_featurizer.py: Implements data preprocessing and feature engineering using a scikit-learn pipeline and
      ColumnTransformer. It transforms and prepares data before training jobs and before the features in a pacs.008 XML
      message are used for inference, either realtime inference via an Inference Endpoint or batch inference via Batch Transform.
    * pacs008_sklearn_transformer.py: Implements custom scikit-learn transformers.
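
The text-to-numeric step mentioned for `pacs008_xgboost_local.ipynb` can be illustrated with a small, dependency-free sketch of textbook TF-IDF weighting. This is not the notebooks' implementation: the function names are hypothetical, the tokenizer is a plain whitespace split, and the weighting is the unsmoothed `tf * ln(N / df)` form, whereas scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf(documents: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per document.

    weight = (term count / tokens in document) * ln(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights
```

Terms that occur in every document get a weight of zero, so only terms that distinguish one free-text instruction from another contribute to the numeric features.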

These notebooks should be run using a Python 3 Jupyter notebook kernel. This can be the `Python 3 (Data Science)` kernel in
SageMaker Studio or the `conda_python3` kernel in a SageMaker notebook instance.

## Machine Learning Prototype development approach
The diagram below captures the ML lifecycle used to develop the ML prototype:

![ISO20022 CBPR+ Message Processing Predictor](docs/images/iso20022-cbpr-ml-lifecycle.png)

To get started with the ML prototype model, follow the steps below:

**Note**: The GitHub repository includes [iso20022-raw-messages.tar.gz](pacs008/synthetic-data/iso20022-data/iso20022-raw-messages.tar.gz)
to get you started quickly. If you want to use that set of raw ISO20022 pacs.008 XML messages, you can skip steps 1 and 2.

1. Generate ISO20022 pacs.008 XML messages using the [ISO20022 Message Generator tool](../../iso20022-message-generator).
```bash
rapide-iso20022 -n 50000 -d messages
```

2. Gzip the messages directory and upload it to an Amazon SageMaker notebook (either an Amazon SageMaker Studio notebook or an Amazon
SageMaker Notebook instance), placing it in the `iso20022-message-generator/ml-models/pacs008/synthetic-data/iso20022-data` directory,
i.e. under the directory where you cloned this GitHub repository.

3. Now you have raw ISO20022 pacs.008 XML messages. The next step is to generate the synthetic raw and labeled raw datasets for
use in ML model training. To do this, use the [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb) notebook.
These raw datasets are further split into training and test datasets in each of
the model training notebooks.

4. You can use the Amazon SageMaker Autopilot, Amazon SageMaker XGBoost or Amazon SageMaker Linear Learner notebooks, or all of them,
to train and deploy an ML model. After deployment, you can test the model by sending a test message to the Inference Endpoint.
Training all models will allow you to compare and evaluate model performance across the approaches.

5. You can also use the SageMaker Batch Transform notebooks to perform batch inference using a trained model and
evaluate the model's performance.
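
Step 2 above can also be scripted. The sketch below packs a directory of generated messages into a gzip-compressed tar archive using Python's standard `tarfile` module. The function name is hypothetical, and it assumes the generator writes files with an `.xml` extension into the `messages` directory; the archive name and layout that the dataset notebook expects are not specified here, so check the notebook before relying on this.

```python
import tarfile
from pathlib import Path


def archive_messages(messages_dir: str, archive_path: str) -> int:
    """Pack every .xml file under messages_dir into a .tar.gz archive.

    Returns the number of files added. The .xml extension is an assumption
    about the generator's output.
    """
    source = Path(messages_dir)
    if not source.is_dir():
        raise NotADirectoryError(messages_dir)

    xml_files = sorted(source.rglob("*.xml"))
    with tarfile.open(archive_path, "w:gz") as archive:
        for xml_file in xml_files:
            # Store paths relative to the parent so the archive unpacks
            # into a single top-level directory (for example "messages/").
            arcname = xml_file.relative_to(source.parent).as_posix()
            archive.add(xml_file, arcname=arcname)
    return len(xml_files)
```

A call such as `archive_messages("messages", "messages.tar.gz")` would produce one file that can then be uploaded to the notebook environment.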

### Train, Deploy and Test an ML model using Amazon SageMaker Autopilot
Use the included notebooks in the following order:

1. Train ML models using the `pacs008_automl_model_training.ipynb` notebook. It prototypes a model using the Amazon SageMaker Autopilot
service. It uses labeled synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb)
to train several models with SageMaker Autopilot and then selects the best-performing model for deployment.
2. Deploy and test the trained model using the `pacs008_automl_model_deployment.ipynb` notebook.
3. Perform batch inference with the trained model using `automl_batch_transform_example.ipynb`. This notebook evaluates the
model by computing a confusion matrix. You can compare the Autopilot-generated model to the other two models, which are feature engineered by hand and then trained.
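
The confusion matrix used for evaluation can be sketched in a few lines of plain Python. This is an illustration of the technique, not the notebook's code: the function name is hypothetical, and it assumes labels are encoded as 1 for Success and 0 for Failure, as in the prediction tuple described earlier in this README.

```python
from collections import Counter


def confusion_matrix(actual: list[int], predicted: list[int]) -> dict[str, int]:
    """Count the four outcomes of a binary classifier (1 = Success, 0 = Failure)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Each (actual, predicted) pair falls into exactly one of four cells.
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],
        "false_negative": counts[(1, 0)],
        "false_positive": counts[(0, 1)],
        "true_negative": counts[(0, 0)],
    }
```

With Success as the positive class, a false positive is a message predicted to succeed that actually failed processing, which is usually the costlier mistake for exception handling.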


### Train, Deploy and Test an ML model using Amazon SageMaker XGBoost
Use the included notebooks in the following order:

1. Train an ML model using the `pacs008_xgboost_inference_pipeline.ipynb` notebook. It prototypes a model using the Amazon SageMaker XGBoost
built-in algorithm, trained on labeled synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb).
2. Deploy and test the trained model using the same `pacs008_xgboost_inference_pipeline.ipynb` notebook. After training, the notebook
deploys the XGBoost model to an Amazon SageMaker Inference Endpoint using a SageMaker Inference Pipeline.
3. Perform batch inference with the trained model using `xgb_batch_transform_example.ipynb`. This notebook evaluates the
model by computing a confusion matrix. You can compare the XGBoost model to the Autopilot and Linear Learner models.
The Linear Learner model is also feature engineered by hand and then trained using the same feature engineering code.
4. You can also read and execute the `pacs008_xgboost_local.ipynb` notebook. It runs locally (it does not use SageMaker XGBoost training).
The notebook demonstrates text preprocessing and feature engineering for text features using scikit-learn transformers, and
checks along the way whether new features derived from the text feature help improve
prediction.

### Train, Deploy and Test an ML model using Amazon SageMaker Linear Learner
Use the included notebooks in the following order:

1. Train an ML model using the `pacs008_linear_learner_inference_pipeline.ipynb` notebook. It prototypes a model using the Amazon SageMaker
Linear Learner built-in algorithm, trained on labeled synthetic data generated by [gen_pacs008_synthetic_dataset.ipynb](pacs008/synthetic-data/gen_pacs008_synthetic_dataset.ipynb).
2. Deploy and test the trained model using the same `pacs008_linear_learner_inference_pipeline.ipynb` notebook. The training notebook
also deploys the trained model using a SageMaker Inference Pipeline.
3. Perform batch inference with the trained model using `ll_batch_transform_example.ipynb`. This notebook evaluates the
model by computing a confusion matrix. You can compare the Linear Learner model to the Autopilot and XGBoost models.
As mentioned, the XGBoost and Linear Learner models are feature engineered by hand and then trained using the same feature engineering code.
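
When comparing the three models, it can help to reduce each model's confusion-matrix counts to a few summary metrics. The sketch below is one common way to do that; the function name is hypothetical, and any counts passed to it in examples are illustrative inputs, not results from this project.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 rather than raising.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Computing the same metrics on the same test dataset for the Autopilot, XGBoost and Linear Learner models gives a like-for-like comparison; metrics computed on different test splits are not directly comparable.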