From 45214e8b03414ff988bf23fefdf969c2206455d8 Mon Sep 17 00:00:00 2001
From: nallave <116003489+nallave@users.noreply.github.com>
Date: Thu, 5 Dec 2024 14:40:59 +0530
Subject: [PATCH 1/2] Fix "regresssion" typo in template.ipynb
---
docs/tutorials/tfx/template.ipynb | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/tutorials/tfx/template.ipynb b/docs/tutorials/tfx/template.ipynb
index bf9592cbd4..79b1a4780b 100644
--- a/docs/tutorials/tfx/template.ipynb
+++ b/docs/tutorials/tfx/template.ipynb
@@ -283,7 +283,7 @@
"id": "ozHIomcd0olB"
},
"source": [
- "TFX includes the [`taxi` template](https://github.com/tensorflow/tfx/tree/master/tfx/experimental/templates/taxi) with the TFX python package. If you are planning to solve a point-wise prediction problem, including classification and regresssion, this template could be used as a starting point.\n",
+ "TFX includes the [`taxi` template](https://github.com/tensorflow/tfx/tree/master/tfx/experimental/templates/taxi) with the TFX python package. If you are planning to solve a point-wise prediction problem, including classification and regression, this template could be used as a starting point.\n",
"\n",
"The `tfx template copy` CLI command copies predefined template files into your project directory."
]
From 237fe758274915fe1cbda5b20b8e713f83a80e11 Mon Sep 17 00:00:00 2001
From: nallave <116003489+nallave@users.noreply.github.com>
Date: Thu, 5 Dec 2024 14:52:03 +0530
Subject: [PATCH 2/2] Fix typos in docs and normalize notebook indentation
---
docs/guide/cli.md | 2 +-
docs/guide/custom_component.md | 2 +-
.../tfx/cloud-ai-platform-pipelines.md | 2 +-
docs/tutorials/tfx/components.ipynb | 3078 ++++++-------
docs/tutorials/tfx/components_keras.ipynb | 3128 +++++++-------
.../vertex_pipelines_vertex_training.ipynb | 2012 ++++-----
.../tfx/gpt2_finetuning_and_conversion.ipynb | 3014 ++++++-------
.../tfx/neural_structured_learning.ipynb | 3808 ++++++++---------
docs/tutorials/tfx/penguin_template.ipynb | 3078 ++++++-------
9 files changed, 9062 insertions(+), 9062 deletions(-)
diff --git a/docs/guide/cli.md b/docs/guide/cli.md
index cadcab772f..d447ba05b5 100644
--- a/docs/guide/cli.md
+++ b/docs/guide/cli.md
@@ -818,7 +818,7 @@ management.
- ${HOME}/tfx/local, beam, airflow, vertex
- Pipeline metadata read from the configuration is stored under
`${HOME}/tfx/${ORCHESTRATION_ENGINE}/${PIPELINE_NAME}`. This location
- can be customized by setting environment varaible like `AIRFLOW_HOME` or
+ can be customized by setting environment variables like `AIRFLOW_HOME` or
`KUBEFLOW_HOME`. This behavior might be changed in future releases. This
directory is used to store pipeline information including pipeline ids
in the Kubeflow Pipelines cluster which is needed to create runs or
diff --git a/docs/guide/custom_component.md b/docs/guide/custom_component.md
index 9527f3bbe2..73257a71cb 100644
--- a/docs/guide/custom_component.md
+++ b/docs/guide/custom_component.md
@@ -69,7 +69,7 @@ class HelloComponentSpec(types.ComponentSpec):
Next, write the executor code for the new component. Basically, a new subclass
of `base_executor.BaseExecutor` needs to be created with its `Do` function
-overriden. In the `Do` function, the arguments `input_dict`, `output_dict` and
+overridden. In the `Do` function, the arguments `input_dict`, `output_dict` and
`exec_properties` that are passed in map to `INPUTS`, `OUTPUTS` and `PARAMETERS`
that are defined in ComponentSpec respectively. For `exec_properties`, the value
can be fetched directly through a dictionary lookup. For artifacts in
diff --git a/docs/tutorials/tfx/cloud-ai-platform-pipelines.md b/docs/tutorials/tfx/cloud-ai-platform-pipelines.md
index 40977a0d05..a8c5a06560 100644
--- a/docs/tutorials/tfx/cloud-ai-platform-pipelines.md
+++ b/docs/tutorials/tfx/cloud-ai-platform-pipelines.md
@@ -769,7 +769,7 @@ gcloud config set project YOUR_PROJECT_ID
gcloud services list --available | grep Dataflow
# If you don't see dataflow.googleapis.com listed, that means you haven't been
-# granted access to enable the Dataflow API. See your account adminstrator.
+# granted access to enable the Dataflow API. See your account administrator.
# Enable the Dataflow service:
diff --git a/docs/tutorials/tfx/components.ipynb b/docs/tutorials/tfx/components.ipynb
index 49959bc8a8..318f0223c0 100644
--- a/docs/tutorials/tfx/components.ipynb
+++ b/docs/tutorials/tfx/components.ipynb
@@ -1,53 +1,53 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wdeKOEkv1Fe8"
- },
- "source": [
- "##### Copyright 2021 The TensorFlow Authors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "id": "c2jyGuiG1gHr"
- },
- "outputs": [],
- "source": [
- "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "23R0Z9RojXYW"
- },
- "source": [
- "# TFX Estimator Component Tutorial\n",
- "\n",
- "***A Component-by-Component Introduction to TensorFlow Extended (TFX)***"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "LidV2qsXm4XC"
- },
- "source": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wdeKOEkv1Fe8"
+ },
+ "source": [
+ "##### Copyright 2021 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "c2jyGuiG1gHr"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "23R0Z9RojXYW"
+ },
+ "source": [
+ "# TFX Estimator Component Tutorial\n",
+ "\n",
+ "***A Component-by-Component Introduction to TensorFlow Extended (TFX)***"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LidV2qsXm4XC"
+ },
+ "source": [
"Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n",
"\n",
"
\n",
@@ -84,1494 +84,1494 @@
" \n",
"
"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "RBbTLeWmWs8q"
- },
- "source": [
- "\u003e Warning: Estimators are not recommended for new code. Estimators run `v1.Session`-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our [compatibility guarantees](https://tensorflow.org/guide/versions), but will receive no fixes other than security vulnerabilities. See the [migration guide](https://tensorflow.org/guide/migrate) for details."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "KAD1tLoTm_QS"
- },
- "source": [
- "\n",
- "This Colab-based tutorial will interactively walk through each built-in component of TensorFlow Extended (TFX).\n",
- "\n",
- "It covers every step in an end-to-end machine learning pipeline, from data ingestion to pushing a model to serving.\n",
- "\n",
- "When you're done, the contents of this notebook can be automatically exported as TFX pipeline source code, which you can orchestrate with Apache Airflow and Apache Beam.\n",
- "\n",
- "Note: This notebook and its associated APIs are **experimental** and are\n",
- "in active development. Major changes in functionality, behavior, and\n",
- "presentation are expected."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "sfSQ-kX-MLEr"
- },
- "source": [
- "## Background\n",
- "This notebook demonstrates how to use TFX in a Jupyter/Colab environment. Here, we walk through the Chicago Taxi example in an interactive notebook.\n",
- "\n",
- "Working in an interactive notebook is a useful way to become familiar with the structure of a TFX pipeline. It's also useful when doing development of your own pipelines as a lightweight development environment, but you should be aware that there are differences in the way interactive notebooks are orchestrated, and how they access metadata artifacts.\n",
- "\n",
- "### Orchestration\n",
- "\n",
- "In a production deployment of TFX, you will use an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components. In an interactive notebook, the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells.\n",
- "\n",
- "### Metadata\n",
- "\n",
- "In a production deployment of TFX, you will access metadata through the ML Metadata (MLMD) API. MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as on your filesystem. In an interactive notebook, both properties and payloads are stored in an ephemeral SQLite database in the `/tmp` directory on the Jupyter notebook or Colab server."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2GivNBNYjb3b"
- },
- "source": [
- "## Setup\n",
- "First, we install and import the necessary packages, set up paths, and download data."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "cDl_6DkqJ-pG"
- },
- "source": [
- "### Upgrade Pip\n",
- "\n",
- "To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tFhBChv4J_PD"
- },
- "outputs": [],
- "source": [
- "try:\n",
- " import colab\n",
- " !pip install --upgrade pip\n",
- "except:\n",
- " pass"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "MZOYTt1RW4TK"
- },
- "source": [
- "### Install TFX\n",
- "\n",
- "**Note: In Google Colab, because of package updates, the first time you run this cell you must restart the runtime (Runtime \u003e Restart runtime ...).**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "S4SQA7Q5nej3"
- },
- "outputs": [],
- "source": [
- "# TFX has a constraint of 1.16 due to the removal of tf.estimator support.\n",
- "!pip install \"tfx\u003c1.16\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "szPQ2MDYPZ5j"
- },
- "source": [
- "## Did you restart the runtime?\n",
- "\n",
- "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime \u003e Restart runtime ...). This is because of the way that Colab loads packages."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "N-ePgV0Lj68Q"
- },
- "source": [
- "### Import packages\n",
- "We import necessary packages, including standard TFX component classes."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "YIqpWK9efviJ"
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import pprint\n",
- "import tempfile\n",
- "import urllib\n",
- "\n",
- "import absl\n",
- "import tensorflow as tf\n",
- "import tensorflow_model_analysis as tfma\n",
- "tf.get_logger().propagate = False\n",
- "pp = pprint.PrettyPrinter()\n",
- "\n",
- "from tfx import v1 as tfx\n",
- "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
- "\n",
- "%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wCZTHRy0N1D6"
- },
- "source": [
- "Let's check the library versions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "eZ4K18_DN2D8"
- },
- "outputs": [],
- "source": [
- "print('TensorFlow version: {}'.format(tf.__version__))\n",
- "print('TFX version: {}'.format(tfx.__version__))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ufJKQ6OvkJlY"
- },
- "source": [
- "### Set up pipeline paths"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ad5JLpKbf6sN"
- },
- "outputs": [],
- "source": [
- "# This is the root directory for your TFX pip package installation.\n",
- "_tfx_root = tfx.__path__[0]\n",
- "\n",
- "# This is the directory containing the TFX Chicago Taxi Pipeline example.\n",
- "_taxi_root = os.path.join(_tfx_root, 'examples/chicago_taxi_pipeline')\n",
- "\n",
- "# This is the path where your model will be pushed for serving.\n",
- "_serving_model_dir = os.path.join(\n",
- " tempfile.mkdtemp(), 'serving_model/taxi_simple')\n",
- "\n",
- "# Set up logging.\n",
- "absl.logging.set_verbosity(absl.logging.INFO)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "n2cMMAbSkGfX"
- },
- "source": [
- "### Download example data\n",
- "We download the example dataset for use in our TFX pipeline.\n",
- "\n",
- "The dataset we're using is the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) released by the City of Chicago. The columns in this dataset are:\n",
- "\n",
- "\u003ctable\u003e\n",
- "\u003ctr\u003e\u003ctd\u003epickup_community_area\u003c/td\u003e\u003ctd\u003efare\u003c/td\u003e\u003ctd\u003etrip_start_month\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003etrip_start_hour\u003c/td\u003e\u003ctd\u003etrip_start_day\u003c/td\u003e\u003ctd\u003etrip_start_timestamp\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003epickup_latitude\u003c/td\u003e\u003ctd\u003epickup_longitude\u003c/td\u003e\u003ctd\u003edropoff_latitude\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003edropoff_longitude\u003c/td\u003e\u003ctd\u003etrip_miles\u003c/td\u003e\u003ctd\u003epickup_census_tract\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003edropoff_census_tract\u003c/td\u003e\u003ctd\u003epayment_type\u003c/td\u003e\u003ctd\u003ecompany\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003etrip_seconds\u003c/td\u003e\u003ctd\u003edropoff_community_area\u003c/td\u003e\u003ctd\u003etips\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003c/table\u003e\n",
- "\n",
- "With this dataset, we will build a model that predicts the `tips` of a trip."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "BywX6OUEhAqn"
- },
- "outputs": [],
- "source": [
- "_data_root = tempfile.mkdtemp(prefix='tfx-data')\n",
- "DATA_PATH = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv'\n",
- "_data_filepath = os.path.join(_data_root, \"data.csv\")\n",
- "urllib.request.urlretrieve(DATA_PATH, _data_filepath)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "blZC1sIQOWfH"
- },
- "source": [
- "Take a quick look at the CSV file."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "c5YPeLPFOXaD"
- },
- "outputs": [],
- "source": [
- "!head {_data_filepath}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "QioyhunCImwE"
- },
- "source": [
- "*Disclaimer: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8ONIE_hdkPS4"
- },
- "source": [
- "### Create the InteractiveContext\n",
- "Last, we create an InteractiveContext, which will allow us to run TFX components interactively in this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "0Rh6K5sUf9dd"
- },
- "outputs": [],
- "source": [
- "# Here, we create an InteractiveContext using default parameters. This will\n",
- "# use a temporary directory with an ephemeral ML Metadata database instance.\n",
- "# To use your own pipeline root or database, the optional properties\n",
- "# `pipeline_root` and `metadata_connection_config` may be passed to\n",
- "# InteractiveContext. Calls to InteractiveContext are no-ops outside of the\n",
- "# notebook.\n",
- "context = InteractiveContext()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HdQWxfsVkzdJ"
- },
- "source": [
- "## Run TFX components interactively\n",
- "In the cells that follow, we create TFX components one-by-one, run each of them, and visualize their output artifacts."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "L9fwt9gQk3BR"
- },
- "source": [
- "### ExampleGen\n",
- "\n",
- "The `ExampleGen` component is usually at the start of a TFX pipeline. It will:\n",
- "\n",
- "1. Split data into training and evaluation sets (by default, 2/3 training + 1/3 eval)\n",
- "2. Convert data into the `tf.Example` format (learn more [here](https://www.tensorflow.org/tutorials/load_data/tfrecord))\n",
- "3. Copy data into the `_tfx_root` directory for other components to access\n",
- "\n",
- "`ExampleGen` takes as input the path to your data source. In our case, this is the `_data_root` path that contains the downloaded CSV.\n",
- "\n",
- "Note: In this notebook, we can instantiate components one-by-one and run them with `InteractiveContext.run()`. By contrast, in a production setting, we would specify all the components upfront in a `Pipeline` to pass to the orchestrator (see the [Building a TFX Pipeline Guide](../../../guide/build_tfx_pipeline))."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "PyXjuMt8f-9u"
- },
- "outputs": [],
- "source": [
- "example_gen = tfx.components.CsvExampleGen(input_base=_data_root)\n",
- "context.run(example_gen)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OqCoZh7KPUm9"
- },
- "source": [
- "Let's examine the output artifacts of `ExampleGen`. This component produces two artifacts, training examples and evaluation examples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "880KkTAkPeUg"
- },
- "outputs": [],
- "source": [
- "artifact = example_gen.outputs['examples'].get()[0]\n",
- "print(artifact.split_names, artifact.uri)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "J6vcbW_wPqvl"
- },
- "source": [
- "We can also take a look at the first three training examples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "H4XIXjiCPwzQ"
- },
- "outputs": [],
- "source": [
- "# Get the URI of the output artifact representing the training examples, which is a directory\n",
- "train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')\n",
- "\n",
- "# Get the list of files in this directory (all compressed TFRecord files)\n",
- "tfrecord_filenames = [os.path.join(train_uri, name)\n",
- " for name in os.listdir(train_uri)]\n",
- "\n",
- "# Create a `TFRecordDataset` to read these files\n",
- "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- "\n",
- "# Iterate over the first 3 records and decode them.\n",
- "for tfrecord in dataset.take(3):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2gluYjccf-IP"
- },
- "source": [
- "Now that `ExampleGen` has finished ingesting the data, the next step is data analysis."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "csM6BFhtk5Aa"
- },
- "source": [
- "### StatisticsGen\n",
- "The `StatisticsGen` component computes statistics over your dataset for data analysis, as well as for use in downstream components. It uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "`StatisticsGen` takes as input the dataset we just ingested using `ExampleGen`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MAscCCYWgA-9"
- },
- "outputs": [],
- "source": [
- "statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])\n",
- "context.run(statistics_gen)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HLI6cb_5WugZ"
- },
- "source": [
- "After `StatisticsGen` finishes running, we can visualize the outputted statistics. Try playing with the different plots!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tLjXy7K6Tp_G"
- },
- "outputs": [],
- "source": [
- "context.show(statistics_gen.outputs['statistics'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HLKLTO9Nk60p"
- },
- "source": [
- "### SchemaGen\n",
- "\n",
- "The `SchemaGen` component generates a schema based on your data statistics. (A schema defines the expected bounds, types, and properties of the features in your dataset.) It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ygQvZ6hsiQ_J"
- },
- "outputs": [],
- "source": [
- "schema_gen = tfx.components.SchemaGen(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " infer_feature_shape=False)\n",
- "context.run(schema_gen)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "zi6TxTUKXM6b"
- },
- "source": [
- "After `SchemaGen` finishes running, we can visualize the generated schema as a table."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Ec9vqDXpXeMb"
- },
- "outputs": [],
- "source": [
- "context.show(schema_gen.outputs['schema'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kZWWdbA-m7zp"
- },
- "source": [
- "Each feature in your dataset shows up as a row in the schema table, alongside its properties. The schema also captures all the values that a categorical feature takes on, denoted as its domain.\n",
- "\n",
- "To learn more about schemas, see [the SchemaGen documentation](../../../guide/schemagen)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "V1qcUuO9k9f8"
- },
- "source": [
- "### ExampleValidator\n",
- "The `ExampleValidator` component detects anomalies in your data, based on the expectations defined by the schema. It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "`ExampleValidator` will take as input the statistics from `StatisticsGen`, and the schema from `SchemaGen`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XRlRUuGgiXks"
- },
- "outputs": [],
- "source": [
- "example_validator = tfx.components.ExampleValidator(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " schema=schema_gen.outputs['schema'])\n",
- "context.run(example_validator)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "855mrHgJcoer"
- },
- "source": [
- "After `ExampleValidator` finishes running, we can visualize the anomalies as a table."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "TDyAAozQcrk3"
- },
- "outputs": [],
- "source": [
- "context.show(example_validator.outputs['anomalies'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "znMoJj60ybZx"
- },
- "source": [
- "In the anomalies table, we can see that there are no anomalies. This is what we'd expect, since this the first dataset that we've analyzed and the schema is tailored to it. You should review this schema -- anything unexpected means an anomaly in the data. Once reviewed, the schema can be used to guard future data, and anomalies produced here can be used to debug model performance, understand how your data evolves over time, and identify data errors."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "JPViEz5RlA36"
- },
- "source": [
- "### Transform\n",
- "The `Transform` component performs feature engineering for both training and serving. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library.\n",
- "\n",
- "`Transform` will take as input the data from `ExampleGen`, the schema from `SchemaGen`, as well as a module that contains user-defined Transform code.\n",
- "\n",
- "Let's see an example of user-defined Transform code below (for an introduction to the TensorFlow Transform APIs, [see the tutorial](/tutorials/transform/simple)). First, we define a few constants for feature engineering:\n",
- "\n",
- "Note: The `%%writefile` cell magic will save the contents of the cell as a `.py` file on disk. This allows the `Transform` component to load your code as a module.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "PuNSiUKb4YJf"
- },
- "outputs": [],
- "source": [
- "_taxi_constants_module_file = 'taxi_constants.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "HPjhXuIF4YJh"
- },
- "outputs": [],
- "source": [
- "%%writefile {_taxi_constants_module_file}\n",
- "\n",
- "# Categorical features are assumed to each have a maximum value in the dataset.\n",
- "MAX_CATEGORICAL_FEATURE_VALUES = [24, 31, 12]\n",
- "\n",
- "CATEGORICAL_FEATURE_KEYS = [\n",
- " 'trip_start_hour', 'trip_start_day', 'trip_start_month',\n",
- " 'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',\n",
- " 'dropoff_community_area'\n",
- "]\n",
- "\n",
- "DENSE_FLOAT_FEATURE_KEYS = ['trip_miles', 'fare', 'trip_seconds']\n",
- "\n",
- "# Number of buckets used by tf.transform for encoding each feature.\n",
- "FEATURE_BUCKET_COUNT = 10\n",
- "\n",
- "BUCKET_FEATURE_KEYS = [\n",
- " 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',\n",
- " 'dropoff_longitude'\n",
- "]\n",
- "\n",
- "# Number of vocabulary terms used for encoding VOCAB_FEATURES by tf.transform\n",
- "VOCAB_SIZE = 1000\n",
- "\n",
- "# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.\n",
- "OOV_SIZE = 10\n",
- "\n",
- "VOCAB_FEATURE_KEYS = [\n",
- " 'payment_type',\n",
- " 'company',\n",
- "]\n",
- "\n",
- "# Keys\n",
- "LABEL_KEY = 'tips'\n",
- "FARE_KEY = 'fare'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Duj2Ax5z4YJl"
- },
- "source": [
- "Next, we write a `preprocessing_fn` that takes in raw data as input, and returns transformed features that our model can train on:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4AJ9hBs94YJm"
- },
- "outputs": [],
- "source": [
- "_taxi_transform_module_file = 'taxi_transform.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MYmxxx9A4YJn"
- },
- "outputs": [],
- "source": [
- "%%writefile {_taxi_transform_module_file}\n",
- "\n",
- "import tensorflow as tf\n",
- "import tensorflow_transform as tft\n",
- "\n",
- "import taxi_constants\n",
- "\n",
- "_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS\n",
- "_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS\n",
- "_VOCAB_SIZE = taxi_constants.VOCAB_SIZE\n",
- "_OOV_SIZE = taxi_constants.OOV_SIZE\n",
- "_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT\n",
- "_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS\n",
- "_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS\n",
- "_FARE_KEY = taxi_constants.FARE_KEY\n",
- "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
- "\n",
- "\n",
- "def preprocessing_fn(inputs):\n",
- " \"\"\"tf.transform's callback function for preprocessing inputs.\n",
- " Args:\n",
- " inputs: map from feature keys to raw not-yet-transformed features.\n",
- " Returns:\n",
- " Map from string feature key to transformed feature operations.\n",
- " \"\"\"\n",
- " outputs = {}\n",
- " for key in _DENSE_FLOAT_FEATURE_KEYS:\n",
- " # If sparse make it dense, setting nan's to 0 or '', and apply zscore.\n",
- " outputs[key] = tft.scale_to_z_score(\n",
- " _fill_in_missing(inputs[key]))\n",
- "\n",
- " for key in _VOCAB_FEATURE_KEYS:\n",
- " # Build a vocabulary for this feature.\n",
- " outputs[key] = tft.compute_and_apply_vocabulary(\n",
- " _fill_in_missing(inputs[key]),\n",
- " top_k=_VOCAB_SIZE,\n",
- " num_oov_buckets=_OOV_SIZE)\n",
- "\n",
- " for key in _BUCKET_FEATURE_KEYS:\n",
- " outputs[key] = tft.bucketize(\n",
- " _fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT)\n",
- "\n",
- " for key in _CATEGORICAL_FEATURE_KEYS:\n",
- " outputs[key] = _fill_in_missing(inputs[key])\n",
- "\n",
- " # Was this passenger a big tipper?\n",
- " taxi_fare = _fill_in_missing(inputs[_FARE_KEY])\n",
- " tips = _fill_in_missing(inputs[_LABEL_KEY])\n",
- " outputs[_LABEL_KEY] = tf.where(\n",
- " tf.math.is_nan(taxi_fare),\n",
- " tf.cast(tf.zeros_like(taxi_fare), tf.int64),\n",
- " # Test if the tip was \u003e 20% of the fare.\n",
- " tf.cast(\n",
- " tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))\n",
- "\n",
- " return outputs\n",
- "\n",
- "\n",
- "def _fill_in_missing(x):\n",
- " \"\"\"Replace missing values in a SparseTensor.\n",
- " Fills in missing values of `x` with '' or 0, and converts to a dense tensor.\n",
- " Args:\n",
- " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
- " in the second dimension.\n",
- " Returns:\n",
- " A rank 1 tensor where missing values of `x` have been filled in.\n",
- " \"\"\"\n",
- " if not isinstance(x, tf.sparse.SparseTensor):\n",
- " return x\n",
- "\n",
- " default_value = '' if x.dtype == tf.string else 0\n",
- " return tf.squeeze(\n",
- " tf.sparse.to_dense(\n",
- " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
- " default_value),\n",
- " axis=1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wgbmZr3sgbWW"
- },
- "source": [
- "Now, we pass in this feature engineering code to the `Transform` component and run it to transform your data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "jHfhth_GiZI9"
- },
- "outputs": [],
- "source": [
- "transform = tfx.components.Transform(\n",
- " examples=example_gen.outputs['examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " module_file=os.path.abspath(_taxi_transform_module_file))\n",
- "context.run(transform)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fwAwb4rARRQ2"
- },
- "source": [
- "Let's examine the output artifacts of `Transform`. This component produces two types of outputs:\n",
- "\n",
- "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
- "* `transformed_examples` represents the preprocessed training and evaluation data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "SClrAaEGR1O5"
- },
- "outputs": [],
- "source": [
- "transform.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "vyFkBd9AR1sy"
- },
- "source": [
- "Take a peek at the `transform_graph` artifact. It points to a directory containing three subdirectories."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "5tRw4DneR3i7"
- },
- "outputs": [],
- "source": [
- "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
- "os.listdir(train_uri)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4fqV54CIR6Pu"
- },
- "source": [
- "The `transformed_metadata` subdirectory contains the schema of the preprocessed data. The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data.\n",
- "\n",
- "We can also take a look at the first three transformed examples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pwbW2zPKR_S4"
- },
- "outputs": [],
- "source": [
- "# Get the URI of the output artifact representing the transformed examples, which is a directory\n",
- "train_uri = os.path.join(transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')\n",
- "\n",
- "# Get the list of files in this directory (all compressed TFRecord files)\n",
- "tfrecord_filenames = [os.path.join(train_uri, name)\n",
- " for name in os.listdir(train_uri)]\n",
- "\n",
- "# Create a `TFRecordDataset` to read these files\n",
- "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- "\n",
- "# Iterate over the first 3 records and decode them.\n",
- "for tfrecord in dataset.take(3):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "q_b_V6eN4f69"
- },
- "source": [
-        "After the `Transform` component has transformed your data into features, the next step is to train a model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OBJFtnl6lCg9"
- },
- "source": [
- "### Trainer\n",
- "The `Trainer` component will train a model that you define in TensorFlow (either using the Estimator API or the Keras API with [`model_to_estimator`](https://www.tensorflow.org/api_docs/python/tf/keras/estimator/model_to_estimator)).\n",
- "\n",
- "`Trainer` takes as input the schema from `SchemaGen`, the transformed data and graph from `Transform`, training parameters, as well as a module that contains user-defined model code.\n",
- "\n",
- "Let's see an example of user-defined model code below (for an introduction to the TensorFlow Estimator APIs, [see the tutorial](https://www.tensorflow.org/tutorials/estimator/premade)):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "N1376oq04YJt"
- },
- "outputs": [],
- "source": [
- "_taxi_trainer_module_file = 'taxi_trainer.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "nf9UuNng4YJu"
- },
- "outputs": [],
- "source": [
- "%%writefile {_taxi_trainer_module_file}\n",
- "\n",
- "import tensorflow as tf\n",
- "import tensorflow_model_analysis as tfma\n",
- "import tensorflow_transform as tft\n",
- "from tensorflow_transform.tf_metadata import schema_utils\n",
- "from tfx_bsl.tfxio import dataset_options\n",
- "\n",
- "import taxi_constants\n",
- "\n",
- "_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS\n",
- "_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS\n",
- "_VOCAB_SIZE = taxi_constants.VOCAB_SIZE\n",
- "_OOV_SIZE = taxi_constants.OOV_SIZE\n",
- "_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT\n",
- "_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS\n",
- "_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS\n",
- "_MAX_CATEGORICAL_FEATURE_VALUES = taxi_constants.MAX_CATEGORICAL_FEATURE_VALUES\n",
- "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
- "\n",
- "\n",
- "# Tf.Transform considers these features as \"raw\"\n",
- "def _get_raw_feature_spec(schema):\n",
- " return schema_utils.schema_as_feature_spec(schema).feature_spec\n",
- "\n",
- "\n",
- "def _build_estimator(config, hidden_units=None, warm_start_from=None):\n",
- " \"\"\"Build an estimator for predicting the tipping behavior of taxi riders.\n",
- " Args:\n",
- " config: tf.estimator.RunConfig defining the runtime environment for the\n",
- " estimator (including model_dir).\n",
- " hidden_units: [int], the layer sizes of the DNN (input layer first)\n",
- " warm_start_from: Optional directory to warm start from.\n",
- " Returns:\n",
- " A dict of the following:\n",
- " - estimator: The estimator that will be used for training and eval.\n",
- " - train_spec: Spec for training.\n",
- " - eval_spec: Spec for eval.\n",
- " - eval_input_receiver_fn: Input function for eval.\n",
- " \"\"\"\n",
- " real_valued_columns = [\n",
- " tf.feature_column.numeric_column(key, shape=())\n",
- " for key in _DENSE_FLOAT_FEATURE_KEYS\n",
- " ]\n",
- " categorical_columns = [\n",
- " tf.feature_column.categorical_column_with_identity(\n",
- " key, num_buckets=_VOCAB_SIZE + _OOV_SIZE, default_value=0)\n",
- " for key in _VOCAB_FEATURE_KEYS\n",
- " ]\n",
- " categorical_columns += [\n",
- " tf.feature_column.categorical_column_with_identity(\n",
- " key, num_buckets=_FEATURE_BUCKET_COUNT, default_value=0)\n",
- " for key in _BUCKET_FEATURE_KEYS\n",
- " ]\n",
- " categorical_columns += [\n",
- " tf.feature_column.categorical_column_with_identity( # pylint: disable=g-complex-comprehension\n",
- " key,\n",
- " num_buckets=num_buckets,\n",
- " default_value=0) for key, num_buckets in zip(\n",
- " _CATEGORICAL_FEATURE_KEYS,\n",
- " _MAX_CATEGORICAL_FEATURE_VALUES)\n",
- " ]\n",
- " return tf.estimator.DNNLinearCombinedClassifier(\n",
- " config=config,\n",
- " linear_feature_columns=categorical_columns,\n",
- " dnn_feature_columns=real_valued_columns,\n",
- " dnn_hidden_units=hidden_units or [100, 70, 50, 25],\n",
- " warm_start_from=warm_start_from)\n",
- "\n",
- "\n",
- "def _example_serving_receiver_fn(tf_transform_graph, schema):\n",
-        "  \"\"\"Build the serving inputs.\n",
- " Args:\n",
- " tf_transform_graph: A TFTransformOutput.\n",
- " schema: the schema of the input data.\n",
- " Returns:\n",
- " Tensorflow graph which parses examples, applying tf-transform to them.\n",
- " \"\"\"\n",
- " raw_feature_spec = _get_raw_feature_spec(schema)\n",
- " raw_feature_spec.pop(_LABEL_KEY)\n",
- "\n",
- " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
- " raw_feature_spec, default_batch_size=None)\n",
- " serving_input_receiver = raw_input_fn()\n",
- "\n",
- " transformed_features = tf_transform_graph.transform_raw_features(\n",
- " serving_input_receiver.features)\n",
- "\n",
- " return tf.estimator.export.ServingInputReceiver(\n",
- " transformed_features, serving_input_receiver.receiver_tensors)\n",
- "\n",
- "\n",
- "def _eval_input_receiver_fn(tf_transform_graph, schema):\n",
-        "  \"\"\"Build everything needed for tf-model-analysis to run the model.\n",
- " Args:\n",
- " tf_transform_graph: A TFTransformOutput.\n",
- " schema: the schema of the input data.\n",
- " Returns:\n",
- " EvalInputReceiver function, which contains:\n",
- " - Tensorflow graph which parses raw untransformed features, applies the\n",
- " tf-transform preprocessing operators.\n",
- " - Set of raw, untransformed features.\n",
- " - Label against which predictions will be compared.\n",
- " \"\"\"\n",
- " # Notice that the inputs are raw features, not transformed features here.\n",
- " raw_feature_spec = _get_raw_feature_spec(schema)\n",
- "\n",
- " serialized_tf_example = tf.compat.v1.placeholder(\n",
- " dtype=tf.string, shape=[None], name='input_example_tensor')\n",
- "\n",
- " # Add a parse_example operator to the tensorflow graph, which will parse\n",
- " # raw, untransformed, tf examples.\n",
- " features = tf.io.parse_example(serialized_tf_example, raw_feature_spec)\n",
- "\n",
- " # Now that we have our raw examples, process them through the tf-transform\n",
- " # function computed during the preprocessing step.\n",
- " transformed_features = tf_transform_graph.transform_raw_features(\n",
- " features)\n",
- "\n",
- " # The key name MUST be 'examples'.\n",
- " receiver_tensors = {'examples': serialized_tf_example}\n",
- "\n",
- " # NOTE: Model is driven by transformed features (since training works on the\n",
-        "  # materialized output of TFT), but slicing will happen on raw features.\n",
- " features.update(transformed_features)\n",
- "\n",
- " return tfma.export.EvalInputReceiver(\n",
- " features=features,\n",
- " receiver_tensors=receiver_tensors,\n",
- " labels=transformed_features[_LABEL_KEY])\n",
- "\n",
- "\n",
- "def _input_fn(file_pattern, data_accessor, tf_transform_output, batch_size=200):\n",
- " \"\"\"Generates features and label for tuning/training.\n",
- "\n",
- " Args:\n",
- " file_pattern: List of paths or patterns of input tfrecord files.\n",
- " data_accessor: DataAccessor for converting input to RecordBatch.\n",
- " tf_transform_output: A TFTransformOutput.\n",
- " batch_size: representing the number of consecutive elements of returned\n",
- " dataset to combine in a single batch\n",
- "\n",
- " Returns:\n",
- " A dataset that contains (features, indices) tuple where features is a\n",
- " dictionary of Tensors, and indices is a single Tensor of label indices.\n",
- " \"\"\"\n",
- " return data_accessor.tf_dataset_factory(\n",
- " file_pattern,\n",
- " dataset_options.TensorFlowDatasetOptions(\n",
- " batch_size=batch_size, label_key=_LABEL_KEY),\n",
- " tf_transform_output.transformed_metadata.schema)\n",
- "\n",
- "\n",
- "# TFX will call this function\n",
- "def trainer_fn(trainer_fn_args, schema):\n",
- " \"\"\"Build the estimator using the high level API.\n",
- " Args:\n",
- " trainer_fn_args: Holds args used to train the model as name/value pairs.\n",
- " schema: Holds the schema of the training examples.\n",
- " Returns:\n",
- " A dict of the following:\n",
- " - estimator: The estimator that will be used for training and eval.\n",
- " - train_spec: Spec for training.\n",
- " - eval_spec: Spec for eval.\n",
- " - eval_input_receiver_fn: Input function for eval.\n",
- " \"\"\"\n",
- " # Number of nodes in the first layer of the DNN\n",
- " first_dnn_layer_size = 100\n",
- " num_dnn_layers = 4\n",
- " dnn_decay_factor = 0.7\n",
- "\n",
- " train_batch_size = 40\n",
- " eval_batch_size = 40\n",
- "\n",
- " tf_transform_graph = tft.TFTransformOutput(trainer_fn_args.transform_output)\n",
- "\n",
- " train_input_fn = lambda: _input_fn( # pylint: disable=g-long-lambda\n",
- " trainer_fn_args.train_files,\n",
- " trainer_fn_args.data_accessor,\n",
- " tf_transform_graph,\n",
- " batch_size=train_batch_size)\n",
- "\n",
- " eval_input_fn = lambda: _input_fn( # pylint: disable=g-long-lambda\n",
- " trainer_fn_args.eval_files,\n",
- " trainer_fn_args.data_accessor,\n",
- " tf_transform_graph,\n",
- " batch_size=eval_batch_size)\n",
- "\n",
- " train_spec = tf.estimator.TrainSpec( # pylint: disable=g-long-lambda\n",
- " train_input_fn,\n",
- " max_steps=trainer_fn_args.train_steps)\n",
- "\n",
- " serving_receiver_fn = lambda: _example_serving_receiver_fn( # pylint: disable=g-long-lambda\n",
- " tf_transform_graph, schema)\n",
- "\n",
- " exporter = tf.estimator.FinalExporter('chicago-taxi', serving_receiver_fn)\n",
- " eval_spec = tf.estimator.EvalSpec(\n",
- " eval_input_fn,\n",
- " steps=trainer_fn_args.eval_steps,\n",
- " exporters=[exporter],\n",
- " name='chicago-taxi-eval')\n",
- "\n",
- " run_config = tf.estimator.RunConfig(\n",
- " save_checkpoints_steps=999, keep_checkpoint_max=1)\n",
- "\n",
- " run_config = run_config.replace(model_dir=trainer_fn_args.serving_model_dir)\n",
- "\n",
- " estimator = _build_estimator(\n",
-        "      # Construct layer sizes with exponential decay\n",
- " hidden_units=[\n",
- " max(2, int(first_dnn_layer_size * dnn_decay_factor**i))\n",
- " for i in range(num_dnn_layers)\n",
- " ],\n",
- " config=run_config,\n",
- " warm_start_from=trainer_fn_args.base_model)\n",
- "\n",
- " # Create an input receiver for TFMA processing\n",
- " receiver_fn = lambda: _eval_input_receiver_fn( # pylint: disable=g-long-lambda\n",
- " tf_transform_graph, schema)\n",
- "\n",
- " return {\n",
- " 'estimator': estimator,\n",
- " 'train_spec': train_spec,\n",
- " 'eval_spec': eval_spec,\n",
- " 'eval_input_receiver_fn': receiver_fn\n",
- " }"
- ]
- },
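The `hidden_units` expression in `trainer_fn` above shrinks the DNN geometrically: each layer is 70% the width of the previous one, floored at a width of 2. The resulting layer sizes can be computed directly in plain Python:

```python
# Reproduce the hidden_units computation from trainer_fn above.
first_dnn_layer_size = 100
num_dnn_layers = 4
dnn_decay_factor = 0.7

# Each layer is 70% of the previous one's width, floored at 2;
# roughly [100, 70, 49, 34] (exact values depend on float rounding).
hidden_units = [
    max(2, int(first_dnn_layer_size * dnn_decay_factor**i))
    for i in range(num_dnn_layers)
]
print(hidden_units)
```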
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GY4yTRaX4YJx"
- },
- "source": [
- "Now, we pass in this model code to the `Trainer` component and run it to train the model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "429-vvCWibO0"
- },
- "outputs": [],
- "source": [
- "from tfx.components.trainer.executor import Executor\n",
- "from tfx.dsl.components.base import executor_spec\n",
- "\n",
- "trainer = tfx.components.Trainer(\n",
- " module_file=os.path.abspath(_taxi_trainer_module_file),\n",
- " custom_executor_spec=executor_spec.ExecutorClassSpec(Executor),\n",
- " examples=transform.outputs['transformed_examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " transform_graph=transform.outputs['transform_graph'],\n",
- " train_args=tfx.proto.TrainArgs(num_steps=10000),\n",
- " eval_args=tfx.proto.EvalArgs(num_steps=5000))\n",
- "context.run(trainer)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6Cql1G35StJp"
- },
- "source": [
- "#### Analyze Training with TensorBoard\n",
- "Optionally, we can connect TensorBoard to the Trainer to analyze our model's training curves."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "bXe62WE0S0Ek"
- },
- "outputs": [],
- "source": [
- "# Get the URI of the output artifact representing the training logs, which is a directory\n",
- "model_run_dir = trainer.outputs['model_run'].get()[0].uri\n",
- "\n",
- "%load_ext tensorboard\n",
- "%tensorboard --logdir {model_run_dir}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FmPftrv0lEQy"
- },
- "source": [
- "### Evaluator\n",
-        "The `Evaluator` component computes model performance metrics over the evaluation set. It uses the [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/model_analysis/get_started) library. The `Evaluator` can also optionally validate that a newly trained model is better than the previous model. This is useful in a production pipeline setting where you may automatically train and validate a model every day. In this notebook, we only train one model, so the `Evaluator` will automatically label the model as \"good\".\n",
- "\n",
- "`Evaluator` will take as input the data from `ExampleGen`, the trained model from `Trainer`, and slicing configuration. The slicing configuration allows you to slice your metrics on feature values (e.g. how does your model perform on taxi trips that start at 8am versus 8pm?). See an example of this configuration below:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "fVhfzzh9PDEx"
- },
- "outputs": [],
- "source": [
- "eval_config = tfma.EvalConfig(\n",
- " model_specs=[\n",
- " # Using signature 'eval' implies the use of an EvalSavedModel. To use\n",
-        "        # a serving model, remove the signature to default to 'serving_default'\n",
- " # and add a label_key.\n",
- " tfma.ModelSpec(signature_name='eval')\n",
- " ],\n",
- " metrics_specs=[\n",
- " tfma.MetricsSpec(\n",
- " # The metrics added here are in addition to those saved with the\n",
- " # model (assuming either a keras model or EvalSavedModel is used).\n",
- " # Any metrics added into the saved model (for example using\n",
- " # model.compile(..., metrics=[...]), etc) will be computed\n",
- " # automatically.\n",
- " metrics=[\n",
- " tfma.MetricConfig(class_name='ExampleCount')\n",
- " ],\n",
- " # To add validation thresholds for metrics saved with the model,\n",
- " # add them keyed by metric name to the thresholds map.\n",
- " thresholds = {\n",
- " 'accuracy': tfma.MetricThreshold(\n",
- " value_threshold=tfma.GenericValueThreshold(\n",
- " lower_bound={'value': 0.5}),\n",
- " # Change threshold will be ignored if there is no\n",
- " # baseline model resolved from MLMD (first run).\n",
- " change_threshold=tfma.GenericChangeThreshold(\n",
- " direction=tfma.MetricDirection.HIGHER_IS_BETTER,\n",
- " absolute={'value': -1e-10}))\n",
- " }\n",
- " )\n",
- " ],\n",
- " slicing_specs=[\n",
- " # An empty slice spec means the overall slice, i.e. the whole dataset.\n",
- " tfma.SlicingSpec(),\n",
- " # Data can be sliced along a feature column. In this case, data is\n",
- " # sliced along feature column trip_start_hour.\n",
- " tfma.SlicingSpec(feature_keys=['trip_start_hour'])\n",
- " ])"
- ]
- },
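The two thresholds above combine as follows: the value threshold rejects any candidate whose accuracy is below 0.5, while the change threshold compares the candidate against the blessed baseline (and is skipped on the first run, when no baseline exists). A pure-Python sketch of that blessing rule, not the actual TFMA implementation:

```python
# Sketch (not the TFMA implementation) of how the value and change
# thresholds above combine into a blessing decision.
def is_blessed(candidate_acc, baseline_acc=None,
               lower_bound=0.5, absolute_change=-1e-10):
    # Value threshold: accuracy must clear the absolute lower bound.
    if candidate_acc < lower_bound:
        return False
    # Change threshold: only applies when a baseline model was resolved;
    # HIGHER_IS_BETTER with absolute=-1e-10 tolerates a negligible drop.
    if baseline_acc is not None:
        return (candidate_acc - baseline_acc) >= absolute_change
    return True

print(is_blessed(0.8))        # no baseline (first run): blessed
print(is_blessed(0.8, 0.9))   # clear regression vs. baseline: not blessed
print(is_blessed(0.4))        # fails the value threshold: not blessed
```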
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "9mBdKH1F8JuT"
- },
- "source": [
- "Next, we give this configuration to `Evaluator` and run it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Zjcx8g6mihSt"
- },
- "outputs": [],
- "source": [
-        "# Use TFMA to compute evaluation statistics over features of a model and\n",
- "# validate them against a baseline.\n",
- "\n",
- "# The model resolver is only required if performing model validation in addition\n",
- "# to evaluation. In this case we validate against the latest blessed model. If\n",
-        "# no model has been blessed before (as in this case), the evaluator will make our\n",
- "# candidate the first blessed model.\n",
- "model_resolver = tfx.dsl.Resolver(\n",
- " strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,\n",
- " model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),\n",
- " model_blessing=tfx.dsl.Channel(\n",
- " type=tfx.types.standard_artifacts.ModelBlessing)).with_id(\n",
- " 'latest_blessed_model_resolver')\n",
- "context.run(model_resolver)\n",
- "\n",
- "evaluator = tfx.components.Evaluator(\n",
- " examples=example_gen.outputs['examples'],\n",
- " model=trainer.outputs['model'],\n",
- " eval_config=eval_config)\n",
- "context.run(evaluator)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "AeCVkBusS_8g"
- },
- "source": [
- "Now let's examine the output artifacts of `Evaluator`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "k4GghePOTJxL"
- },
- "outputs": [],
- "source": [
- "evaluator.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Y5TMskWe9LL0"
- },
- "source": [
- "Using the `evaluation` output we can show the default visualization of global metrics on the entire evaluation set."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "U729j5X5QQUQ"
- },
- "outputs": [],
- "source": [
- "context.show(evaluator.outputs['evaluation'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "t-tI4p6m-OAn"
- },
- "source": [
- "To see the visualization for sliced evaluation metrics, we can directly call the TensorFlow Model Analysis library."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pyis6iy0HLdi"
- },
- "outputs": [],
- "source": [
- "import tensorflow_model_analysis as tfma\n",
- "\n",
- "# Get the TFMA output result path and load the result.\n",
- "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
- "tfma_result = tfma.load_eval_result(PATH_TO_RESULT)\n",
- "\n",
- "# Show data sliced along feature column trip_start_hour.\n",
- "tfma.view.render_slicing_metrics(\n",
- " tfma_result, slicing_column='trip_start_hour')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "7uvYrUf2-r_6"
- },
- "source": [
- "This visualization shows the same metrics, but computed at every feature value of `trip_start_hour` instead of on the entire evaluation set.\n",
- "\n",
- "TensorFlow Model Analysis supports many other visualizations, such as Fairness Indicators and plotting a time series of model performance. To learn more, see [the tutorial](/tutorials/model_analysis/tfma_basic)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TEotnkxEswUb"
- },
- "source": [
-        "Since we added thresholds to our config, validation output is also available. The presence of a `blessing` artifact indicates that our model passed validation. Since this is the first validation being performed, the candidate is automatically blessed."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "FZmiRtg6TKtR"
- },
- "outputs": [],
- "source": [
- "blessing_uri = evaluator.outputs['blessing'].get()[0].uri\n",
- "!ls -l {blessing_uri}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "hM1tFkOVSBa0"
- },
- "source": [
-        "We can also verify the success by loading the validation result record:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "lxa5G08bSJ8a"
- },
- "outputs": [],
- "source": [
- "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
- "print(tfma.load_validation_result(PATH_TO_RESULT))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "T8DYekCZlHfj"
- },
- "source": [
- "### Pusher\n",
- "The `Pusher` component is usually at the end of a TFX pipeline. It checks whether a model has passed validation, and if so, exports the model to `_serving_model_dir`."
- ]
- },
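The `Pusher` check can be sketched in a few lines: copy the model into the serving directory only when the `Evaluator` produced a blessing marker. This is a simplified stand-in for the real component; the `BLESSED` marker file and directory layout below are illustrative:

```python
import os
import shutil
import tempfile

# Simplified stand-in for Pusher: copy the model into the serving
# directory only when the Evaluator wrote a BLESSED marker file.
def push_if_blessed(model_dir, blessing_dir, serving_model_dir):
    if not os.path.exists(os.path.join(blessing_dir, 'BLESSED')):
        return False
    shutil.copytree(model_dir, os.path.join(serving_model_dir, 'latest'))
    return True

root = tempfile.mkdtemp()
model_dir = os.path.join(root, 'model')
os.makedirs(model_dir)
open(os.path.join(model_dir, 'saved_model.pb'), 'w').close()
blessing_dir = os.path.join(root, 'blessing')
os.makedirs(blessing_dir)
open(os.path.join(blessing_dir, 'BLESSED'), 'w').close()  # blessing marker
serving_model_dir = os.path.join(root, 'serving')

print(push_if_blessed(model_dir, blessing_dir, serving_model_dir))  # True
```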
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "r45nQ69eikc9"
- },
- "outputs": [],
- "source": [
- "pusher = tfx.components.Pusher(\n",
- " model=trainer.outputs['model'],\n",
- " model_blessing=evaluator.outputs['blessing'],\n",
- " push_destination=tfx.proto.PushDestination(\n",
- " filesystem=tfx.proto.PushDestination.Filesystem(\n",
- " base_directory=_serving_model_dir)))\n",
- "context.run(pusher)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ctUErBYoTO9I"
- },
- "source": [
- "Let's examine the output artifacts of `Pusher`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pRkWo-MzTSss"
- },
- "outputs": [],
- "source": [
- "pusher.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "peH2PPS3VgkL"
- },
- "source": [
- "In particular, the Pusher will export your model in the SavedModel format, which looks like this:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4zyIqWl9TSdG"
- },
- "outputs": [],
- "source": [
- "push_uri = pusher.outputs['pushed_model'].get()[0].uri\n",
- "model = tf.saved_model.load(push_uri)\n",
- "\n",
- "for item in model.signatures.items():\n",
- " pp.pprint(item)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3-YPNUuHANtj"
- },
- "source": [
-        "We've finished our tour of built-in TFX components!"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "collapsed_sections": [
- "wdeKOEkv1Fe8"
- ],
- "name": "components.ipynb",
- "private_outputs": true,
- "provenance": [],
- "toc_visible": true
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RBbTLeWmWs8q"
+ },
+ "source": [
+ "> Warning: Estimators are not recommended for new code. Estimators run `v1.Session`-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our [compatibility guarantees](https://tensorflow.org/guide/versions), but will receive no fixes other than security vulnerabilities. See the [migration guide](https://tensorflow.org/guide/migrate) for details."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KAD1tLoTm_QS"
+ },
+ "source": [
+ "\n",
+ "This Colab-based tutorial will interactively walk through each built-in component of TensorFlow Extended (TFX).\n",
+ "\n",
+ "It covers every step in an end-to-end machine learning pipeline, from data ingestion to pushing a model to serving.\n",
+ "\n",
+ "When you're done, the contents of this notebook can be automatically exported as TFX pipeline source code, which you can orchestrate with Apache Airflow and Apache Beam.\n",
+ "\n",
+ "Note: This notebook and its associated APIs are **experimental** and are\n",
+ "in active development. Major changes in functionality, behavior, and\n",
+ "presentation are expected."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sfSQ-kX-MLEr"
+ },
+ "source": [
+ "## Background\n",
+ "This notebook demonstrates how to use TFX in a Jupyter/Colab environment. Here, we walk through the Chicago Taxi example in an interactive notebook.\n",
+ "\n",
+ "Working in an interactive notebook is a useful way to become familiar with the structure of a TFX pipeline. It's also useful when doing development of your own pipelines as a lightweight development environment, but you should be aware that there are differences in the way interactive notebooks are orchestrated, and how they access metadata artifacts.\n",
+ "\n",
+ "### Orchestration\n",
+ "\n",
+ "In a production deployment of TFX, you will use an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components. In an interactive notebook, the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells.\n",
+ "\n",
+ "### Metadata\n",
+ "\n",
+ "In a production deployment of TFX, you will access metadata through the ML Metadata (MLMD) API. MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as on your filesystem. In an interactive notebook, both properties and payloads are stored in an ephemeral SQLite database in the `/tmp` directory on the Jupyter notebook or Colab server."
+ ]
+ },
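The ephemeral-metadata point can be illustrated with the standard library: in a notebook, MLMD is just a SQLite file holding artifact properties, while the payloads live as files on disk. A rough sketch of that idea, using a toy schema rather than the real MLMD one:

```python
import os
import sqlite3
import tempfile

# Toy stand-in for the ephemeral notebook metadata store: artifact
# properties live in a SQLite database, payloads live on the filesystem.
# (The schema here is illustrative, not the real MLMD schema.)
root = tempfile.mkdtemp(prefix='toy-mlmd-')
conn = sqlite3.connect(os.path.join(root, 'metadata.sqlite'))
conn.execute('CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT)')

# Register a fake Examples artifact whose payload is a directory on disk.
payload_dir = os.path.join(root, 'CsvExampleGen', 'examples', '1')
os.makedirs(payload_dir)
conn.execute('INSERT INTO artifact (type, uri) VALUES (?, ?)',
             ('Examples', payload_dir))
conn.commit()

for row in conn.execute('SELECT type, uri FROM artifact'):
    print(row)
```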
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2GivNBNYjb3b"
+ },
+ "source": [
+ "## Setup\n",
+ "First, we install and import the necessary packages, set up paths, and download data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cDl_6DkqJ-pG"
+ },
+ "source": [
+ "### Upgrade Pip\n",
+ "\n",
+        "To avoid upgrading Pip on the host system when running locally, we check to make sure that we're running in Colab. Local systems can, of course, be upgraded separately."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tFhBChv4J_PD"
+ },
+ "outputs": [],
+ "source": [
+ "try:\n",
+ " import colab\n",
+ " !pip install --upgrade pip\n",
+ "except:\n",
+ " pass"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MZOYTt1RW4TK"
+ },
+ "source": [
+ "### Install TFX\n",
+ "\n",
+ "**Note: In Google Colab, because of package updates, the first time you run this cell you must restart the runtime (Runtime > Restart runtime ...).**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "S4SQA7Q5nej3"
+ },
+ "outputs": [],
+ "source": [
+ "# TFX has a constraint of 1.16 due to the removal of tf.estimator support.\n",
+ "!pip install \"tfx<1.16\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "szPQ2MDYPZ5j"
+ },
+ "source": [
+ "## Did you restart the runtime?\n",
+ "\n",
+ "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime ...). This is because of the way that Colab loads packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N-ePgV0Lj68Q"
+ },
+ "source": [
+ "### Import packages\n",
+ "We import necessary packages, including standard TFX component classes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "YIqpWK9efviJ"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pprint\n",
+ "import tempfile\n",
+ "import urllib\n",
+ "\n",
+ "import absl\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_model_analysis as tfma\n",
+ "tf.get_logger().propagate = False\n",
+ "pp = pprint.PrettyPrinter()\n",
+ "\n",
+ "from tfx import v1 as tfx\n",
+ "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
+ "\n",
+ "%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wCZTHRy0N1D6"
+ },
+ "source": [
+ "Let's check the library versions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "eZ4K18_DN2D8"
+ },
+ "outputs": [],
+ "source": [
+ "print('TensorFlow version: {}'.format(tf.__version__))\n",
+ "print('TFX version: {}'.format(tfx.__version__))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ufJKQ6OvkJlY"
+ },
+ "source": [
+ "### Set up pipeline paths"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ad5JLpKbf6sN"
+ },
+ "outputs": [],
+ "source": [
+ "# This is the root directory for your TFX pip package installation.\n",
+ "_tfx_root = tfx.__path__[0]\n",
+ "\n",
+ "# This is the directory containing the TFX Chicago Taxi Pipeline example.\n",
+ "_taxi_root = os.path.join(_tfx_root, 'examples/chicago_taxi_pipeline')\n",
+ "\n",
+ "# This is the path where your model will be pushed for serving.\n",
+ "_serving_model_dir = os.path.join(\n",
+ " tempfile.mkdtemp(), 'serving_model/taxi_simple')\n",
+ "\n",
+ "# Set up logging.\n",
+ "absl.logging.set_verbosity(absl.logging.INFO)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "n2cMMAbSkGfX"
+ },
+ "source": [
+ "### Download example data\n",
+ "We download the example dataset for use in our TFX pipeline.\n",
+ "\n",
+ "The dataset we're using is the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) released by the City of Chicago. The columns in this dataset are:\n",
+ "\n",
+ "\n",
+        "<table>\n",
+        "<tr><td>pickup_community_area</td><td>fare</td><td>trip_start_month</td></tr>\n",
+        "<tr><td>trip_start_hour</td><td>trip_start_day</td><td>trip_start_timestamp</td></tr>\n",
+        "<tr><td>pickup_latitude</td><td>pickup_longitude</td><td>dropoff_latitude</td></tr>\n",
+        "<tr><td>dropoff_longitude</td><td>trip_miles</td><td>pickup_census_tract</td></tr>\n",
+        "<tr><td>dropoff_census_tract</td><td>payment_type</td><td>company</td></tr>\n",
+        "<tr><td>trip_seconds</td><td>dropoff_community_area</td><td>tips</td></tr>\n",
+        "</table>\n",
+ "\n",
+ "With this dataset, we will build a model that predicts the `tips` of a trip."
+ ]
+ },
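Note that the label used in the Transform step later in this notebook is not the raw tip amount but whether the tip exceeded 20% of the fare. In plain Python, that derivation is just a comparison (the function name below is illustrative):

```python
# Plain-Python version of the label derived in the Transform step:
# 1 if the tip was greater than 20% of the fare, else 0.
def big_tipper_label(tips, fare):
    return int(tips > 0.2 * fare)

print(big_tipper_label(tips=3.0, fare=10.0))  # 3.00 > 2.00 -> 1
print(big_tipper_label(tips=1.5, fare=10.0))  # 1.50 <= 2.00 -> 0
```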
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "BywX6OUEhAqn"
+ },
+ "outputs": [],
+ "source": [
+ "_data_root = tempfile.mkdtemp(prefix='tfx-data')\n",
+ "DATA_PATH = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv'\n",
+ "_data_filepath = os.path.join(_data_root, \"data.csv\")\n",
+ "urllib.request.urlretrieve(DATA_PATH, _data_filepath)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "blZC1sIQOWfH"
+ },
+ "source": [
+ "Take a quick look at the CSV file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "c5YPeLPFOXaD"
+ },
+ "outputs": [],
+ "source": [
+ "!head {_data_filepath}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QioyhunCImwE"
+ },
+ "source": [
+ "*Disclaimer: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8ONIE_hdkPS4"
+ },
+ "source": [
+ "### Create the InteractiveContext\n",
+ "Last, we create an InteractiveContext, which will allow us to run TFX components interactively in this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0Rh6K5sUf9dd"
+ },
+ "outputs": [],
+ "source": [
+ "# Here, we create an InteractiveContext using default parameters. This will\n",
+ "# use a temporary directory with an ephemeral ML Metadata database instance.\n",
+ "# To use your own pipeline root or database, the optional properties\n",
+ "# `pipeline_root` and `metadata_connection_config` may be passed to\n",
+ "# InteractiveContext. Calls to InteractiveContext are no-ops outside of the\n",
+ "# notebook.\n",
+ "context = InteractiveContext()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HdQWxfsVkzdJ"
+ },
+ "source": [
+ "## Run TFX components interactively\n",
+ "In the cells that follow, we create TFX components one-by-one, run each of them, and visualize their output artifacts."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L9fwt9gQk3BR"
+ },
+ "source": [
+ "### ExampleGen\n",
+ "\n",
+ "The `ExampleGen` component is usually at the start of a TFX pipeline. It will:\n",
+ "\n",
+ "1. Split data into training and evaluation sets (by default, 2/3 training + 1/3 eval)\n",
+ "2. Convert data into the `tf.Example` format (learn more [here](https://www.tensorflow.org/tutorials/load_data/tfrecord))\n",
+ "3. Copy data into the `_tfx_root` directory for other components to access\n",
+ "\n",
+ "`ExampleGen` takes as input the path to your data source. In our case, this is the `_data_root` path that contains the downloaded CSV.\n",
+ "\n",
+ "Note: In this notebook, we can instantiate components one-by-one and run them with `InteractiveContext.run()`. By contrast, in a production setting, we would specify all the components upfront in a `Pipeline` to pass to the orchestrator (see the [Building a TFX Pipeline Guide](../../../guide/build_tfx_pipeline))."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "PyXjuMt8f-9u"
+ },
+ "outputs": [],
+ "source": [
+ "example_gen = tfx.components.CsvExampleGen(input_base=_data_root)\n",
+ "context.run(example_gen)"
+ ]
+ },
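By default, `ExampleGen` decides each record's split by hashing it. The snippet below is a simplified, plain-Python sketch of that 2/3 train / 1/3 eval assignment (the real component does this inside an Apache Beam pipeline; `assign_split` and its bucket parameters are illustrative, not TFX APIs):

```python
import hashlib

def assign_split(record: bytes, train_buckets: int = 2, total_buckets: int = 3) -> str:
    """Deterministically assign a record to 'train' or 'eval' by hashing it.

    Simplified sketch of ExampleGen's default 2/3 train, 1/3 eval split.
    """
    bucket = int(hashlib.sha256(record).hexdigest(), 16) % total_buckets
    return 'train' if bucket < train_buckets else 'eval'

# Over many records the split converges toward the 2/3 : 1/3 ratio.
records = [f'row-{i}'.encode() for i in range(3000)]
splits = [assign_split(r) for r in records]
train_frac = splits.count('train') / len(splits)
```

Because the assignment hashes the record itself, re-running the pipeline puts each record in the same split every time.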
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OqCoZh7KPUm9"
+ },
+ "source": [
+ "Let's examine the output artifacts of `ExampleGen`. This component produces two artifacts, training examples and evaluation examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "880KkTAkPeUg"
+ },
+ "outputs": [],
+ "source": [
+ "artifact = example_gen.outputs['examples'].get()[0]\n",
+ "print(artifact.split_names, artifact.uri)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J6vcbW_wPqvl"
+ },
+ "source": [
+ "We can also take a look at the first three training examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "H4XIXjiCPwzQ"
+ },
+ "outputs": [],
+ "source": [
+ "# Get the URI of the output artifact representing the training examples, which is a directory\n",
+ "train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')\n",
+ "\n",
+ "# Get the list of files in this directory (all compressed TFRecord files)\n",
+ "tfrecord_filenames = [os.path.join(train_uri, name)\n",
+ " for name in os.listdir(train_uri)]\n",
+ "\n",
+ "# Create a `TFRecordDataset` to read these files\n",
+ "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ "\n",
+ "# Iterate over the first 3 records and decode them.\n",
+ "for tfrecord in dataset.take(3):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
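The TFRecord files read above frame each record as an 8-byte little-endian length, a 4-byte masked CRC32C of the length, the payload bytes, and a 4-byte masked CRC32C of the payload. Here is a minimal sketch of that framing which zeroes and skips the CRC fields (real files carry proper masked CRC32C checksums, so TensorFlow would reject records written this way; the sketch only illustrates the layout):

```python
import io
import struct

def write_records(records, stream):
    """Write payloads with TFRecord-style framing; CRC fields are zeroed
    here (real TFRecord files store masked CRC32C checksums)."""
    for data in records:
        stream.write(struct.pack('<Q', len(data)))  # 8-byte little-endian length
        stream.write(b'\x00' * 4)                   # length CRC (omitted)
        stream.write(data)
        stream.write(b'\x00' * 4)                   # data CRC (omitted)

def read_records(stream):
    """Yield payloads from a TFRecord-style stream, ignoring CRC fields."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        length, = struct.unpack('<Q', header)
        stream.read(4)                              # skip length CRC
        yield stream.read(length)
        stream.read(4)                              # skip data CRC

buf = io.BytesIO()
write_records([b'first', b'second', b'third'], buf)
buf.seek(0)
payloads = list(read_records(buf))
```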
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2gluYjccf-IP"
+ },
+ "source": [
+ "Now that `ExampleGen` has finished ingesting the data, the next step is data analysis."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "csM6BFhtk5Aa"
+ },
+ "source": [
+ "### StatisticsGen\n",
+ "The `StatisticsGen` component computes statistics over your dataset for data analysis, as well as for use in downstream components. It uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "`StatisticsGen` takes as input the dataset we just ingested using `ExampleGen`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MAscCCYWgA-9"
+ },
+ "outputs": [],
+ "source": [
+ "statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])\n",
+ "context.run(statistics_gen)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HLI6cb_5WugZ"
+ },
+ "source": [
+ "After `StatisticsGen` finishes running, we can visualize the output statistics. Try playing with the different plots!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tLjXy7K6Tp_G"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(statistics_gen.outputs['statistics'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HLKLTO9Nk60p"
+ },
+ "source": [
+ "### SchemaGen\n",
+ "\n",
+ "The `SchemaGen` component generates a schema based on your data statistics. (A schema defines the expected bounds, types, and properties of the features in your dataset.) It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ygQvZ6hsiQ_J"
+ },
+ "outputs": [],
+ "source": [
+ "schema_gen = tfx.components.SchemaGen(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " infer_feature_shape=False)\n",
+ "context.run(schema_gen)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zi6TxTUKXM6b"
+ },
+ "source": [
+ "After `SchemaGen` finishes running, we can visualize the generated schema as a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Ec9vqDXpXeMb"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(schema_gen.outputs['schema'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kZWWdbA-m7zp"
+ },
+ "source": [
+ "Each feature in your dataset shows up as a row in the schema table, alongside its properties. The schema also captures all the values that a categorical feature takes on, denoted as its domain.\n",
+ "\n",
+ "To learn more about schemas, see [the SchemaGen documentation](../../../guide/schemagen)."
+ ]
+ },
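To build intuition for what `SchemaGen` infers, here is a toy, plain-Python sketch that records each feature's type and, for string features, its observed value domain (`infer_schema` is illustrative only; the real component derives the schema from the full TFDV statistics computed by `StatisticsGen`):

```python
def infer_schema(rows):
    """Infer a toy schema: per-feature type plus, for string features,
    the set of observed values (its 'domain')."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            entry = schema.setdefault(
                key, {'type': type(value).__name__, 'domain': set()})
            if isinstance(value, str):
                entry['domain'].add(value)
    return schema

rows = [
    {'payment_type': 'Cash', 'fare': 5.0},
    {'payment_type': 'Credit Card', 'fare': 12.5},
    {'payment_type': 'Cash', 'fare': 7.25},
]
schema = infer_schema(rows)
```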
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "V1qcUuO9k9f8"
+ },
+ "source": [
+ "### ExampleValidator\n",
+ "The `ExampleValidator` component detects anomalies in your data, based on the expectations defined by the schema. It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "`ExampleValidator` will take as input the statistics from `StatisticsGen`, and the schema from `SchemaGen`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XRlRUuGgiXks"
+ },
+ "outputs": [],
+ "source": [
+ "example_validator = tfx.components.ExampleValidator(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " schema=schema_gen.outputs['schema'])\n",
+ "context.run(example_validator)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "855mrHgJcoer"
+ },
+ "source": [
+ "After `ExampleValidator` finishes running, we can visualize the anomalies as a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "TDyAAozQcrk3"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(example_validator.outputs['anomalies'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "znMoJj60ybZx"
+ },
+ "source": [
+ "In the anomalies table, we can see that there are no anomalies. This is what we'd expect, since this is the first dataset that we've analyzed and the schema is tailored to it. You should review this schema -- anything unexpected signals an anomaly in the data. Once reviewed, the schema can be used to guard future data, and anomalies produced here can be used to debug model performance, understand how your data evolves over time, and identify data errors."
+ ]
+ },
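As a rough mental model of the domain checks `ExampleValidator` performs, the toy sketch below flags values that fall outside a feature's expected domain (`find_anomalies` is illustrative; TFDV's real validation also covers types, ranges, missingness, and drift):

```python
def find_anomalies(rows, domains):
    """Return (row_index, feature, value) triples for values outside a
    feature's expected domain -- a toy version of a TFDV domain check."""
    anomalies = []
    for i, row in enumerate(rows):
        for key, allowed in domains.items():
            if key in row and row[key] not in allowed:
                anomalies.append((i, key, row[key]))
    return anomalies

domains = {'payment_type': {'Cash', 'Credit Card'}}
ok = find_anomalies([{'payment_type': 'Cash'}], domains)
bad = find_anomalies([{'payment_type': 'Bitcoin'}], domains)
```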
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JPViEz5RlA36"
+ },
+ "source": [
+ "### Transform\n",
+ "The `Transform` component performs feature engineering for both training and serving. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library.\n",
+ "\n",
+ "`Transform` will take as input the data from `ExampleGen`, the schema from `SchemaGen`, as well as a module that contains user-defined Transform code.\n",
+ "\n",
+ "Let's see an example of user-defined Transform code below (for an introduction to the TensorFlow Transform APIs, [see the tutorial](/tutorials/transform/simple)). First, we define a few constants for feature engineering:\n",
+ "\n",
+ "Note: The `%%writefile` cell magic will save the contents of the cell as a `.py` file on disk. This allows the `Transform` component to load your code as a module.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "PuNSiUKb4YJf"
+ },
+ "outputs": [],
+ "source": [
+ "_taxi_constants_module_file = 'taxi_constants.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "HPjhXuIF4YJh"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_taxi_constants_module_file}\n",
+ "\n",
+ "# Categorical features are assumed to each have a maximum value in the dataset.\n",
+ "MAX_CATEGORICAL_FEATURE_VALUES = [24, 31, 12]\n",
+ "\n",
+ "CATEGORICAL_FEATURE_KEYS = [\n",
+ " 'trip_start_hour', 'trip_start_day', 'trip_start_month',\n",
+ " 'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',\n",
+ " 'dropoff_community_area'\n",
+ "]\n",
+ "\n",
+ "DENSE_FLOAT_FEATURE_KEYS = ['trip_miles', 'fare', 'trip_seconds']\n",
+ "\n",
+ "# Number of buckets used by tf.transform for encoding each feature.\n",
+ "FEATURE_BUCKET_COUNT = 10\n",
+ "\n",
+ "BUCKET_FEATURE_KEYS = [\n",
+ " 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',\n",
+ " 'dropoff_longitude'\n",
+ "]\n",
+ "\n",
+ "# Number of vocabulary terms used for encoding VOCAB_FEATURES by tf.transform\n",
+ "VOCAB_SIZE = 1000\n",
+ "\n",
+ "# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.\n",
+ "OOV_SIZE = 10\n",
+ "\n",
+ "VOCAB_FEATURE_KEYS = [\n",
+ " 'payment_type',\n",
+ " 'company',\n",
+ "]\n",
+ "\n",
+ "# Keys\n",
+ "LABEL_KEY = 'tips'\n",
+ "FARE_KEY = 'fare'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Duj2Ax5z4YJl"
+ },
+ "source": [
+ "Next, we write a `preprocessing_fn` that takes in raw data as input, and returns transformed features that our model can train on:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4AJ9hBs94YJm"
+ },
+ "outputs": [],
+ "source": [
+ "_taxi_transform_module_file = 'taxi_transform.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MYmxxx9A4YJn"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_taxi_transform_module_file}\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_transform as tft\n",
+ "\n",
+ "import taxi_constants\n",
+ "\n",
+ "_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS\n",
+ "_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS\n",
+ "_VOCAB_SIZE = taxi_constants.VOCAB_SIZE\n",
+ "_OOV_SIZE = taxi_constants.OOV_SIZE\n",
+ "_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT\n",
+ "_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS\n",
+ "_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS\n",
+ "_FARE_KEY = taxi_constants.FARE_KEY\n",
+ "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
+ "\n",
+ "\n",
+ "def preprocessing_fn(inputs):\n",
+ " \"\"\"tf.transform's callback function for preprocessing inputs.\n",
+ " Args:\n",
+ " inputs: map from feature keys to raw not-yet-transformed features.\n",
+ " Returns:\n",
+ " Map from string feature key to transformed feature operations.\n",
+ " \"\"\"\n",
+ " outputs = {}\n",
+ " for key in _DENSE_FLOAT_FEATURE_KEYS:\n",
+ " # If sparse make it dense, setting nan's to 0 or '', and apply zscore.\n",
+ " outputs[key] = tft.scale_to_z_score(\n",
+ " _fill_in_missing(inputs[key]))\n",
+ "\n",
+ " for key in _VOCAB_FEATURE_KEYS:\n",
+ " # Build a vocabulary for this feature.\n",
+ " outputs[key] = tft.compute_and_apply_vocabulary(\n",
+ " _fill_in_missing(inputs[key]),\n",
+ " top_k=_VOCAB_SIZE,\n",
+ " num_oov_buckets=_OOV_SIZE)\n",
+ "\n",
+ " for key in _BUCKET_FEATURE_KEYS:\n",
+ " outputs[key] = tft.bucketize(\n",
+ " _fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT)\n",
+ "\n",
+ " for key in _CATEGORICAL_FEATURE_KEYS:\n",
+ " outputs[key] = _fill_in_missing(inputs[key])\n",
+ "\n",
+ " # Was this passenger a big tipper?\n",
+ " taxi_fare = _fill_in_missing(inputs[_FARE_KEY])\n",
+ " tips = _fill_in_missing(inputs[_LABEL_KEY])\n",
+ " outputs[_LABEL_KEY] = tf.where(\n",
+ " tf.math.is_nan(taxi_fare),\n",
+ " tf.cast(tf.zeros_like(taxi_fare), tf.int64),\n",
+ " # Test if the tip was > 20% of the fare.\n",
+ " tf.cast(\n",
+ " tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))\n",
+ "\n",
+ " return outputs\n",
+ "\n",
+ "\n",
+ "def _fill_in_missing(x):\n",
+ " \"\"\"Replace missing values in a SparseTensor.\n",
+ " Fills in missing values of `x` with '' or 0, and converts to a dense tensor.\n",
+ " Args:\n",
+ " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
+ " in the second dimension.\n",
+ " Returns:\n",
+ " A rank 1 tensor where missing values of `x` have been filled in.\n",
+ " \"\"\"\n",
+ " if not isinstance(x, tf.sparse.SparseTensor):\n",
+ " return x\n",
+ "\n",
+ " default_value = '' if x.dtype == tf.string else 0\n",
+ " return tf.squeeze(\n",
+ " tf.sparse.to_dense(\n",
+ " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
+ " default_value),\n",
+ " axis=1)"
+ ]
+ },
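Two of the transforms in the `preprocessing_fn` above are easy to sketch in plain Python: z-score scaling (what `tft.scale_to_z_score` computes in a full pass over the data) and the binary "big tipper" label. These helpers are illustrative stand-ins, not TFT APIs:

```python
def scale_to_z_score(values):
    """Standardize values to zero mean and unit variance, mirroring the
    full-pass computation of tft.scale_to_z_score."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def big_tipper_label(fare, tips):
    """1 if the tip exceeded 20% of the fare, else 0 -- the same rule the
    preprocessing_fn applies with tf.greater."""
    return int(tips > fare * 0.2)

scaled = scale_to_z_score([5.0, 10.0, 15.0])
```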
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wgbmZr3sgbWW"
+ },
+ "source": [
+ "Now, we pass in this feature engineering code to the `Transform` component and run it to transform your data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "jHfhth_GiZI9"
+ },
+ "outputs": [],
+ "source": [
+ "transform = tfx.components.Transform(\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " module_file=os.path.abspath(_taxi_transform_module_file))\n",
+ "context.run(transform)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fwAwb4rARRQ2"
+ },
+ "source": [
+ "Let's examine the output artifacts of `Transform`. This component produces two types of outputs:\n",
+ "\n",
+ "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
+ "* `transformed_examples` represents the preprocessed training and evaluation data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SClrAaEGR1O5"
+ },
+ "outputs": [],
+ "source": [
+ "transform.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vyFkBd9AR1sy"
+ },
+ "source": [
+ "Take a peek at the `transform_graph` artifact. It points to a directory containing three subdirectories."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "5tRw4DneR3i7"
+ },
+ "outputs": [],
+ "source": [
+ "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
+ "os.listdir(train_uri)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4fqV54CIR6Pu"
+ },
+ "source": [
+ "The `transformed_metadata` subdirectory contains the schema of the preprocessed data. The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data.\n",
+ "\n",
+ "We can also take a look at the first three transformed examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pwbW2zPKR_S4"
+ },
+ "outputs": [],
+ "source": [
+ "# Get the URI of the output artifact representing the transformed examples, which is a directory\n",
+ "train_uri = os.path.join(transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')\n",
+ "\n",
+ "# Get the list of files in this directory (all compressed TFRecord files)\n",
+ "tfrecord_filenames = [os.path.join(train_uri, name)\n",
+ " for name in os.listdir(train_uri)]\n",
+ "\n",
+ "# Create a `TFRecordDataset` to read these files\n",
+ "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ "\n",
+ "# Iterate over the first 3 records and decode them.\n",
+ "for tfrecord in dataset.take(3):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q_b_V6eN4f69"
+ },
+ "source": [
+ "Now that the `Transform` component has transformed your data into features, the next step is to train a model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OBJFtnl6lCg9"
+ },
+ "source": [
+ "### Trainer\n",
+ "The `Trainer` component will train a model that you define in TensorFlow (either using the Estimator API or the Keras API with [`model_to_estimator`](https://www.tensorflow.org/api_docs/python/tf/keras/estimator/model_to_estimator)).\n",
+ "\n",
+ "`Trainer` takes as input the schema from `SchemaGen`, the transformed data and graph from `Transform`, training parameters, as well as a module that contains user-defined model code.\n",
+ "\n",
+ "Let's see an example of user-defined model code below (for an introduction to the TensorFlow Estimator APIs, [see the tutorial](https://www.tensorflow.org/tutorials/estimator/premade)):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "N1376oq04YJt"
+ },
+ "outputs": [],
+ "source": [
+ "_taxi_trainer_module_file = 'taxi_trainer.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "nf9UuNng4YJu"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_taxi_trainer_module_file}\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_model_analysis as tfma\n",
+ "import tensorflow_transform as tft\n",
+ "from tensorflow_transform.tf_metadata import schema_utils\n",
+ "from tfx_bsl.tfxio import dataset_options\n",
+ "\n",
+ "import taxi_constants\n",
+ "\n",
+ "_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS\n",
+ "_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS\n",
+ "_VOCAB_SIZE = taxi_constants.VOCAB_SIZE\n",
+ "_OOV_SIZE = taxi_constants.OOV_SIZE\n",
+ "_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT\n",
+ "_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS\n",
+ "_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS\n",
+ "_MAX_CATEGORICAL_FEATURE_VALUES = taxi_constants.MAX_CATEGORICAL_FEATURE_VALUES\n",
+ "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
+ "\n",
+ "\n",
+ "# Tf.Transform considers these features as \"raw\"\n",
+ "def _get_raw_feature_spec(schema):\n",
+ " return schema_utils.schema_as_feature_spec(schema).feature_spec\n",
+ "\n",
+ "\n",
+ "def _build_estimator(config, hidden_units=None, warm_start_from=None):\n",
+ " \"\"\"Build an estimator for predicting the tipping behavior of taxi riders.\n",
+ " Args:\n",
+ " config: tf.estimator.RunConfig defining the runtime environment for the\n",
+ " estimator (including model_dir).\n",
+ " hidden_units: [int], the layer sizes of the DNN (input layer first)\n",
+ " warm_start_from: Optional directory to warm start from.\n",
+ " Returns:\n",
+ " A dict of the following:\n",
+ " - estimator: The estimator that will be used for training and eval.\n",
+ " - train_spec: Spec for training.\n",
+ " - eval_spec: Spec for eval.\n",
+ " - eval_input_receiver_fn: Input function for eval.\n",
+ " \"\"\"\n",
+ " real_valued_columns = [\n",
+ " tf.feature_column.numeric_column(key, shape=())\n",
+ " for key in _DENSE_FLOAT_FEATURE_KEYS\n",
+ " ]\n",
+ " categorical_columns = [\n",
+ " tf.feature_column.categorical_column_with_identity(\n",
+ " key, num_buckets=_VOCAB_SIZE + _OOV_SIZE, default_value=0)\n",
+ " for key in _VOCAB_FEATURE_KEYS\n",
+ " ]\n",
+ " categorical_columns += [\n",
+ " tf.feature_column.categorical_column_with_identity(\n",
+ " key, num_buckets=_FEATURE_BUCKET_COUNT, default_value=0)\n",
+ " for key in _BUCKET_FEATURE_KEYS\n",
+ " ]\n",
+ " categorical_columns += [\n",
+ " tf.feature_column.categorical_column_with_identity( # pylint: disable=g-complex-comprehension\n",
+ " key,\n",
+ " num_buckets=num_buckets,\n",
+ " default_value=0) for key, num_buckets in zip(\n",
+ " _CATEGORICAL_FEATURE_KEYS,\n",
+ " _MAX_CATEGORICAL_FEATURE_VALUES)\n",
+ " ]\n",
+ " return tf.estimator.DNNLinearCombinedClassifier(\n",
+ " config=config,\n",
+ " linear_feature_columns=categorical_columns,\n",
+ " dnn_feature_columns=real_valued_columns,\n",
+ " dnn_hidden_units=hidden_units or [100, 70, 50, 25],\n",
+ " warm_start_from=warm_start_from)\n",
+ "\n",
+ "\n",
+ "def _example_serving_receiver_fn(tf_transform_graph, schema):\n",
+ " \"\"\"Build the serving inputs.\n",
+ " Args:\n",
+ " tf_transform_graph: A TFTransformOutput.\n",
+ " schema: the schema of the input data.\n",
+ " Returns:\n",
+ " Tensorflow graph which parses examples, applying tf-transform to them.\n",
+ " \"\"\"\n",
+ " raw_feature_spec = _get_raw_feature_spec(schema)\n",
+ " raw_feature_spec.pop(_LABEL_KEY)\n",
+ "\n",
+ " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
+ " raw_feature_spec, default_batch_size=None)\n",
+ " serving_input_receiver = raw_input_fn()\n",
+ "\n",
+ " transformed_features = tf_transform_graph.transform_raw_features(\n",
+ " serving_input_receiver.features)\n",
+ "\n",
+ " return tf.estimator.export.ServingInputReceiver(\n",
+ " transformed_features, serving_input_receiver.receiver_tensors)\n",
+ "\n",
+ "\n",
+ "def _eval_input_receiver_fn(tf_transform_graph, schema):\n",
+ " \"\"\"Build everything needed for the tf-model-analysis to run the model.\n",
+ " Args:\n",
+ " tf_transform_graph: A TFTransformOutput.\n",
+ " schema: the schema of the input data.\n",
+ " Returns:\n",
+ " EvalInputReceiver function, which contains:\n",
+ " - Tensorflow graph which parses raw untransformed features, applies the\n",
+ " tf-transform preprocessing operators.\n",
+ " - Set of raw, untransformed features.\n",
+ " - Label against which predictions will be compared.\n",
+ " \"\"\"\n",
+ " # Notice that the inputs are raw features, not transformed features here.\n",
+ " raw_feature_spec = _get_raw_feature_spec(schema)\n",
+ "\n",
+ " serialized_tf_example = tf.compat.v1.placeholder(\n",
+ " dtype=tf.string, shape=[None], name='input_example_tensor')\n",
+ "\n",
+ " # Add a parse_example operator to the tensorflow graph, which will parse\n",
+ " # raw, untransformed, tf examples.\n",
+ " features = tf.io.parse_example(serialized_tf_example, raw_feature_spec)\n",
+ "\n",
+ " # Now that we have our raw examples, process them through the tf-transform\n",
+ " # function computed during the preprocessing step.\n",
+ " transformed_features = tf_transform_graph.transform_raw_features(\n",
+ " features)\n",
+ "\n",
+ " # The key name MUST be 'examples'.\n",
+ " receiver_tensors = {'examples': serialized_tf_example}\n",
+ "\n",
+ " # NOTE: Model is driven by transformed features (since training works on the\n",
+ " # materialized output of TFT), but slicing will happen on raw features.\n",
+ " features.update(transformed_features)\n",
+ "\n",
+ " return tfma.export.EvalInputReceiver(\n",
+ " features=features,\n",
+ " receiver_tensors=receiver_tensors,\n",
+ " labels=transformed_features[_LABEL_KEY])\n",
+ "\n",
+ "\n",
+ "def _input_fn(file_pattern, data_accessor, tf_transform_output, batch_size=200):\n",
+ " \"\"\"Generates features and label for tuning/training.\n",
+ "\n",
+ " Args:\n",
+ " file_pattern: List of paths or patterns of input tfrecord files.\n",
+ " data_accessor: DataAccessor for converting input to RecordBatch.\n",
+ " tf_transform_output: A TFTransformOutput.\n",
+ " batch_size: representing the number of consecutive elements of returned\n",
+ " dataset to combine in a single batch\n",
+ "\n",
+ " Returns:\n",
+ " A dataset that contains (features, indices) tuple where features is a\n",
+ " dictionary of Tensors, and indices is a single Tensor of label indices.\n",
+ " \"\"\"\n",
+ " return data_accessor.tf_dataset_factory(\n",
+ " file_pattern,\n",
+ " dataset_options.TensorFlowDatasetOptions(\n",
+ " batch_size=batch_size, label_key=_LABEL_KEY),\n",
+ " tf_transform_output.transformed_metadata.schema)\n",
+ "\n",
+ "\n",
+ "# TFX will call this function\n",
+ "def trainer_fn(trainer_fn_args, schema):\n",
+ " \"\"\"Build the estimator using the high level API.\n",
+ " Args:\n",
+ " trainer_fn_args: Holds args used to train the model as name/value pairs.\n",
+ " schema: Holds the schema of the training examples.\n",
+ " Returns:\n",
+ " A dict of the following:\n",
+ " - estimator: The estimator that will be used for training and eval.\n",
+ " - train_spec: Spec for training.\n",
+ " - eval_spec: Spec for eval.\n",
+ " - eval_input_receiver_fn: Input function for eval.\n",
+ " \"\"\"\n",
+ " # Number of nodes in the first layer of the DNN\n",
+ " first_dnn_layer_size = 100\n",
+ " num_dnn_layers = 4\n",
+ " dnn_decay_factor = 0.7\n",
+ "\n",
+ " train_batch_size = 40\n",
+ " eval_batch_size = 40\n",
+ "\n",
+ " tf_transform_graph = tft.TFTransformOutput(trainer_fn_args.transform_output)\n",
+ "\n",
+ " train_input_fn = lambda: _input_fn( # pylint: disable=g-long-lambda\n",
+ " trainer_fn_args.train_files,\n",
+ " trainer_fn_args.data_accessor,\n",
+ " tf_transform_graph,\n",
+ " batch_size=train_batch_size)\n",
+ "\n",
+ " eval_input_fn = lambda: _input_fn( # pylint: disable=g-long-lambda\n",
+ " trainer_fn_args.eval_files,\n",
+ " trainer_fn_args.data_accessor,\n",
+ " tf_transform_graph,\n",
+ " batch_size=eval_batch_size)\n",
+ "\n",
+ " train_spec = tf.estimator.TrainSpec( # pylint: disable=g-long-lambda\n",
+ " train_input_fn,\n",
+ " max_steps=trainer_fn_args.train_steps)\n",
+ "\n",
+ " serving_receiver_fn = lambda: _example_serving_receiver_fn( # pylint: disable=g-long-lambda\n",
+ " tf_transform_graph, schema)\n",
+ "\n",
+ " exporter = tf.estimator.FinalExporter('chicago-taxi', serving_receiver_fn)\n",
+ " eval_spec = tf.estimator.EvalSpec(\n",
+ " eval_input_fn,\n",
+ " steps=trainer_fn_args.eval_steps,\n",
+ " exporters=[exporter],\n",
+ " name='chicago-taxi-eval')\n",
+ "\n",
+ " run_config = tf.estimator.RunConfig(\n",
+ " save_checkpoints_steps=999, keep_checkpoint_max=1)\n",
+ "\n",
+ " run_config = run_config.replace(model_dir=trainer_fn_args.serving_model_dir)\n",
+ "\n",
+ " estimator = _build_estimator(\n",
+ " # Construct layers sizes with exponential decay\n",
+ " hidden_units=[\n",
+ " max(2, int(first_dnn_layer_size * dnn_decay_factor**i))\n",
+ " for i in range(num_dnn_layers)\n",
+ " ],\n",
+ " config=run_config,\n",
+ " warm_start_from=trainer_fn_args.base_model)\n",
+ "\n",
+ " # Create an input receiver for TFMA processing\n",
+ " receiver_fn = lambda: _eval_input_receiver_fn( # pylint: disable=g-long-lambda\n",
+ " tf_transform_graph, schema)\n",
+ "\n",
+ " return {\n",
+ " 'estimator': estimator,\n",
+ " 'train_spec': train_spec,\n",
+ " 'eval_spec': eval_spec,\n",
+ " 'eval_input_receiver_fn': receiver_fn\n",
+ " }"
+ ]
+ },
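The hidden layer sizes in `trainer_fn` above shrink geometrically from `first_dnn_layer_size` by `dnn_decay_factor`, with a floor of 2 units per layer. That construction can be factored out and inspected on its own:

```python
def hidden_layer_sizes(first_layer=100, num_layers=4, decay=0.7, floor=2):
    """Layer sizes shrinking geometrically, as constructed in trainer_fn:
    max(floor, int(first_layer * decay**i)) for each layer i."""
    return [max(floor, int(first_layer * decay ** i)) for i in range(num_layers)]

sizes = hidden_layer_sizes()
```

With the defaults this yields four strictly decreasing layer widths, starting at 100 and 70 units.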
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GY4yTRaX4YJx"
+ },
+ "source": [
+ "Now, we pass in this model code to the `Trainer` component and run it to train the model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "429-vvCWibO0"
+ },
+ "outputs": [],
+ "source": [
+ "from tfx.components.trainer.executor import Executor\n",
+ "from tfx.dsl.components.base import executor_spec\n",
+ "\n",
+ "trainer = tfx.components.Trainer(\n",
+ " module_file=os.path.abspath(_taxi_trainer_module_file),\n",
+ " custom_executor_spec=executor_spec.ExecutorClassSpec(Executor),\n",
+ " examples=transform.outputs['transformed_examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " transform_graph=transform.outputs['transform_graph'],\n",
+ " train_args=tfx.proto.TrainArgs(num_steps=10000),\n",
+ " eval_args=tfx.proto.EvalArgs(num_steps=5000))\n",
+ "context.run(trainer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6Cql1G35StJp"
+ },
+ "source": [
+ "#### Analyze Training with TensorBoard\n",
+ "Optionally, we can connect TensorBoard to the Trainer to analyze our model's training curves."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bXe62WE0S0Ek"
+ },
+ "outputs": [],
+ "source": [
+ "# Get the URI of the output artifact representing the training logs, which is a directory\n",
+ "model_run_dir = trainer.outputs['model_run'].get()[0].uri\n",
+ "\n",
+ "%load_ext tensorboard\n",
+ "%tensorboard --logdir {model_run_dir}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FmPftrv0lEQy"
+ },
+ "source": [
+ "### Evaluator\n",
+ "The `Evaluator` component computes model performance metrics over the evaluation set. It uses the [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/model_analysis/get_started) library. The `Evaluator` can also optionally validate that a newly trained model is better than the previous model. This is useful in a production pipeline setting where you may automatically train and validate a model every day. In this notebook, we only train one model, so the `Evaluator` will automatically label the model as \"good\".\n",
+ "\n",
+ "`Evaluator` will take as input the data from `ExampleGen`, the trained model from `Trainer`, and slicing configuration. The slicing configuration allows you to slice your metrics on feature values (e.g. how does your model perform on taxi trips that start at 8am versus 8pm?). See an example of this configuration below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "fVhfzzh9PDEx"
+ },
+ "outputs": [],
+ "source": [
+ "eval_config = tfma.EvalConfig(\n",
+ " model_specs=[\n",
+ " # Using signature 'eval' implies the use of an EvalSavedModel. To use\n",
+ " # a serving model, remove the signature so that it defaults to\n",
+ " # 'serving_default', and add a label_key.\n",
+ " tfma.ModelSpec(signature_name='eval')\n",
+ " ],\n",
+ " metrics_specs=[\n",
+ " tfma.MetricsSpec(\n",
+ " # The metrics added here are in addition to those saved with the\n",
+ " # model (assuming either a keras model or EvalSavedModel is used).\n",
+ " # Any metrics added into the saved model (for example using\n",
+ " # model.compile(..., metrics=[...]), etc) will be computed\n",
+ " # automatically.\n",
+ " metrics=[\n",
+ " tfma.MetricConfig(class_name='ExampleCount')\n",
+ " ],\n",
+ " # To add validation thresholds for metrics saved with the model,\n",
+ " # add them keyed by metric name to the thresholds map.\n",
+ " thresholds = {\n",
+ " 'accuracy': tfma.MetricThreshold(\n",
+ " value_threshold=tfma.GenericValueThreshold(\n",
+ " lower_bound={'value': 0.5}),\n",
+ " # Change threshold will be ignored if there is no\n",
+ " # baseline model resolved from MLMD (first run).\n",
+ " change_threshold=tfma.GenericChangeThreshold(\n",
+ " direction=tfma.MetricDirection.HIGHER_IS_BETTER,\n",
+ " absolute={'value': -1e-10}))\n",
+ " }\n",
+ " )\n",
+ " ],\n",
+ " slicing_specs=[\n",
+ " # An empty slice spec means the overall slice, i.e. the whole dataset.\n",
+ " tfma.SlicingSpec(),\n",
+ " # Data can be sliced along a feature column. In this case, data is\n",
+ " # sliced along feature column trip_start_hour.\n",
+ " tfma.SlicingSpec(feature_keys=['trip_start_hour'])\n",
+ " ])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9mBdKH1F8JuT"
+ },
+ "source": [
+ "Next, we give this configuration to `Evaluator` and run it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Zjcx8g6mihSt"
+ },
+ "outputs": [],
+ "source": [
+ "# Use TFMA to compute evaluation statistics over features of a model and\n",
+ "# validate them against a baseline.\n",
+ "\n",
+ "# The model resolver is only required if performing model validation in addition\n",
+ "# to evaluation. In this case we validate against the latest blessed model. If\n",
+ "# no model has been blessed before (as in this case) the evaluator will make our\n",
+ "# candidate the first blessed model.\n",
+ "model_resolver = tfx.dsl.Resolver(\n",
+ " strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,\n",
+ " model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),\n",
+ " model_blessing=tfx.dsl.Channel(\n",
+ " type=tfx.types.standard_artifacts.ModelBlessing)).with_id(\n",
+ " 'latest_blessed_model_resolver')\n",
+ "context.run(model_resolver)\n",
+ "\n",
+ "evaluator = tfx.components.Evaluator(\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " model=trainer.outputs['model'],\n",
+ " eval_config=eval_config)\n",
+ "context.run(evaluator)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AeCVkBusS_8g"
+ },
+ "source": [
+ "Now let's examine the output artifacts of `Evaluator`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "k4GghePOTJxL"
+ },
+ "outputs": [],
+ "source": [
+ "evaluator.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Y5TMskWe9LL0"
+ },
+ "source": [
+ "Using the `evaluation` output, we can show the default visualization of global metrics on the entire evaluation set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "U729j5X5QQUQ"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(evaluator.outputs['evaluation'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t-tI4p6m-OAn"
+ },
+ "source": [
+ "To see the visualization for sliced evaluation metrics, we can directly call the TensorFlow Model Analysis library."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pyis6iy0HLdi"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow_model_analysis as tfma\n",
+ "\n",
+ "# Get the TFMA output result path and load the result.\n",
+ "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
+ "tfma_result = tfma.load_eval_result(PATH_TO_RESULT)\n",
+ "\n",
+ "# Show data sliced along feature column trip_start_hour.\n",
+ "tfma.view.render_slicing_metrics(\n",
+ " tfma_result, slicing_column='trip_start_hour')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7uvYrUf2-r_6"
+ },
+ "source": [
+ "This visualization shows the same metrics, but computed at every feature value of `trip_start_hour` instead of on the entire evaluation set.\n",
+ "\n",
+ "TensorFlow Model Analysis supports many other visualizations, such as Fairness Indicators and plotting a time series of model performance. To learn more, see [the tutorial](/tutorials/model_analysis/tfma_basic)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TEotnkxEswUb"
+ },
+ "source": [
+ "Since we added thresholds to our config, validation output is also available. The presence of a `blessing` artifact indicates that our model passed validation. Since this is the first validation being performed, the candidate is automatically blessed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FZmiRtg6TKtR"
+ },
+ "outputs": [],
+ "source": [
+ "blessing_uri = evaluator.outputs['blessing'].get()[0].uri\n",
+ "!ls -l {blessing_uri}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hM1tFkOVSBa0"
+ },
+ "source": [
+ "We can also verify the success by loading the validation result record:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "lxa5G08bSJ8a"
+ },
+ "outputs": [],
+ "source": [
+ "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
+ "print(tfma.load_validation_result(PATH_TO_RESULT))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T8DYekCZlHfj"
+ },
+ "source": [
+ "### Pusher\n",
+ "The `Pusher` component is usually at the end of a TFX pipeline. It checks whether a model has passed validation, and if so, exports the model to `_serving_model_dir`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "r45nQ69eikc9"
+ },
+ "outputs": [],
+ "source": [
+ "pusher = tfx.components.Pusher(\n",
+ " model=trainer.outputs['model'],\n",
+ " model_blessing=evaluator.outputs['blessing'],\n",
+ " push_destination=tfx.proto.PushDestination(\n",
+ " filesystem=tfx.proto.PushDestination.Filesystem(\n",
+ " base_directory=_serving_model_dir)))\n",
+ "context.run(pusher)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ctUErBYoTO9I"
+ },
+ "source": [
+ "Let's examine the output artifacts of `Pusher`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pRkWo-MzTSss"
+ },
+ "outputs": [],
+ "source": [
+ "pusher.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "peH2PPS3VgkL"
+ },
+ "source": [
+ "In particular, the Pusher will export your model in the SavedModel format, which looks like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4zyIqWl9TSdG"
+ },
+ "outputs": [],
+ "source": [
+ "push_uri = pusher.outputs['pushed_model'].get()[0].uri\n",
+ "model = tf.saved_model.load(push_uri)\n",
+ "\n",
+ "for item in model.signatures.items():\n",
+ " pp.pprint(item)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3-YPNUuHANtj"
+ },
+ "source": [
+ "We've finished our tour of built-in TFX components!"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "collapsed_sections": [
+ "wdeKOEkv1Fe8"
+ ],
+ "name": "components.ipynb",
+ "private_outputs": true,
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}
diff --git a/docs/tutorials/tfx/components_keras.ipynb b/docs/tutorials/tfx/components_keras.ipynb
index 37d3843ae1..f6bdd37aa7 100644
--- a/docs/tutorials/tfx/components_keras.ipynb
+++ b/docs/tutorials/tfx/components_keras.ipynb
@@ -1,53 +1,53 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wdeKOEkv1Fe8"
- },
- "source": [
- "##### Copyright 2021 The TensorFlow Authors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "id": "c2jyGuiG1gHr"
- },
- "outputs": [],
- "source": [
- "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "23R0Z9RojXYW"
- },
- "source": [
- "# TFX Keras Component Tutorial\n",
- "\n",
- "***A Component-by-Component Introduction to TensorFlow Extended (TFX)***"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "LidV2qsXm4XC"
- },
- "source": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wdeKOEkv1Fe8"
+ },
+ "source": [
+ "##### Copyright 2021 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "c2jyGuiG1gHr"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "23R0Z9RojXYW"
+ },
+ "source": [
+ "# TFX Keras Component Tutorial\n",
+ "\n",
+ "***A Component-by-Component Introduction to TensorFlow Extended (TFX)***"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LidV2qsXm4XC"
+ },
+ "source": [
"Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n",
"\n",
"\n",
@@ -84,1519 +84,1519 @@
" \n",
"
"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "KAD1tLoTm_QS"
- },
- "source": [
- "\n",
- "This Colab-based tutorial will interactively walk through each built-in component of TensorFlow Extended (TFX).\n",
- "\n",
- "It covers every step in an end-to-end machine learning pipeline, from data ingestion to pushing a model to serving.\n",
- "\n",
- "When you're done, the contents of this notebook can be automatically exported as TFX pipeline source code, which you can orchestrate with Apache Airflow and Apache Beam.\n",
- "\n",
- "Note: This notebook demonstrates the use of native Keras models in TFX pipelines. **TFX only supports the TensorFlow 2 version of Keras**."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "sfSQ-kX-MLEr"
- },
- "source": [
- "## Background\n",
- "This notebook demonstrates how to use TFX in a Jupyter/Colab environment. Here, we walk through the Chicago Taxi example in an interactive notebook.\n",
- "\n",
- "Working in an interactive notebook is a useful way to become familiar with the structure of a TFX pipeline. It's also useful when doing development of your own pipelines as a lightweight development environment, but you should be aware that there are differences in the way interactive notebooks are orchestrated, and how they access metadata artifacts.\n",
- "\n",
- "### Orchestration\n",
- "\n",
- "In a production deployment of TFX, you will use an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components. In an interactive notebook, the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells.\n",
- "\n",
- "### Metadata\n",
- "\n",
- "In a production deployment of TFX, you will access metadata through the ML Metadata (MLMD) API. MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as on your filesystem. In an interactive notebook, both properties and payloads are stored in an ephemeral SQLite database in the `/tmp` directory on the Jupyter notebook or Colab server."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2GivNBNYjb3b"
- },
- "source": [
- "## Setup\n",
- "First, we install and import the necessary packages, set up paths, and download data."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Fmgi8ZvQkScg"
- },
- "source": [
- "### Upgrade Pip\n",
- "\n",
- "To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "as4OTe2ukSqm"
- },
- "outputs": [],
- "source": [
- "import sys\n",
- "if 'google.colab' in sys.modules:\n",
- " !pip install --upgrade pip"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "MZOYTt1RW4TK"
- },
- "source": [
- "### Install TFX\n",
- "\n",
- "**Note: In Google Colab, because of package updates, the first time you run this cell you must restart the runtime (Runtime \u003e Restart runtime ...).**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "S4SQA7Q5nej3"
- },
- "outputs": [],
- "source": [
- "!pip install tfx"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "EwT0nov5QO1M"
- },
- "source": [
- "## Did you restart the runtime?\n",
- "\n",
- "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime \u003e Restart runtime ...). This is because of the way that Colab loads packages."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "N-ePgV0Lj68Q"
- },
- "source": [
- "### Import packages\n",
- "We import necessary packages, including standard TFX component classes."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "YIqpWK9efviJ"
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import pprint\n",
- "import tempfile\n",
- "import urllib\n",
- "\n",
- "import absl\n",
- "import tensorflow as tf\n",
- "import tensorflow_model_analysis as tfma\n",
- "tf.get_logger().propagate = False\n",
- "pp = pprint.PrettyPrinter()\n",
- "\n",
- "from tfx import v1 as tfx\n",
- "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
- "\n",
- "%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wCZTHRy0N1D6"
- },
- "source": [
- "Let's check the library versions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "eZ4K18_DN2D8"
- },
- "outputs": [],
- "source": [
- "print('TensorFlow version: {}'.format(tf.__version__))\n",
- "print('TFX version: {}'.format(tfx.__version__))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ufJKQ6OvkJlY"
- },
- "source": [
- "### Set up pipeline paths"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ad5JLpKbf6sN"
- },
- "outputs": [],
- "source": [
- "# This is the root directory for your TFX pip package installation.\n",
- "_tfx_root = tfx.__path__[0]\n",
- "\n",
- "# This is the directory containing the TFX Chicago Taxi Pipeline example.\n",
- "_taxi_root = os.path.join(_tfx_root, 'examples/chicago_taxi_pipeline')\n",
- "\n",
- "# This is the path where your model will be pushed for serving.\n",
- "_serving_model_dir = os.path.join(\n",
- " tempfile.mkdtemp(), 'serving_model/taxi_simple')\n",
- "\n",
- "# Set up logging.\n",
- "absl.logging.set_verbosity(absl.logging.INFO)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "n2cMMAbSkGfX"
- },
- "source": [
- "### Download example data\n",
- "We download the example dataset for use in our TFX pipeline.\n",
- "\n",
- "The dataset we're using is the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) released by the City of Chicago. The columns in this dataset are:\n",
- "\n",
- "\u003ctable\u003e\n",
- "\u003ctr\u003e\u003ctd\u003epickup_community_area\u003c/td\u003e\u003ctd\u003efare\u003c/td\u003e\u003ctd\u003etrip_start_month\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003etrip_start_hour\u003c/td\u003e\u003ctd\u003etrip_start_day\u003c/td\u003e\u003ctd\u003etrip_start_timestamp\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003epickup_latitude\u003c/td\u003e\u003ctd\u003epickup_longitude\u003c/td\u003e\u003ctd\u003edropoff_latitude\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003edropoff_longitude\u003c/td\u003e\u003ctd\u003etrip_miles\u003c/td\u003e\u003ctd\u003epickup_census_tract\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003edropoff_census_tract\u003c/td\u003e\u003ctd\u003epayment_type\u003c/td\u003e\u003ctd\u003ecompany\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003ctr\u003e\u003ctd\u003etrip_seconds\u003c/td\u003e\u003ctd\u003edropoff_community_area\u003c/td\u003e\u003ctd\u003etips\u003c/td\u003e\u003c/tr\u003e\n",
- "\u003c/table\u003e\n",
- "\n",
- "With this dataset, we will build a model that predicts the `tips` of a trip."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "BywX6OUEhAqn"
- },
- "outputs": [],
- "source": [
- "_data_root = tempfile.mkdtemp(prefix='tfx-data')\n",
- "DATA_PATH = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv'\n",
- "_data_filepath = os.path.join(_data_root, \"data.csv\")\n",
- "urllib.request.urlretrieve(DATA_PATH, _data_filepath)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "blZC1sIQOWfH"
- },
- "source": [
- "Take a quick look at the CSV file."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "c5YPeLPFOXaD"
- },
- "outputs": [],
- "source": [
- "!head {_data_filepath}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "QioyhunCImwE"
- },
- "source": [
- "*Disclaimer: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8ONIE_hdkPS4"
- },
- "source": [
- "### Create the InteractiveContext\n",
- "Last, we create an InteractiveContext, which will allow us to run TFX components interactively in this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "0Rh6K5sUf9dd"
- },
- "outputs": [],
- "source": [
- "# Here, we create an InteractiveContext using default parameters. This will\n",
- "# use a temporary directory with an ephemeral ML Metadata database instance.\n",
- "# To use your own pipeline root or database, the optional properties\n",
- "# `pipeline_root` and `metadata_connection_config` may be passed to\n",
- "# InteractiveContext. Calls to InteractiveContext are no-ops outside of the\n",
- "# notebook.\n",
- "context = InteractiveContext()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HdQWxfsVkzdJ"
- },
- "source": [
- "## Run TFX components interactively\n",
- "In the cells that follow, we create TFX components one-by-one, run each of them, and visualize their output artifacts."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "L9fwt9gQk3BR"
- },
- "source": [
- "### ExampleGen\n",
- "\n",
- "The `ExampleGen` component is usually at the start of a TFX pipeline. It will:\n",
- "\n",
- "1. Split data into training and evaluation sets (by default, 2/3 training + 1/3 eval)\n",
- "2. Convert data into the `tf.Example` format (learn more [here](https://www.tensorflow.org/tutorials/load_data/tfrecord))\n",
- "3. Copy data into the `_tfx_root` directory for other components to access\n",
- "\n",
- "`ExampleGen` takes as input the path to your data source. In our case, this is the `_data_root` path that contains the downloaded CSV.\n",
- "\n",
- "Note: In this notebook, we can instantiate components one-by-one and run them with `InteractiveContext.run()`. By contrast, in a production setting, we would specify all the components upfront in a `Pipeline` to pass to the orchestrator (see the [Building a TFX Pipeline Guide](../../../guide/build_tfx_pipeline)).\n",
- "\n",
- "#### Enabling the Cache\n",
- "When using the `InteractiveContext` in a notebook to develop a pipeline you can control when individual components will cache their outputs. Set `enable_cache` to `True` when you want to reuse the previous output artifacts that the component generated. Set `enable_cache` to `False` when you want to recompute the output artifacts for a component, if you are making changes to the code for example."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "PyXjuMt8f-9u"
- },
- "outputs": [],
- "source": [
- "example_gen = tfx.components.CsvExampleGen(input_base=_data_root)\n",
- "context.run(example_gen, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OqCoZh7KPUm9"
- },
- "source": [
- "Let's examine the output artifacts of `ExampleGen`. This component produces two artifacts, training examples and evaluation examples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "880KkTAkPeUg"
- },
- "outputs": [],
- "source": [
- "artifact = example_gen.outputs['examples'].get()[0]\n",
- "print(artifact.split_names, artifact.uri)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "J6vcbW_wPqvl"
- },
- "source": [
- "We can also take a look at the first three training examples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "H4XIXjiCPwzQ"
- },
- "outputs": [],
- "source": [
- "# Get the URI of the output artifact representing the training examples, which is a directory\n",
- "train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')\n",
- "\n",
- "# Get the list of files in this directory (all compressed TFRecord files)\n",
- "tfrecord_filenames = [os.path.join(train_uri, name)\n",
- " for name in os.listdir(train_uri)]\n",
- "\n",
- "# Create a `TFRecordDataset` to read these files\n",
- "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- "\n",
- "# Iterate over the first 3 records and decode them.\n",
- "for tfrecord in dataset.take(3):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2gluYjccf-IP"
- },
- "source": [
- "Now that `ExampleGen` has finished ingesting the data, the next step is data analysis."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "csM6BFhtk5Aa"
- },
- "source": [
- "### StatisticsGen\n",
- "The `StatisticsGen` component computes statistics over your dataset for data analysis, as well as for use in downstream components. It uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "`StatisticsGen` takes as input the dataset we just ingested using `ExampleGen`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MAscCCYWgA-9"
- },
- "outputs": [],
- "source": [
- "statistics_gen = tfx.components.StatisticsGen(\n",
- " examples=example_gen.outputs['examples'])\n",
- "context.run(statistics_gen, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HLI6cb_5WugZ"
- },
- "source": [
- "After `StatisticsGen` finishes running, we can visualize the outputted statistics. Try playing with the different plots!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tLjXy7K6Tp_G"
- },
- "outputs": [],
- "source": [
- "context.show(statistics_gen.outputs['statistics'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HLKLTO9Nk60p"
- },
- "source": [
- "### SchemaGen\n",
- "\n",
- "The `SchemaGen` component generates a schema based on your data statistics. (A schema defines the expected bounds, types, and properties of the features in your dataset.) It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "Note: The generated schema is best-effort and only tries to infer basic properties of the data. It is expected that you review and modify it as needed.\n",
- "\n",
- "`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ygQvZ6hsiQ_J"
- },
- "outputs": [],
- "source": [
- "schema_gen = tfx.components.SchemaGen(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " infer_feature_shape=False)\n",
- "context.run(schema_gen, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "zi6TxTUKXM6b"
- },
- "source": [
- "After `SchemaGen` finishes running, we can visualize the generated schema as a table."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Ec9vqDXpXeMb"
- },
- "outputs": [],
- "source": [
- "context.show(schema_gen.outputs['schema'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kZWWdbA-m7zp"
- },
- "source": [
- "Each feature in your dataset shows up as a row in the schema table, alongside its properties. The schema also captures all the values that a categorical feature takes on, denoted as its domain.\n",
- "\n",
- "To learn more about schemas, see [the SchemaGen documentation](../../../guide/schemagen)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "V1qcUuO9k9f8"
- },
- "source": [
- "### ExampleValidator\n",
- "The `ExampleValidator` component detects anomalies in your data, based on the expectations defined by the schema. It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "`ExampleValidator` will take as input the statistics from `StatisticsGen`, and the schema from `SchemaGen`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XRlRUuGgiXks"
- },
- "outputs": [],
- "source": [
- "example_validator = tfx.components.ExampleValidator(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " schema=schema_gen.outputs['schema'])\n",
- "context.run(example_validator, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "855mrHgJcoer"
- },
- "source": [
- "After `ExampleValidator` finishes running, we can visualize the anomalies as a table."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "TDyAAozQcrk3"
- },
- "outputs": [],
- "source": [
- "context.show(example_validator.outputs['anomalies'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "znMoJj60ybZx"
- },
- "source": [
- "In the anomalies table, we can see that there are no anomalies. This is what we'd expect, since this the first dataset that we've analyzed and the schema is tailored to it. You should review this schema -- anything unexpected means an anomaly in the data. Once reviewed, the schema can be used to guard future data, and anomalies produced here can be used to debug model performance, understand how your data evolves over time, and identify data errors."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "JPViEz5RlA36"
- },
- "source": [
- "### Transform\n",
- "The `Transform` component performs feature engineering for both training and serving. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library.\n",
- "\n",
- "`Transform` will take as input the data from `ExampleGen`, the schema from `SchemaGen`, as well as a module that contains user-defined Transform code.\n",
- "\n",
- "Let's see an example of user-defined Transform code below (for an introduction to the TensorFlow Transform APIs, [see the tutorial](/tutorials/transform/simple)). First, we define a few constants for feature engineering:\n",
- "\n",
- "Note: The `%%writefile` cell magic will save the contents of the cell as a `.py` file on disk. This allows the `Transform` component to load your code as a module.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "PuNSiUKb4YJf"
- },
- "outputs": [],
- "source": [
- "_taxi_constants_module_file = 'taxi_constants.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "HPjhXuIF4YJh"
- },
- "outputs": [],
- "source": [
- "%%writefile {_taxi_constants_module_file}\n",
- "\n",
- "NUMERICAL_FEATURES = ['trip_miles', 'fare', 'trip_seconds']\n",
- "\n",
- "BUCKET_FEATURES = [\n",
- " 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',\n",
- " 'dropoff_longitude'\n",
- "]\n",
- "# Number of buckets used by tf.transform for encoding each feature.\n",
- "FEATURE_BUCKET_COUNT = 10\n",
- "\n",
- "CATEGORICAL_NUMERICAL_FEATURES = [\n",
- " 'trip_start_hour', 'trip_start_day', 'trip_start_month',\n",
- " 'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',\n",
- " 'dropoff_community_area'\n",
- "]\n",
- "\n",
- "CATEGORICAL_STRING_FEATURES = [\n",
- " 'payment_type',\n",
- " 'company',\n",
- "]\n",
- "\n",
- "# Number of vocabulary terms used for encoding categorical features.\n",
- "VOCAB_SIZE = 1000\n",
- "\n",
- "# Count of out-of-vocab buckets in which unrecognized categorical are hashed.\n",
- "OOV_SIZE = 10\n",
- "\n",
- "# Keys\n",
- "LABEL_KEY = 'tips'\n",
- "FARE_KEY = 'fare'\n",
- "\n",
- "def t_name(key):\n",
- " \"\"\"\n",
- " Rename the feature keys so that they don't clash with the raw keys when\n",
- " running the Evaluator component.\n",
- " Args:\n",
- " key: The original feature key\n",
- " Returns:\n",
- " key with '_xf' appended\n",
- " \"\"\"\n",
- " return key + '_xf'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Duj2Ax5z4YJl"
- },
- "source": [
- "Next, we write a `preprocessing_fn` that takes in raw data as input, and returns transformed features that our model can train on:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4AJ9hBs94YJm"
- },
- "outputs": [],
- "source": [
- "_taxi_transform_module_file = 'taxi_transform.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MYmxxx9A4YJn"
- },
- "outputs": [],
- "source": [
- "%%writefile {_taxi_transform_module_file}\n",
- "\n",
- "import tensorflow as tf\n",
- "import tensorflow_transform as tft\n",
- "\n",
- "# Imported files such as taxi_constants are normally cached, so changes are\n",
- "# not honored after the first import. Normally this is good for efficiency, but\n",
- "# during development when we may be iterating code it can be a problem. To\n",
- "# avoid this problem during development, reload the file.\n",
- "import taxi_constants\n",
- "import sys\n",
- "if 'google.colab' in sys.modules: # Testing to see if we're doing development\n",
- " import importlib\n",
- " importlib.reload(taxi_constants)\n",
- "\n",
- "_NUMERICAL_FEATURES = taxi_constants.NUMERICAL_FEATURES\n",
- "_BUCKET_FEATURES = taxi_constants.BUCKET_FEATURES\n",
- "_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT\n",
- "_CATEGORICAL_NUMERICAL_FEATURES = taxi_constants.CATEGORICAL_NUMERICAL_FEATURES\n",
- "_CATEGORICAL_STRING_FEATURES = taxi_constants.CATEGORICAL_STRING_FEATURES\n",
- "_VOCAB_SIZE = taxi_constants.VOCAB_SIZE\n",
- "_OOV_SIZE = taxi_constants.OOV_SIZE\n",
- "_FARE_KEY = taxi_constants.FARE_KEY\n",
- "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
- "\n",
- "\n",
- "def _make_one_hot(x, key):\n",
- " \"\"\"Make a one-hot tensor to encode categorical features.\n",
- " Args:\n",
- " X: A dense tensor\n",
- " key: A string key for the feature in the input\n",
- " Returns:\n",
- " A dense one-hot tensor as a float list\n",
- " \"\"\"\n",
- " integerized = tft.compute_and_apply_vocabulary(x,\n",
- " top_k=_VOCAB_SIZE,\n",
- " num_oov_buckets=_OOV_SIZE,\n",
- " vocab_filename=key, name=key)\n",
- " depth = (\n",
- " tft.experimental.get_vocabulary_size_by_name(key) + _OOV_SIZE)\n",
- " one_hot_encoded = tf.one_hot(\n",
- " integerized,\n",
- " depth=tf.cast(depth, tf.int32),\n",
- " on_value=1.0,\n",
- " off_value=0.0)\n",
- " return tf.reshape(one_hot_encoded, [-1, depth])\n",
- "\n",
- "\n",
- "def _fill_in_missing(x):\n",
- " \"\"\"Replace missing values in a SparseTensor.\n",
- " Fills in missing values of `x` with '' or 0, and converts to a dense tensor.\n",
- " Args:\n",
- " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
- " in the second dimension.\n",
- " Returns:\n",
- " A rank 1 tensor where missing values of `x` have been filled in.\n",
- " \"\"\"\n",
- " if not isinstance(x, tf.sparse.SparseTensor):\n",
- " return x\n",
- "\n",
- " default_value = '' if x.dtype == tf.string else 0\n",
- " return tf.squeeze(\n",
- " tf.sparse.to_dense(\n",
- " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
- " default_value),\n",
- " axis=1)\n",
- "\n",
- "\n",
- "def preprocessing_fn(inputs):\n",
- " \"\"\"tf.transform's callback function for preprocessing inputs.\n",
- " Args:\n",
- " inputs: map from feature keys to raw not-yet-transformed features.\n",
- " Returns:\n",
- " Map from string feature key to transformed feature operations.\n",
- " \"\"\"\n",
- " outputs = {}\n",
- " for key in _NUMERICAL_FEATURES:\n",
- " # If sparse make it dense, setting nan's to 0 or '', and apply zscore.\n",
- " outputs[taxi_constants.t_name(key)] = tft.scale_to_z_score(\n",
- " _fill_in_missing(inputs[key]), name=key)\n",
- "\n",
- " for key in _BUCKET_FEATURES:\n",
- " outputs[taxi_constants.t_name(key)] = tf.cast(tft.bucketize(\n",
- " _fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT, name=key),\n",
- " dtype=tf.float32)\n",
- "\n",
- " for key in _CATEGORICAL_STRING_FEATURES:\n",
- " outputs[taxi_constants.t_name(key)] = _make_one_hot(_fill_in_missing(inputs[key]), key)\n",
- "\n",
- " for key in _CATEGORICAL_NUMERICAL_FEATURES:\n",
- " outputs[taxi_constants.t_name(key)] = _make_one_hot(tf.strings.strip(\n",
- " tf.strings.as_string(_fill_in_missing(inputs[key]))), key)\n",
- "\n",
- " # Was this passenger a big tipper?\n",
- " taxi_fare = _fill_in_missing(inputs[_FARE_KEY])\n",
- " tips = _fill_in_missing(inputs[_LABEL_KEY])\n",
- " outputs[_LABEL_KEY] = tf.where(\n",
- " tf.math.is_nan(taxi_fare),\n",
- " tf.cast(tf.zeros_like(taxi_fare), tf.int64),\n",
- " # Test if the tip was \u003e 20% of the fare.\n",
- " tf.cast(\n",
- " tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))\n",
- "\n",
- " return outputs"
- ]
- },
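The label derivation at the end of `preprocessing_fn` (was the tip greater than 20% of the fare, with a missing fare mapped to label 0?) can be sketched in plain Python, outside of TensorFlow. The function name `make_label` is ours, invented for illustration only:

```python
import math

def make_label(fare, tip):
    # Mirrors the tf.where/tf.greater logic above: label 1 when the tip
    # exceeds 20% of the fare, 0 otherwise, and 0 when the fare is missing (NaN).
    if math.isnan(fare):
        return 0
    return int(tip > 0.2 * fare)

print(make_label(10.0, 3.0))          # big tipper -> 1
print(make_label(10.0, 1.0))          # not a big tipper -> 0
print(make_label(float('nan'), 5.0))  # missing fare -> 0
```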
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wgbmZr3sgbWW"
- },
- "source": [
- "Now, we pass in this feature engineering code to the `Transform` component and run it to transform your data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "jHfhth_GiZI9"
- },
- "outputs": [],
- "source": [
- "transform = tfx.components.Transform(\n",
- " examples=example_gen.outputs['examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " module_file=os.path.abspath(_taxi_transform_module_file))\n",
- "context.run(transform, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fwAwb4rARRQ2"
- },
- "source": [
- "Let's examine the output artifacts of `Transform`. This component produces two types of outputs:\n",
- "\n",
- "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
- "* `transformed_examples` represents the preprocessed training and evaluation data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "SClrAaEGR1O5"
- },
- "outputs": [],
- "source": [
- "transform.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "vyFkBd9AR1sy"
- },
- "source": [
- "Take a peek at the `transform_graph` artifact. It points to a directory containing three subdirectories."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "5tRw4DneR3i7"
- },
- "outputs": [],
- "source": [
- "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
- "os.listdir(train_uri)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4fqV54CIR6Pu"
- },
- "source": [
- "The `transformed_metadata` subdirectory contains the schema of the preprocessed data. The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data.\n",
- "\n",
- "We can also take a look at the first three transformed examples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pwbW2zPKR_S4"
- },
- "outputs": [],
- "source": [
- "# Get the URI of the output artifact representing the transformed examples, which is a directory\n",
- "train_uri = os.path.join(transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')\n",
- "\n",
- "# Get the list of files in this directory (all compressed TFRecord files)\n",
- "tfrecord_filenames = [os.path.join(train_uri, name)\n",
- " for name in os.listdir(train_uri)]\n",
- "\n",
- "# Create a `TFRecordDataset` to read these files\n",
- "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- "\n",
- "# Iterate over the first 3 records and decode them.\n",
- "for tfrecord in dataset.take(3):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "q_b_V6eN4f69"
- },
- "source": [
- "After the `Transform` component has transformed your data into features, and the next step is to train a model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OBJFtnl6lCg9"
- },
- "source": [
- "### Trainer\n",
- "The `Trainer` component will train a model that you define in TensorFlow.\n",
- "\n",
- "`Trainer` takes as input the schema from `SchemaGen`, the transformed data and graph from `Transform`, training parameters, as well as a module that contains user-defined model code.\n",
- "\n",
- "Let's see an example of user-defined model code below (for an introduction to the TensorFlow Keras APIs, [see the tutorial](https://www.tensorflow.org/guide/keras)):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "N1376oq04YJt"
- },
- "outputs": [],
- "source": [
- "_taxi_trainer_module_file = 'taxi_trainer.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "nf9UuNng4YJu"
- },
- "outputs": [],
- "source": [
- "%%writefile {_taxi_trainer_module_file}\n",
- "\n",
- "from typing import Dict, List, Text\n",
- "\n",
- "import os\n",
- "import glob\n",
- "from absl import logging\n",
- "\n",
- "import datetime\n",
- "import tensorflow as tf\n",
- "import tensorflow_transform as tft\n",
- "\n",
- "from tfx import v1 as tfx\n",
- "from tfx_bsl.public import tfxio\n",
- "from tensorflow_transform import TFTransformOutput\n",
- "\n",
- "# Imported files such as taxi_constants are normally cached, so changes are\n",
- "# not honored after the first import. Normally this is good for efficiency, but\n",
- "# during development when we may be iterating code it can be a problem. To\n",
- "# avoid this problem during development, reload the file.\n",
- "import taxi_constants\n",
- "import sys\n",
- "if 'google.colab' in sys.modules: # Testing to see if we're doing development\n",
- " import importlib\n",
- " importlib.reload(taxi_constants)\n",
- "\n",
- "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
- "\n",
- "_BATCH_SIZE = 40\n",
- "\n",
- "\n",
- "def _input_fn(file_pattern: List[Text],\n",
- " data_accessor: tfx.components.DataAccessor,\n",
- " tf_transform_output: tft.TFTransformOutput,\n",
- " batch_size: int = 200) -\u003e tf.data.Dataset:\n",
- " \"\"\"Generates features and label for tuning/training.\n",
- "\n",
- " Args:\n",
- " file_pattern: List of paths or patterns of input tfrecord files.\n",
- " data_accessor: DataAccessor for converting input to RecordBatch.\n",
- " tf_transform_output: A TFTransformOutput.\n",
- " batch_size: representing the number of consecutive elements of returned\n",
- " dataset to combine in a single batch\n",
- "\n",
- " Returns:\n",
- " A dataset that contains (features, indices) tuple where features is a\n",
- " dictionary of Tensors, and indices is a single Tensor of label indices.\n",
- " \"\"\"\n",
- " return data_accessor.tf_dataset_factory(\n",
- " file_pattern,\n",
- " tfxio.TensorFlowDatasetOptions(\n",
- " batch_size=batch_size, label_key=_LABEL_KEY),\n",
- " tf_transform_output.transformed_metadata.schema)\n",
- "\n",
- "def _get_tf_examples_serving_signature(model, tf_transform_output):\n",
- " \"\"\"Returns a serving signature that accepts `tensorflow.Example`.\"\"\"\n",
- "\n",
- " # We need to track the layers in the model in order to save it.\n",
- " # TODO(b/162357359): Revise once the bug is resolved.\n",
- " model.tft_layer_inference = tf_transform_output.transform_features_layer()\n",
- "\n",
- " @tf.function(input_signature=[\n",
- " tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')\n",
- " ])\n",
- " def serve_tf_examples_fn(serialized_tf_example):\n",
- " \"\"\"Returns the output to be used in the serving signature.\"\"\"\n",
- " raw_feature_spec = tf_transform_output.raw_feature_spec()\n",
- " # Remove label feature since these will not be present at serving time.\n",
- " raw_feature_spec.pop(_LABEL_KEY)\n",
- " raw_features = tf.io.parse_example(serialized_tf_example, raw_feature_spec)\n",
- " transformed_features = model.tft_layer_inference(raw_features)\n",
- " logging.info('serve_transformed_features = %s', transformed_features)\n",
- "\n",
- " outputs = model(transformed_features)\n",
- " # TODO(b/154085620): Convert the predicted labels from the model using a\n",
- " # reverse-lookup (opposite of transform.py).\n",
- " return {'outputs': outputs}\n",
- "\n",
- " return serve_tf_examples_fn\n",
- "\n",
- "\n",
- "def _get_transform_features_signature(model, tf_transform_output):\n",
- " \"\"\"Returns a serving signature that applies tf.Transform to features.\"\"\"\n",
- "\n",
- " # We need to track the layers in the model in order to save it.\n",
- " # TODO(b/162357359): Revise once the bug is resolved.\n",
- " model.tft_layer_eval = tf_transform_output.transform_features_layer()\n",
- "\n",
- " @tf.function(input_signature=[\n",
- " tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')\n",
- " ])\n",
- " def transform_features_fn(serialized_tf_example):\n",
- " \"\"\"Returns the transformed_features to be fed as input to evaluator.\"\"\"\n",
- " raw_feature_spec = tf_transform_output.raw_feature_spec()\n",
- " raw_features = tf.io.parse_example(serialized_tf_example, raw_feature_spec)\n",
- " transformed_features = model.tft_layer_eval(raw_features)\n",
- " logging.info('eval_transformed_features = %s', transformed_features)\n",
- " return transformed_features\n",
- "\n",
- " return transform_features_fn\n",
- "\n",
- "\n",
- "def export_serving_model(tf_transform_output, model, output_dir):\n",
- " \"\"\"Exports a keras model for serving.\n",
- " Args:\n",
- " tf_transform_output: Wrapper around output of tf.Transform.\n",
- " model: A keras model to export for serving.\n",
- " output_dir: A directory where the model will be exported to.\n",
- " \"\"\"\n",
- " # The layer has to be saved to the model for keras tracking purpases.\n",
- " model.tft_layer = tf_transform_output.transform_features_layer()\n",
- "\n",
- " signatures = {\n",
- " 'serving_default':\n",
- " _get_tf_examples_serving_signature(model, tf_transform_output),\n",
- " 'transform_features':\n",
- " _get_transform_features_signature(model, tf_transform_output),\n",
- " }\n",
- "\n",
- " model.save(output_dir, save_format='tf', signatures=signatures)\n",
- "\n",
- "\n",
- "def _build_keras_model(tf_transform_output: TFTransformOutput\n",
- " ) -\u003e tf.keras.Model:\n",
- " \"\"\"Creates a DNN Keras model for classifying taxi data.\n",
- "\n",
- " Args:\n",
- " tf_transform_output: [TFTransformOutput], the outputs from Transform\n",
- "\n",
- " Returns:\n",
- " A keras Model.\n",
- " \"\"\"\n",
- " feature_spec = tf_transform_output.transformed_feature_spec().copy()\n",
- " feature_spec.pop(_LABEL_KEY)\n",
- "\n",
- " inputs = {}\n",
- " for key, spec in feature_spec.items():\n",
- " if isinstance(spec, tf.io.VarLenFeature):\n",
- " inputs[key] = tf.keras.layers.Input(\n",
- " shape=[None], name=key, dtype=spec.dtype, sparse=True)\n",
- " elif isinstance(spec, tf.io.FixedLenFeature):\n",
- " # TODO(b/208879020): Move into schema such that spec.shape is [1] and not\n",
- " # [] for scalars.\n",
- " inputs[key] = tf.keras.layers.Input(\n",
- " shape=spec.shape or [1], name=key, dtype=spec.dtype)\n",
- " else:\n",
- " raise ValueError('Spec type is not supported: ', key, spec)\n",
- "\n",
- " output = tf.keras.layers.Concatenate()(tf.nest.flatten(inputs))\n",
- " output = tf.keras.layers.Dense(100, activation='relu')(output)\n",
- " output = tf.keras.layers.Dense(70, activation='relu')(output)\n",
- " output = tf.keras.layers.Dense(50, activation='relu')(output)\n",
- " output = tf.keras.layers.Dense(20, activation='relu')(output)\n",
- " output = tf.keras.layers.Dense(1)(output)\n",
- " return tf.keras.Model(inputs=inputs, outputs=output)\n",
- "\n",
- "\n",
- "# TFX Trainer will call this function.\n",
- "def run_fn(fn_args: tfx.components.FnArgs):\n",
- " \"\"\"Train the model based on given args.\n",
- "\n",
- " Args:\n",
- " fn_args: Holds args used to train the model as name/value pairs.\n",
- " \"\"\"\n",
- " tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)\n",
- "\n",
- " train_dataset = _input_fn(fn_args.train_files, fn_args.data_accessor,\n",
- " tf_transform_output, _BATCH_SIZE)\n",
- " eval_dataset = _input_fn(fn_args.eval_files, fn_args.data_accessor,\n",
- " tf_transform_output, _BATCH_SIZE)\n",
- "\n",
- " model = _build_keras_model(tf_transform_output)\n",
- "\n",
- " model.compile(\n",
- " loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n",
- " optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),\n",
- " metrics=[tf.keras.metrics.BinaryAccuracy()])\n",
- "\n",
- " tensorboard_callback = tf.keras.callbacks.TensorBoard(\n",
- " log_dir=fn_args.model_run_dir, update_freq='batch')\n",
- "\n",
- " model.fit(\n",
- " train_dataset,\n",
- " steps_per_epoch=fn_args.train_steps,\n",
- " validation_data=eval_dataset,\n",
- " validation_steps=fn_args.eval_steps,\n",
- " callbacks=[tensorboard_callback])\n",
- "\n",
- " # Export the model.\n",
- " export_serving_model(tf_transform_output, model, fn_args.serving_model_dir)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GY4yTRaX4YJx"
- },
- "source": [
- "Now, we pass in this model code to the `Trainer` component and run it to train the model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "429-vvCWibO0"
- },
- "outputs": [],
- "source": [
- "trainer = tfx.components.Trainer(\n",
- " module_file=os.path.abspath(_taxi_trainer_module_file),\n",
- " examples=transform.outputs['transformed_examples'],\n",
- " transform_graph=transform.outputs['transform_graph'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " train_args=tfx.proto.TrainArgs(num_steps=10000),\n",
- " eval_args=tfx.proto.EvalArgs(num_steps=5000))\n",
- "context.run(trainer, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6Cql1G35StJp"
- },
- "source": [
- "#### Analyze Training with TensorBoard\n",
- "Take a peek at the trainer artifact. It points to a directory containing the model subdirectories."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "bXe62WE0S0Ek"
- },
- "outputs": [],
- "source": [
- "model_artifact_dir = trainer.outputs['model'].get()[0].uri\n",
- "pp.pprint(os.listdir(model_artifact_dir))\n",
- "model_dir = os.path.join(model_artifact_dir, 'Format-Serving')\n",
- "pp.pprint(os.listdir(model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "DfjOmSro6Q3Y"
- },
- "source": [
- "Optionally, we can connect TensorBoard to the Trainer to analyze our model's training curves."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "-APzqz2NeAyj"
- },
- "outputs": [],
- "source": [
- "model_run_artifact_dir = trainer.outputs['model_run'].get()[0].uri\n",
- "\n",
- "%load_ext tensorboard\n",
- "%tensorboard --logdir {model_run_artifact_dir}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FmPftrv0lEQy"
- },
- "source": [
- "### Evaluator\n",
- "The `Evaluator` component computes model performance metrics over the evaluation set. It uses the [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/model_analysis/get_started) library. The `Evaluator` can also optionally validate that a newly trained model is better than the previous model. This is useful in a production pipeline setting where you may automatically train and validate a model every day. In this notebook, we only train one model, so the `Evaluator` automatically will label the\n",
- "model as \"good\".\n",
- "\n",
- "`Evaluator` will take as input the data from `ExampleGen`, the trained model from `Trainer`, and slicing configuration. The slicing configuration allows you to slice your metrics on feature values (e.g. how does your model perform on taxi trips that start at 8am versus 8pm?). See an example of this configuration below:"
- ]
- },
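To make the idea of slicing concrete before looking at the config, here is a plain-Python sketch (hypothetical records, not the TFMA implementation) of computing one metric both overall and per value of a slice feature:

```python
from collections import defaultdict

# Hypothetical labeled predictions keyed by trip_start_hour (invented data).
records = [
    {'trip_start_hour': 8,  'label': 1, 'pred': 1},
    {'trip_start_hour': 8,  'label': 0, 'pred': 1},
    {'trip_start_hour': 20, 'label': 1, 'pred': 1},
    {'trip_start_hour': 20, 'label': 0, 'pred': 0},
]

def sliced_accuracy(records, slice_key):
    # Group records into the overall (empty) slice plus one slice per
    # feature value, then compute accuracy within each group.
    slices = defaultdict(list)
    for r in records:
        slices['Overall'].append(r)
        slices['{}={}'.format(slice_key, r[slice_key])].append(r)
    return {name: sum(r['label'] == r['pred'] for r in rs) / len(rs)
            for name, rs in slices.items()}

print(sliced_accuracy(records, 'trip_start_hour'))
```

With this toy data, accuracy is 0.75 overall but 0.5 for trips starting at 8 and 1.0 for trips starting at 20, which is exactly the kind of disparity slicing is meant to surface.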
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "fVhfzzh9PDEx"
- },
- "outputs": [],
- "source": [
- "# Imported files such as taxi_constants are normally cached, so changes are\n",
- "# not honored after the first import. Normally this is good for efficiency, but\n",
- "# during development when we may be iterating code it can be a problem. To\n",
- "# avoid this problem during development, reload the file.\n",
- "import taxi_constants\n",
- "import sys\n",
- "if 'google.colab' in sys.modules: # Testing to see if we're doing development\n",
- " import importlib\n",
- " importlib.reload(taxi_constants)\n",
- "\n",
- "eval_config = tfma.EvalConfig(\n",
- " model_specs=[\n",
- " # This assumes a serving model with signature 'serving_default'. If\n",
- " # using estimator based EvalSavedModel, add signature_name: 'eval' and\n",
- " # remove the label_key.\n",
- " tfma.ModelSpec(\n",
- " signature_name='serving_default',\n",
- " label_key=taxi_constants.LABEL_KEY,\n",
- " preprocessing_function_names=['transform_features'],\n",
- " )\n",
- " ],\n",
- " metrics_specs=[\n",
- " tfma.MetricsSpec(\n",
- " # The metrics added here are in addition to those saved with the\n",
- " # model (assuming either a keras model or EvalSavedModel is used).\n",
- " # Any metrics added into the saved model (for example using\n",
- " # model.compile(..., metrics=[...]), etc) will be computed\n",
- " # automatically.\n",
- " # To add validation thresholds for metrics saved with the model,\n",
- " # add them keyed by metric name to the thresholds map.\n",
- " metrics=[\n",
- " tfma.MetricConfig(class_name='ExampleCount'),\n",
- " tfma.MetricConfig(class_name='BinaryAccuracy',\n",
- " threshold=tfma.MetricThreshold(\n",
- " value_threshold=tfma.GenericValueThreshold(\n",
- " lower_bound={'value': 0.5}),\n",
- " # Change threshold will be ignored if there is no\n",
- " # baseline model resolved from MLMD (first run).\n",
- " change_threshold=tfma.GenericChangeThreshold(\n",
- " direction=tfma.MetricDirection.HIGHER_IS_BETTER,\n",
- " absolute={'value': -1e-10})))\n",
- " ]\n",
- " )\n",
- " ],\n",
- " slicing_specs=[\n",
- " # An empty slice spec means the overall slice, i.e. the whole dataset.\n",
- " tfma.SlicingSpec(),\n",
- " # Data can be sliced along a feature column. In this case, data is\n",
- " # sliced along feature column trip_start_hour.\n",
- " tfma.SlicingSpec(\n",
- " feature_keys=['trip_start_hour'])\n",
- " ])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "9mBdKH1F8JuT"
- },
- "source": [
- "Next, we give this configuration to `Evaluator` and run it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Zjcx8g6mihSt"
- },
- "outputs": [],
- "source": [
- "# Use TFMA to compute a evaluation statistics over features of a model and\n",
- "# validate them against a baseline.\n",
- "\n",
- "# The model resolver is only required if performing model validation in addition\n",
- "# to evaluation. In this case we validate against the latest blessed model. If\n",
- "# no model has been blessed before (as in this case) the evaluator will make our\n",
- "# candidate the first blessed model.\n",
- "model_resolver = tfx.dsl.Resolver(\n",
- " strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,\n",
- " model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),\n",
- " model_blessing=tfx.dsl.Channel(\n",
- " type=tfx.types.standard_artifacts.ModelBlessing)).with_id(\n",
- " 'latest_blessed_model_resolver')\n",
- "context.run(model_resolver, enable_cache=True)\n",
- "\n",
- "evaluator = tfx.components.Evaluator(\n",
- " examples=example_gen.outputs['examples'],\n",
- " model=trainer.outputs['model'],\n",
- " baseline_model=model_resolver.outputs['model'],\n",
- " eval_config=eval_config)\n",
- "context.run(evaluator, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "AeCVkBusS_8g"
- },
- "source": [
- "Now let's examine the output artifacts of `Evaluator`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "k4GghePOTJxL"
- },
- "outputs": [],
- "source": [
- "evaluator.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Y5TMskWe9LL0"
- },
- "source": [
- "Using the `evaluation` output we can show the default visualization of global metrics on the entire evaluation set."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "U729j5X5QQUQ"
- },
- "outputs": [],
- "source": [
- "context.show(evaluator.outputs['evaluation'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "t-tI4p6m-OAn"
- },
- "source": [
- "To see the visualization for sliced evaluation metrics, we can directly call the TensorFlow Model Analysis library."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pyis6iy0HLdi"
- },
- "outputs": [],
- "source": [
- "import tensorflow_model_analysis as tfma\n",
- "\n",
- "# Get the TFMA output result path and load the result.\n",
- "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
- "tfma_result = tfma.load_eval_result(PATH_TO_RESULT)\n",
- "\n",
- "# Show data sliced along feature column trip_start_hour.\n",
- "tfma.view.render_slicing_metrics(\n",
- " tfma_result, slicing_column='trip_start_hour')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "7uvYrUf2-r_6"
- },
- "source": [
- "This visualization shows the same metrics, but computed at every feature value of `trip_start_hour` instead of on the entire evaluation set.\n",
- "\n",
- "TensorFlow Model Analysis supports many other visualizations, such as Fairness Indicators and plotting a time series of model performance. To learn more, see [the tutorial](/tutorials/model_analysis/tfma_basic)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TEotnkxEswUb"
- },
- "source": [
- "Since we added thresholds to our config, validation output is also available. The precence of a `blessing` artifact indicates that our model passed validation. Since this is the first validation being performed the candidate is automatically blessed."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "FZmiRtg6TKtR"
- },
- "outputs": [],
- "source": [
- "blessing_uri = evaluator.outputs['blessing'].get()[0].uri\n",
- "!ls -l {blessing_uri}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "hM1tFkOVSBa0"
- },
- "source": [
- "Now can also verify the success by loading the validation result record:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "lxa5G08bSJ8a"
- },
- "outputs": [],
- "source": [
- "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
- "print(tfma.load_validation_result(PATH_TO_RESULT))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "T8DYekCZlHfj"
- },
- "source": [
- "### Pusher\n",
- "The `Pusher` component is usually at the end of a TFX pipeline. It checks whether a model has passed validation, and if so, exports the model to `_serving_model_dir`."
- ]
- },
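The core contract of `Pusher` can be sketched in plain Python: copy the model to the serving directory only if the blessing artifact contains a `BLESSED` marker. This is an illustration of the blessing check, not the TFX implementation, and `push_if_blessed` is a hypothetical helper:

```python
import os
import shutil
import tempfile

def push_if_blessed(model_dir, blessing_dir, serving_dir):
    # Push only when the Evaluator wrote a BLESSED marker into the
    # blessing artifact directory; otherwise leave the serving dir alone.
    if not os.path.exists(os.path.join(blessing_dir, 'BLESSED')):
        return False
    shutil.copytree(model_dir, serving_dir, dirs_exist_ok=True)
    return True

root = tempfile.mkdtemp()
model_dir = os.path.join(root, 'model'); os.makedirs(model_dir)
open(os.path.join(model_dir, 'saved_model.pb'), 'w').close()
blessing_dir = os.path.join(root, 'blessing'); os.makedirs(blessing_dir)
serving_dir = os.path.join(root, 'serving')

print(push_if_blessed(model_dir, blessing_dir, serving_dir))  # not blessed yet
open(os.path.join(blessing_dir, 'BLESSED'), 'w').close()
print(push_if_blessed(model_dir, blessing_dir, serving_dir))  # blessed, so pushed
```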
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "r45nQ69eikc9"
- },
- "outputs": [],
- "source": [
- "pusher = tfx.components.Pusher(\n",
- " model=trainer.outputs['model'],\n",
- " model_blessing=evaluator.outputs['blessing'],\n",
- " push_destination=tfx.proto.PushDestination(\n",
- " filesystem=tfx.proto.PushDestination.Filesystem(\n",
- " base_directory=_serving_model_dir)))\n",
- "context.run(pusher, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ctUErBYoTO9I"
- },
- "source": [
- "Let's examine the output artifacts of `Pusher`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pRkWo-MzTSss"
- },
- "outputs": [],
- "source": [
- "pusher.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "peH2PPS3VgkL"
- },
- "source": [
- "In particular, the Pusher will export your model in the SavedModel format, which looks like this:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4zyIqWl9TSdG"
- },
- "outputs": [],
- "source": [
- "push_uri = pusher.outputs['pushed_model'].get()[0].uri\n",
- "model = tf.saved_model.load(push_uri)\n",
- "\n",
- "for item in model.signatures.items():\n",
- " pp.pprint(item)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3-YPNUuHANtj"
- },
- "source": [
- "We're finished our tour of built-in TFX components!"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "collapsed_sections": [
- "wdeKOEkv1Fe8"
- ],
- "name": "components_keras.ipynb",
- "private_outputs": true,
- "provenance": [],
- "toc_visible": true
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KAD1tLoTm_QS"
+ },
+ "source": [
+ "\n",
+ "This Colab-based tutorial will interactively walk through each built-in component of TensorFlow Extended (TFX).\n",
+ "\n",
+ "It covers every step in an end-to-end machine learning pipeline, from data ingestion to pushing a model to serving.\n",
+ "\n",
+ "When you're done, the contents of this notebook can be automatically exported as TFX pipeline source code, which you can orchestrate with Apache Airflow and Apache Beam.\n",
+ "\n",
+ "Note: This notebook demonstrates the use of native Keras models in TFX pipelines. **TFX only supports the TensorFlow 2 version of Keras**."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sfSQ-kX-MLEr"
+ },
+ "source": [
+ "## Background\n",
+ "This notebook demonstrates how to use TFX in a Jupyter/Colab environment. Here, we walk through the Chicago Taxi example in an interactive notebook.\n",
+ "\n",
+ "Working in an interactive notebook is a useful way to become familiar with the structure of a TFX pipeline. It's also useful when doing development of your own pipelines as a lightweight development environment, but you should be aware that there are differences in the way interactive notebooks are orchestrated, and how they access metadata artifacts.\n",
+ "\n",
+ "### Orchestration\n",
+ "\n",
+ "In a production deployment of TFX, you will use an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components. In an interactive notebook, the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells.\n",
+ "\n",
+ "### Metadata\n",
+ "\n",
+ "In a production deployment of TFX, you will access metadata through the ML Metadata (MLMD) API. MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as on your filesystem. In an interactive notebook, both properties and payloads are stored in an ephemeral SQLite database in the `/tmp` directory on the Jupyter notebook or Colab server."
+ ]
+ },
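As a minimal sketch of what "metadata properties in a database such as SQLite" means (the real MLMD schema is much richer; this table and its columns are invented for illustration):

```python
import os
import sqlite3
import tempfile

# An ephemeral database, similar in spirit to what InteractiveContext
# creates under /tmp (schema here is a toy, not MLMD's).
db_path = os.path.join(tempfile.mkdtemp(), 'metadata.sqlite')
conn = sqlite3.connect(db_path)
conn.execute('CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT)')
conn.execute('INSERT INTO artifact (type, uri) VALUES (?, ?)',
             ('Examples', '/tmp/tfx/CsvExampleGen/examples/1'))
conn.commit()
print(conn.execute('SELECT type, uri FROM artifact').fetchall())
```

The metadata *payloads* (the actual artifact files) would live at the recorded URIs on the filesystem, while only the properties live in the database.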
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2GivNBNYjb3b"
+ },
+ "source": [
+ "## Setup\n",
+ "First, we install and import the necessary packages, set up paths, and download data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Fmgi8ZvQkScg"
+ },
+ "source": [
+ "### Upgrade Pip\n",
+ "\n",
+ "To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "as4OTe2ukSqm"
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "if 'google.colab' in sys.modules:\n",
+ " !pip install --upgrade pip"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MZOYTt1RW4TK"
+ },
+ "source": [
+ "### Install TFX\n",
+ "\n",
+ "**Note: In Google Colab, because of package updates, the first time you run this cell you must restart the runtime (Runtime > Restart runtime ...).**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "S4SQA7Q5nej3"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install tfx"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EwT0nov5QO1M"
+ },
+ "source": [
+ "## Did you restart the runtime?\n",
+ "\n",
+ "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime ...). This is because of the way that Colab loads packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N-ePgV0Lj68Q"
+ },
+ "source": [
+ "### Import packages\n",
+ "We import necessary packages, including standard TFX component classes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "YIqpWK9efviJ"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pprint\n",
+ "import tempfile\n",
+ "import urllib\n",
+ "\n",
+ "import absl\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_model_analysis as tfma\n",
+ "tf.get_logger().propagate = False\n",
+ "pp = pprint.PrettyPrinter()\n",
+ "\n",
+ "from tfx import v1 as tfx\n",
+ "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
+ "\n",
+ "%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wCZTHRy0N1D6"
+ },
+ "source": [
+ "Let's check the library versions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "eZ4K18_DN2D8"
+ },
+ "outputs": [],
+ "source": [
+ "print('TensorFlow version: {}'.format(tf.__version__))\n",
+ "print('TFX version: {}'.format(tfx.__version__))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ufJKQ6OvkJlY"
+ },
+ "source": [
+ "### Set up pipeline paths"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ad5JLpKbf6sN"
+ },
+ "outputs": [],
+ "source": [
+ "# This is the root directory for your TFX pip package installation.\n",
+ "_tfx_root = tfx.__path__[0]\n",
+ "\n",
+ "# This is the directory containing the TFX Chicago Taxi Pipeline example.\n",
+ "_taxi_root = os.path.join(_tfx_root, 'examples/chicago_taxi_pipeline')\n",
+ "\n",
+ "# This is the path where your model will be pushed for serving.\n",
+ "_serving_model_dir = os.path.join(\n",
+ " tempfile.mkdtemp(), 'serving_model/taxi_simple')\n",
+ "\n",
+ "# Set up logging.\n",
+ "absl.logging.set_verbosity(absl.logging.INFO)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "n2cMMAbSkGfX"
+ },
+ "source": [
+ "### Download example data\n",
+ "We download the example dataset for use in our TFX pipeline.\n",
+ "\n",
+ "The dataset we're using is the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) released by the City of Chicago. The columns in this dataset are:\n",
+ "\n",
+ "\n",
+ "<table>\n",
+ "<tr><td>pickup_community_area</td><td>fare</td><td>trip_start_month</td></tr>\n",
+ "<tr><td>trip_start_hour</td><td>trip_start_day</td><td>trip_start_timestamp</td></tr>\n",
+ "<tr><td>pickup_latitude</td><td>pickup_longitude</td><td>dropoff_latitude</td></tr>\n",
+ "<tr><td>dropoff_longitude</td><td>trip_miles</td><td>pickup_census_tract</td></tr>\n",
+ "<tr><td>dropoff_census_tract</td><td>payment_type</td><td>company</td></tr>\n",
+ "<tr><td>trip_seconds</td><td>dropoff_community_area</td><td>tips</td></tr>\n",
+ "</table>\n",
+ "\n",
+ "With this dataset, we will build a model that predicts the `tips` of a trip."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "BywX6OUEhAqn"
+ },
+ "outputs": [],
+ "source": [
+ "_data_root = tempfile.mkdtemp(prefix='tfx-data')\n",
+ "DATA_PATH = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv'\n",
+ "_data_filepath = os.path.join(_data_root, \"data.csv\")\n",
+ "urllib.request.urlretrieve(DATA_PATH, _data_filepath)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "blZC1sIQOWfH"
+ },
+ "source": [
+ "Take a quick look at the CSV file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "c5YPeLPFOXaD"
+ },
+ "outputs": [],
+ "source": [
+ "!head {_data_filepath}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QioyhunCImwE"
+ },
+ "source": [
+ "*Disclaimer: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8ONIE_hdkPS4"
+ },
+ "source": [
+ "### Create the InteractiveContext\n",
+ "Finally, we create an `InteractiveContext`, which will allow us to run TFX components interactively in this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0Rh6K5sUf9dd"
+ },
+ "outputs": [],
+ "source": [
+ "# Here, we create an InteractiveContext using default parameters. This will\n",
+ "# use a temporary directory with an ephemeral ML Metadata database instance.\n",
+ "# To use your own pipeline root or database, the optional properties\n",
+ "# `pipeline_root` and `metadata_connection_config` may be passed to\n",
+ "# InteractiveContext. Calls to InteractiveContext are no-ops outside of the\n",
+ "# notebook.\n",
+ "context = InteractiveContext()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HdQWxfsVkzdJ"
+ },
+ "source": [
+ "## Run TFX components interactively\n",
+ "In the cells that follow, we create TFX components one-by-one, run each of them, and visualize their output artifacts."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L9fwt9gQk3BR"
+ },
+ "source": [
+ "### ExampleGen\n",
+ "\n",
+ "The `ExampleGen` component is usually at the start of a TFX pipeline. It will:\n",
+ "\n",
+ "1. Split data into training and evaluation sets (by default, 2/3 training + 1/3 eval)\n",
+ "2. Convert data into the `tf.Example` format (learn more [here](https://www.tensorflow.org/tutorials/load_data/tfrecord))\n",
+ "3. Copy data into the `_tfx_root` directory for other components to access\n",
+ "\n",
+ "`ExampleGen` takes as input the path to your data source. In our case, this is the `_data_root` path that contains the downloaded CSV.\n",
+ "\n",
+ "Note: In this notebook, we can instantiate components one-by-one and run them with `InteractiveContext.run()`. By contrast, in a production setting, we would specify all the components upfront in a `Pipeline` to pass to the orchestrator (see the [Building a TFX Pipeline Guide](../../../guide/build_tfx_pipeline)).\n",
+ "\n",
+ "#### Enabling the Cache\n",
+ "When using the `InteractiveContext` in a notebook to develop a pipeline, you can control when individual components cache their outputs. Set `enable_cache` to `True` when you want to reuse the previous output artifacts that the component generated. Set `enable_cache` to `False` when you want to recompute the output artifacts for a component, for example when you are making changes to the code."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "PyXjuMt8f-9u"
+ },
+ "outputs": [],
+ "source": [
+ "example_gen = tfx.components.CsvExampleGen(input_base=_data_root)\n",
+ "context.run(example_gen, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OqCoZh7KPUm9"
+ },
+ "source": [
+ "Let's examine the output artifacts of `ExampleGen`. This component produces two artifacts, training examples and evaluation examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "880KkTAkPeUg"
+ },
+ "outputs": [],
+ "source": [
+ "artifact = example_gen.outputs['examples'].get()[0]\n",
+ "print(artifact.split_names, artifact.uri)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J6vcbW_wPqvl"
+ },
+ "source": [
+ "We can also take a look at the first three training examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "H4XIXjiCPwzQ"
+ },
+ "outputs": [],
+ "source": [
+ "# Get the URI of the output artifact representing the training examples, which is a directory\n",
+ "train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')\n",
+ "\n",
+ "# Get the list of files in this directory (all compressed TFRecord files)\n",
+ "tfrecord_filenames = [os.path.join(train_uri, name)\n",
+ " for name in os.listdir(train_uri)]\n",
+ "\n",
+ "# Create a `TFRecordDataset` to read these files\n",
+ "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ "\n",
+ "# Iterate over the first 3 records and decode them.\n",
+ "for tfrecord in dataset.take(3):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2gluYjccf-IP"
+ },
+ "source": [
+ "Now that `ExampleGen` has finished ingesting the data, the next step is data analysis."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "csM6BFhtk5Aa"
+ },
+ "source": [
+ "### StatisticsGen\n",
+ "The `StatisticsGen` component computes statistics over your dataset for data analysis, as well as for use in downstream components. It uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "`StatisticsGen` takes as input the dataset we just ingested using `ExampleGen`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MAscCCYWgA-9"
+ },
+ "outputs": [],
+ "source": [
+ "statistics_gen = tfx.components.StatisticsGen(\n",
+ " examples=example_gen.outputs['examples'])\n",
+ "context.run(statistics_gen, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HLI6cb_5WugZ"
+ },
+ "source": [
+ "After `StatisticsGen` finishes running, we can visualize the statistics it produced. Try playing with the different plots!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tLjXy7K6Tp_G"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(statistics_gen.outputs['statistics'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HLKLTO9Nk60p"
+ },
+ "source": [
+ "### SchemaGen\n",
+ "\n",
+ "The `SchemaGen` component generates a schema based on your data statistics. (A schema defines the expected bounds, types, and properties of the features in your dataset.) It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "Note: The generated schema is best-effort and only tries to infer basic properties of the data. It is expected that you review and modify it as needed.\n",
+ "\n",
+ "`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ygQvZ6hsiQ_J"
+ },
+ "outputs": [],
+ "source": [
+ "schema_gen = tfx.components.SchemaGen(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " infer_feature_shape=False)\n",
+ "context.run(schema_gen, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zi6TxTUKXM6b"
+ },
+ "source": [
+ "After `SchemaGen` finishes running, we can visualize the generated schema as a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Ec9vqDXpXeMb"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(schema_gen.outputs['schema'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kZWWdbA-m7zp"
+ },
+ "source": [
+ "Each feature in your dataset shows up as a row in the schema table, alongside its properties. The schema also captures all the values that a categorical feature takes on, denoted as its domain.\n",
+ "\n",
+ "To learn more about schemas, see [the SchemaGen documentation](../../../guide/schemagen)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "V1qcUuO9k9f8"
+ },
+ "source": [
+ "### ExampleValidator\n",
+ "The `ExampleValidator` component detects anomalies in your data, based on the expectations defined by the schema. It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "`ExampleValidator` will take as input the statistics from `StatisticsGen`, and the schema from `SchemaGen`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XRlRUuGgiXks"
+ },
+ "outputs": [],
+ "source": [
+ "example_validator = tfx.components.ExampleValidator(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " schema=schema_gen.outputs['schema'])\n",
+ "context.run(example_validator, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "855mrHgJcoer"
+ },
+ "source": [
+ "After `ExampleValidator` finishes running, we can visualize the anomalies as a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "TDyAAozQcrk3"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(example_validator.outputs['anomalies'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "znMoJj60ybZx"
+ },
+ "source": [
+ "In the anomalies table, we can see that there are no anomalies. This is what we'd expect, since this is the first dataset that we've analyzed and the schema is tailored to it. You should review this schema -- anything unexpected means an anomaly in the data. Once reviewed, the schema can be used to guard future data, and anomalies produced here can be used to debug model performance, understand how your data evolves over time, and identify data errors."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JPViEz5RlA36"
+ },
+ "source": [
+ "### Transform\n",
+ "The `Transform` component performs feature engineering for both training and serving. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library.\n",
+ "\n",
+ "`Transform` will take as input the data from `ExampleGen`, the schema from `SchemaGen`, as well as a module that contains user-defined Transform code.\n",
+ "\n",
+ "Let's see an example of user-defined Transform code below (for an introduction to the TensorFlow Transform APIs, [see the tutorial](/tutorials/transform/simple)). First, we define a few constants for feature engineering:\n",
+ "\n",
+ "Note: The `%%writefile` cell magic will save the contents of the cell as a `.py` file on disk. This allows the `Transform` component to load your code as a module.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "PuNSiUKb4YJf"
+ },
+ "outputs": [],
+ "source": [
+ "_taxi_constants_module_file = 'taxi_constants.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "HPjhXuIF4YJh"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_taxi_constants_module_file}\n",
+ "\n",
+ "NUMERICAL_FEATURES = ['trip_miles', 'fare', 'trip_seconds']\n",
+ "\n",
+ "BUCKET_FEATURES = [\n",
+ " 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',\n",
+ " 'dropoff_longitude'\n",
+ "]\n",
+ "# Number of buckets used by tf.transform for encoding each feature.\n",
+ "FEATURE_BUCKET_COUNT = 10\n",
+ "\n",
+ "CATEGORICAL_NUMERICAL_FEATURES = [\n",
+ " 'trip_start_hour', 'trip_start_day', 'trip_start_month',\n",
+ " 'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',\n",
+ " 'dropoff_community_area'\n",
+ "]\n",
+ "\n",
+ "CATEGORICAL_STRING_FEATURES = [\n",
+ " 'payment_type',\n",
+ " 'company',\n",
+ "]\n",
+ "\n",
+ "# Number of vocabulary terms used for encoding categorical features.\n",
+ "VOCAB_SIZE = 1000\n",
+ "\n",
+ "# Count of out-of-vocab buckets in which unrecognized categorical features are hashed.\n",
+ "OOV_SIZE = 10\n",
+ "\n",
+ "# Keys\n",
+ "LABEL_KEY = 'tips'\n",
+ "FARE_KEY = 'fare'\n",
+ "\n",
+ "def t_name(key):\n",
+ " \"\"\"\n",
+ " Rename the feature keys so that they don't clash with the raw keys when\n",
+ " running the Evaluator component.\n",
+ " Args:\n",
+ " key: The original feature key\n",
+ " Returns:\n",
+ " key with '_xf' appended\n",
+ " \"\"\"\n",
+ " return key + '_xf'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Duj2Ax5z4YJl"
+ },
+ "source": [
+ "Next, we write a `preprocessing_fn` that takes in raw data as input, and returns transformed features that our model can train on:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4AJ9hBs94YJm"
+ },
+ "outputs": [],
+ "source": [
+ "_taxi_transform_module_file = 'taxi_transform.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MYmxxx9A4YJn"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_taxi_transform_module_file}\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_transform as tft\n",
+ "\n",
+ "# Imported files such as taxi_constants are normally cached, so changes are\n",
+ "# not honored after the first import. Normally this is good for efficiency, but\n",
+ "# during development, when we may be iterating on code, it can be a problem. To\n",
+ "# avoid this problem during development, reload the file.\n",
+ "import taxi_constants\n",
+ "import sys\n",
+ "if 'google.colab' in sys.modules: # Testing to see if we're doing development\n",
+ " import importlib\n",
+ " importlib.reload(taxi_constants)\n",
+ "\n",
+ "_NUMERICAL_FEATURES = taxi_constants.NUMERICAL_FEATURES\n",
+ "_BUCKET_FEATURES = taxi_constants.BUCKET_FEATURES\n",
+ "_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT\n",
+ "_CATEGORICAL_NUMERICAL_FEATURES = taxi_constants.CATEGORICAL_NUMERICAL_FEATURES\n",
+ "_CATEGORICAL_STRING_FEATURES = taxi_constants.CATEGORICAL_STRING_FEATURES\n",
+ "_VOCAB_SIZE = taxi_constants.VOCAB_SIZE\n",
+ "_OOV_SIZE = taxi_constants.OOV_SIZE\n",
+ "_FARE_KEY = taxi_constants.FARE_KEY\n",
+ "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
+ "\n",
+ "\n",
+ "def _make_one_hot(x, key):\n",
+ " \"\"\"Make a one-hot tensor to encode categorical features.\n",
+ " Args:\n",
+ " x: A dense tensor\n",
+ " key: A string key for the feature in the input\n",
+ " Returns:\n",
+ " A dense one-hot tensor as a float list\n",
+ " \"\"\"\n",
+ " integerized = tft.compute_and_apply_vocabulary(x,\n",
+ " top_k=_VOCAB_SIZE,\n",
+ " num_oov_buckets=_OOV_SIZE,\n",
+ " vocab_filename=key, name=key)\n",
+ " depth = (\n",
+ " tft.experimental.get_vocabulary_size_by_name(key) + _OOV_SIZE)\n",
+ " one_hot_encoded = tf.one_hot(\n",
+ " integerized,\n",
+ " depth=tf.cast(depth, tf.int32),\n",
+ " on_value=1.0,\n",
+ " off_value=0.0)\n",
+ " return tf.reshape(one_hot_encoded, [-1, depth])\n",
+ "\n",
+ "\n",
+ "def _fill_in_missing(x):\n",
+ " \"\"\"Replace missing values in a SparseTensor.\n",
+ " Fills in missing values of `x` with '' or 0, and converts to a dense tensor.\n",
+ " Args:\n",
+ " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
+ " in the second dimension.\n",
+ " Returns:\n",
+ " A rank 1 tensor where missing values of `x` have been filled in.\n",
+ " \"\"\"\n",
+ " if not isinstance(x, tf.sparse.SparseTensor):\n",
+ " return x\n",
+ "\n",
+ " default_value = '' if x.dtype == tf.string else 0\n",
+ " return tf.squeeze(\n",
+ " tf.sparse.to_dense(\n",
+ " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
+ " default_value),\n",
+ " axis=1)\n",
+ "\n",
+ "\n",
+ "def preprocessing_fn(inputs):\n",
+ " \"\"\"tf.transform's callback function for preprocessing inputs.\n",
+ " Args:\n",
+ " inputs: map from feature keys to raw not-yet-transformed features.\n",
+ " Returns:\n",
+ " Map from string feature key to transformed feature operations.\n",
+ " \"\"\"\n",
+ " outputs = {}\n",
+ " for key in _NUMERICAL_FEATURES:\n",
+ " # If sparse, make it dense, setting NaNs to 0 or '', and apply z-score.\n",
+ " outputs[taxi_constants.t_name(key)] = tft.scale_to_z_score(\n",
+ " _fill_in_missing(inputs[key]), name=key)\n",
+ "\n",
+ " for key in _BUCKET_FEATURES:\n",
+ " outputs[taxi_constants.t_name(key)] = tf.cast(tft.bucketize(\n",
+ " _fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT, name=key),\n",
+ " dtype=tf.float32)\n",
+ "\n",
+ " for key in _CATEGORICAL_STRING_FEATURES:\n",
+ " outputs[taxi_constants.t_name(key)] = _make_one_hot(_fill_in_missing(inputs[key]), key)\n",
+ "\n",
+ " for key in _CATEGORICAL_NUMERICAL_FEATURES:\n",
+ " outputs[taxi_constants.t_name(key)] = _make_one_hot(tf.strings.strip(\n",
+ " tf.strings.as_string(_fill_in_missing(inputs[key]))), key)\n",
+ "\n",
+ " # Was this passenger a big tipper?\n",
+ " taxi_fare = _fill_in_missing(inputs[_FARE_KEY])\n",
+ " tips = _fill_in_missing(inputs[_LABEL_KEY])\n",
+ " outputs[_LABEL_KEY] = tf.where(\n",
+ " tf.math.is_nan(taxi_fare),\n",
+ " tf.cast(tf.zeros_like(taxi_fare), tf.int64),\n",
+ " # Test if the tip was > 20% of the fare.\n",
+ " tf.cast(\n",
+ " tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))\n",
+ "\n",
+ " return outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wgbmZr3sgbWW"
+ },
+ "source": [
+ "Now, we pass in this feature engineering code to the `Transform` component and run it to transform your data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "jHfhth_GiZI9"
+ },
+ "outputs": [],
+ "source": [
+ "transform = tfx.components.Transform(\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " module_file=os.path.abspath(_taxi_transform_module_file))\n",
+ "context.run(transform, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fwAwb4rARRQ2"
+ },
+ "source": [
+ "Let's examine the output artifacts of `Transform`. This component produces two types of outputs:\n",
+ "\n",
+ "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
+ "* `transformed_examples` represents the preprocessed training and evaluation data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SClrAaEGR1O5"
+ },
+ "outputs": [],
+ "source": [
+ "transform.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vyFkBd9AR1sy"
+ },
+ "source": [
+ "Take a peek at the `transform_graph` artifact. It points to a directory containing three subdirectories."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "5tRw4DneR3i7"
+ },
+ "outputs": [],
+ "source": [
+ "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
+ "os.listdir(train_uri)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4fqV54CIR6Pu"
+ },
+ "source": [
+ "The `transformed_metadata` subdirectory contains the schema of the preprocessed data. The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data.\n",
+ "\n",
+ "We can also take a look at the first three transformed examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pwbW2zPKR_S4"
+ },
+ "outputs": [],
+ "source": [
+ "# Get the URI of the output artifact representing the transformed examples, which is a directory\n",
+ "train_uri = os.path.join(transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')\n",
+ "\n",
+ "# Get the list of files in this directory (all compressed TFRecord files)\n",
+ "tfrecord_filenames = [os.path.join(train_uri, name)\n",
+ " for name in os.listdir(train_uri)]\n",
+ "\n",
+ "# Create a `TFRecordDataset` to read these files\n",
+ "dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ "\n",
+ "# Iterate over the first 3 records and decode them.\n",
+ "for tfrecord in dataset.take(3):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q_b_V6eN4f69"
+ },
+ "source": [
+ "After the `Transform` component has transformed your data into features, the next step is to train a model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OBJFtnl6lCg9"
+ },
+ "source": [
+ "### Trainer\n",
+ "The `Trainer` component will train a model that you define in TensorFlow.\n",
+ "\n",
+ "`Trainer` takes as input the schema from `SchemaGen`, the transformed data and graph from `Transform`, training parameters, as well as a module that contains user-defined model code.\n",
+ "\n",
+ "Let's see an example of user-defined model code below (for an introduction to the TensorFlow Keras APIs, [see the tutorial](https://www.tensorflow.org/guide/keras)):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "N1376oq04YJt"
+ },
+ "outputs": [],
+ "source": [
+ "_taxi_trainer_module_file = 'taxi_trainer.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "nf9UuNng4YJu"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_taxi_trainer_module_file}\n",
+ "\n",
+ "from typing import Dict, List, Text\n",
+ "\n",
+ "import os\n",
+ "import glob\n",
+ "from absl import logging\n",
+ "\n",
+ "import datetime\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_transform as tft\n",
+ "\n",
+ "from tfx import v1 as tfx\n",
+ "from tfx_bsl.public import tfxio\n",
+ "from tensorflow_transform import TFTransformOutput\n",
+ "\n",
+ "# Imported files such as taxi_constants are normally cached, so changes are\n",
+ "# not honored after the first import. Normally this is good for efficiency, but\n",
+ "# during development, when we may be iterating on code, it can be a problem. To\n",
+ "# avoid this problem during development, reload the file.\n",
+ "import taxi_constants\n",
+ "import sys\n",
+ "if 'google.colab' in sys.modules: # Testing to see if we're doing development\n",
+ " import importlib\n",
+ " importlib.reload(taxi_constants)\n",
+ "\n",
+ "_LABEL_KEY = taxi_constants.LABEL_KEY\n",
+ "\n",
+ "_BATCH_SIZE = 40\n",
+ "\n",
+ "\n",
+ "def _input_fn(file_pattern: List[Text],\n",
+ " data_accessor: tfx.components.DataAccessor,\n",
+ " tf_transform_output: tft.TFTransformOutput,\n",
+ " batch_size: int = 200) -> tf.data.Dataset:\n",
+ " \"\"\"Generates features and label for tuning/training.\n",
+ "\n",
+ " Args:\n",
+ " file_pattern: List of paths or patterns of input tfrecord files.\n",
+ " data_accessor: DataAccessor for converting input to RecordBatch.\n",
+ " tf_transform_output: A TFTransformOutput.\n",
+ " batch_size: representing the number of consecutive elements of returned\n",
+ " dataset to combine in a single batch\n",
+ "\n",
+ " Returns:\n",
+ " A dataset that contains (features, indices) tuple where features is a\n",
+ " dictionary of Tensors, and indices is a single Tensor of label indices.\n",
+ " \"\"\"\n",
+ " return data_accessor.tf_dataset_factory(\n",
+ " file_pattern,\n",
+ " tfxio.TensorFlowDatasetOptions(\n",
+ " batch_size=batch_size, label_key=_LABEL_KEY),\n",
+ " tf_transform_output.transformed_metadata.schema)\n",
+ "\n",
+ "def _get_tf_examples_serving_signature(model, tf_transform_output):\n",
+ " \"\"\"Returns a serving signature that accepts `tensorflow.Example`.\"\"\"\n",
+ "\n",
+ " # We need to track the layers in the model in order to save it.\n",
+ " # TODO(b/162357359): Revise once the bug is resolved.\n",
+ " model.tft_layer_inference = tf_transform_output.transform_features_layer()\n",
+ "\n",
+ " @tf.function(input_signature=[\n",
+ " tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')\n",
+ " ])\n",
+ " def serve_tf_examples_fn(serialized_tf_example):\n",
+ " \"\"\"Returns the output to be used in the serving signature.\"\"\"\n",
+ " raw_feature_spec = tf_transform_output.raw_feature_spec()\n",
+ " # Remove label feature since these will not be present at serving time.\n",
+ " raw_feature_spec.pop(_LABEL_KEY)\n",
+ " raw_features = tf.io.parse_example(serialized_tf_example, raw_feature_spec)\n",
+ " transformed_features = model.tft_layer_inference(raw_features)\n",
+ " logging.info('serve_transformed_features = %s', transformed_features)\n",
+ "\n",
+ " outputs = model(transformed_features)\n",
+ " # TODO(b/154085620): Convert the predicted labels from the model using a\n",
+ " # reverse-lookup (opposite of transform.py).\n",
+ " return {'outputs': outputs}\n",
+ "\n",
+ " return serve_tf_examples_fn\n",
+ "\n",
+ "\n",
+ "def _get_transform_features_signature(model, tf_transform_output):\n",
+ " \"\"\"Returns a serving signature that applies tf.Transform to features.\"\"\"\n",
+ "\n",
+ " # We need to track the layers in the model in order to save it.\n",
+ " # TODO(b/162357359): Revise once the bug is resolved.\n",
+ " model.tft_layer_eval = tf_transform_output.transform_features_layer()\n",
+ "\n",
+ " @tf.function(input_signature=[\n",
+ " tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')\n",
+ " ])\n",
+ " def transform_features_fn(serialized_tf_example):\n",
+ " \"\"\"Returns the transformed_features to be fed as input to evaluator.\"\"\"\n",
+ " raw_feature_spec = tf_transform_output.raw_feature_spec()\n",
+ " raw_features = tf.io.parse_example(serialized_tf_example, raw_feature_spec)\n",
+ " transformed_features = model.tft_layer_eval(raw_features)\n",
+ " logging.info('eval_transformed_features = %s', transformed_features)\n",
+ " return transformed_features\n",
+ "\n",
+ " return transform_features_fn\n",
+ "\n",
+ "\n",
+ "def export_serving_model(tf_transform_output, model, output_dir):\n",
+ " \"\"\"Exports a keras model for serving.\n",
+ " Args:\n",
+ " tf_transform_output: Wrapper around output of tf.Transform.\n",
+ " model: A keras model to export for serving.\n",
+ " output_dir: A directory where the model will be exported to.\n",
+ " \"\"\"\n",
+ " # The layer has to be saved to the model for keras tracking purposes.\n",
+ " model.tft_layer = tf_transform_output.transform_features_layer()\n",
+ "\n",
+ " signatures = {\n",
+ " 'serving_default':\n",
+ " _get_tf_examples_serving_signature(model, tf_transform_output),\n",
+ " 'transform_features':\n",
+ " _get_transform_features_signature(model, tf_transform_output),\n",
+ " }\n",
+ "\n",
+ " model.save(output_dir, save_format='tf', signatures=signatures)\n",
+ "\n",
+ "\n",
+ "def _build_keras_model(tf_transform_output: TFTransformOutput\n",
+ " ) -> tf.keras.Model:\n",
+ " \"\"\"Creates a DNN Keras model for classifying taxi data.\n",
+ "\n",
+ " Args:\n",
+ " tf_transform_output: [TFTransformOutput], the outputs from Transform\n",
+ "\n",
+ " Returns:\n",
+ " A keras Model.\n",
+ " \"\"\"\n",
+ " feature_spec = tf_transform_output.transformed_feature_spec().copy()\n",
+ " feature_spec.pop(_LABEL_KEY)\n",
+ "\n",
+ " inputs = {}\n",
+ " for key, spec in feature_spec.items():\n",
+ " if isinstance(spec, tf.io.VarLenFeature):\n",
+ " inputs[key] = tf.keras.layers.Input(\n",
+ " shape=[None], name=key, dtype=spec.dtype, sparse=True)\n",
+ " elif isinstance(spec, tf.io.FixedLenFeature):\n",
+ " # TODO(b/208879020): Move into schema such that spec.shape is [1] and not\n",
+ " # [] for scalars.\n",
+ " inputs[key] = tf.keras.layers.Input(\n",
+ " shape=spec.shape or [1], name=key, dtype=spec.dtype)\n",
+ " else:\n",
+ " raise ValueError('Spec type is not supported: ', key, spec)\n",
+ "\n",
+ " output = tf.keras.layers.Concatenate()(tf.nest.flatten(inputs))\n",
+ " output = tf.keras.layers.Dense(100, activation='relu')(output)\n",
+ " output = tf.keras.layers.Dense(70, activation='relu')(output)\n",
+ " output = tf.keras.layers.Dense(50, activation='relu')(output)\n",
+ " output = tf.keras.layers.Dense(20, activation='relu')(output)\n",
+ " output = tf.keras.layers.Dense(1)(output)\n",
+ " return tf.keras.Model(inputs=inputs, outputs=output)\n",
+ "\n",
+ "\n",
+ "# TFX Trainer will call this function.\n",
+ "def run_fn(fn_args: tfx.components.FnArgs):\n",
+ " \"\"\"Train the model based on given args.\n",
+ "\n",
+ " Args:\n",
+ " fn_args: Holds args used to train the model as name/value pairs.\n",
+ " \"\"\"\n",
+ " tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)\n",
+ "\n",
+ " train_dataset = _input_fn(fn_args.train_files, fn_args.data_accessor,\n",
+ " tf_transform_output, _BATCH_SIZE)\n",
+ " eval_dataset = _input_fn(fn_args.eval_files, fn_args.data_accessor,\n",
+ " tf_transform_output, _BATCH_SIZE)\n",
+ "\n",
+ " model = _build_keras_model(tf_transform_output)\n",
+ "\n",
+ " model.compile(\n",
+ " loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n",
+ " optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),\n",
+ " metrics=[tf.keras.metrics.BinaryAccuracy()])\n",
+ "\n",
+ " tensorboard_callback = tf.keras.callbacks.TensorBoard(\n",
+ " log_dir=fn_args.model_run_dir, update_freq='batch')\n",
+ "\n",
+ " model.fit(\n",
+ " train_dataset,\n",
+ " steps_per_epoch=fn_args.train_steps,\n",
+ " validation_data=eval_dataset,\n",
+ " validation_steps=fn_args.eval_steps,\n",
+ " callbacks=[tensorboard_callback])\n",
+ "\n",
+ " # Export the model.\n",
+ " export_serving_model(tf_transform_output, model, fn_args.serving_model_dir)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GY4yTRaX4YJx"
+ },
+ "source": [
+ "Now, we pass in this model code to the `Trainer` component and run it to train the model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "429-vvCWibO0"
+ },
+ "outputs": [],
+ "source": [
+ "trainer = tfx.components.Trainer(\n",
+ " module_file=os.path.abspath(_taxi_trainer_module_file),\n",
+ " examples=transform.outputs['transformed_examples'],\n",
+ " transform_graph=transform.outputs['transform_graph'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " train_args=tfx.proto.TrainArgs(num_steps=10000),\n",
+ " eval_args=tfx.proto.EvalArgs(num_steps=5000))\n",
+ "context.run(trainer, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6Cql1G35StJp"
+ },
+ "source": [
+ "#### Analyze Training with TensorBoard\n",
+ "Take a peek at the trainer artifact. It points to a directory containing the model subdirectories."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bXe62WE0S0Ek"
+ },
+ "outputs": [],
+ "source": [
+ "model_artifact_dir = trainer.outputs['model'].get()[0].uri\n",
+ "pp.pprint(os.listdir(model_artifact_dir))\n",
+ "model_dir = os.path.join(model_artifact_dir, 'Format-Serving')\n",
+ "pp.pprint(os.listdir(model_dir))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DfjOmSro6Q3Y"
+ },
+ "source": [
+ "Optionally, we can connect TensorBoard to the Trainer to analyze our model's training curves."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "-APzqz2NeAyj"
+ },
+ "outputs": [],
+ "source": [
+ "model_run_artifact_dir = trainer.outputs['model_run'].get()[0].uri\n",
+ "\n",
+ "%load_ext tensorboard\n",
+ "%tensorboard --logdir {model_run_artifact_dir}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FmPftrv0lEQy"
+ },
+ "source": [
+ "### Evaluator\n",
+ "The `Evaluator` component computes model performance metrics over the evaluation set. It uses the [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/model_analysis/get_started) library. The `Evaluator` can also optionally validate that a newly trained model is better than the previous model. This is useful in a production pipeline setting where you may automatically train and validate a model every day. In this notebook, we only train one model, so the `Evaluator` automatically will label the\n",
+ "model as \"good\".\n",
+ "\n",
+ "`Evaluator` will take as input the data from `ExampleGen`, the trained model from `Trainer`, and slicing configuration. The slicing configuration allows you to slice your metrics on feature values (e.g. how does your model perform on taxi trips that start at 8am versus 8pm?). See an example of this configuration below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "fVhfzzh9PDEx"
+ },
+ "outputs": [],
+ "source": [
+ "# Imported files such as taxi_constants are normally cached, so changes are\n",
+ "# not honored after the first import. Normally this is good for efficiency, but\n",
+ "# during development when we may be iterating code it can be a problem. To\n",
+ "# avoid this problem during development, reload the file.\n",
+ "import taxi_constants\n",
+ "import sys\n",
+ "if 'google.colab' in sys.modules: # Testing to see if we're doing development\n",
+ " import importlib\n",
+ " importlib.reload(taxi_constants)\n",
+ "\n",
+ "eval_config = tfma.EvalConfig(\n",
+ " model_specs=[\n",
+ " # This assumes a serving model with signature 'serving_default'. If\n",
+ " # using estimator based EvalSavedModel, add signature_name: 'eval' and\n",
+ " # remove the label_key.\n",
+ " tfma.ModelSpec(\n",
+ " signature_name='serving_default',\n",
+ " label_key=taxi_constants.LABEL_KEY,\n",
+ " preprocessing_function_names=['transform_features'],\n",
+ " )\n",
+ " ],\n",
+ " metrics_specs=[\n",
+ " tfma.MetricsSpec(\n",
+ " # The metrics added here are in addition to those saved with the\n",
+ " # model (assuming either a keras model or EvalSavedModel is used).\n",
+ " # Any metrics added into the saved model (for example using\n",
+ " # model.compile(..., metrics=[...]), etc) will be computed\n",
+ " # automatically.\n",
+ " # To add validation thresholds for metrics saved with the model,\n",
+ " # add them keyed by metric name to the thresholds map.\n",
+ " metrics=[\n",
+ " tfma.MetricConfig(class_name='ExampleCount'),\n",
+ " tfma.MetricConfig(class_name='BinaryAccuracy',\n",
+ " threshold=tfma.MetricThreshold(\n",
+ " value_threshold=tfma.GenericValueThreshold(\n",
+ " lower_bound={'value': 0.5}),\n",
+ " # Change threshold will be ignored if there is no\n",
+ " # baseline model resolved from MLMD (first run).\n",
+ " change_threshold=tfma.GenericChangeThreshold(\n",
+ " direction=tfma.MetricDirection.HIGHER_IS_BETTER,\n",
+ " absolute={'value': -1e-10})))\n",
+ " ]\n",
+ " )\n",
+ " ],\n",
+ " slicing_specs=[\n",
+ " # An empty slice spec means the overall slice, i.e. the whole dataset.\n",
+ " tfma.SlicingSpec(),\n",
+ " # Data can be sliced along a feature column. In this case, data is\n",
+ " # sliced along feature column trip_start_hour.\n",
+ " tfma.SlicingSpec(\n",
+ " feature_keys=['trip_start_hour'])\n",
+ " ])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9mBdKH1F8JuT"
+ },
+ "source": [
+ "Next, we give this configuration to `Evaluator` and run it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Zjcx8g6mihSt"
+ },
+ "outputs": [],
+ "source": [
+ "# Use TFMA to compute a evaluation statistics over features of a model and\n",
+ "# validate them against a baseline.\n",
+ "\n",
+ "# The model resolver is only required if performing model validation in addition\n",
+ "# to evaluation. In this case we validate against the latest blessed model. If\n",
+ "# no model has been blessed before (as in this case) the evaluator will make our\n",
+ "# candidate the first blessed model.\n",
+ "model_resolver = tfx.dsl.Resolver(\n",
+ " strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,\n",
+ " model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),\n",
+ " model_blessing=tfx.dsl.Channel(\n",
+ " type=tfx.types.standard_artifacts.ModelBlessing)).with_id(\n",
+ " 'latest_blessed_model_resolver')\n",
+ "context.run(model_resolver, enable_cache=True)\n",
+ "\n",
+ "evaluator = tfx.components.Evaluator(\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " model=trainer.outputs['model'],\n",
+ " baseline_model=model_resolver.outputs['model'],\n",
+ " eval_config=eval_config)\n",
+ "context.run(evaluator, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AeCVkBusS_8g"
+ },
+ "source": [
+ "Now let's examine the output artifacts of `Evaluator`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "k4GghePOTJxL"
+ },
+ "outputs": [],
+ "source": [
+ "evaluator.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Y5TMskWe9LL0"
+ },
+ "source": [
+ "Using the `evaluation` output we can show the default visualization of global metrics on the entire evaluation set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "U729j5X5QQUQ"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(evaluator.outputs['evaluation'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t-tI4p6m-OAn"
+ },
+ "source": [
+ "To see the visualization for sliced evaluation metrics, we can directly call the TensorFlow Model Analysis library."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pyis6iy0HLdi"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow_model_analysis as tfma\n",
+ "\n",
+ "# Get the TFMA output result path and load the result.\n",
+ "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
+ "tfma_result = tfma.load_eval_result(PATH_TO_RESULT)\n",
+ "\n",
+ "# Show data sliced along feature column trip_start_hour.\n",
+ "tfma.view.render_slicing_metrics(\n",
+ " tfma_result, slicing_column='trip_start_hour')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7uvYrUf2-r_6"
+ },
+ "source": [
+ "This visualization shows the same metrics, but computed at every feature value of `trip_start_hour` instead of on the entire evaluation set.\n",
+ "\n",
+ "TensorFlow Model Analysis supports many other visualizations, such as Fairness Indicators and plotting a time series of model performance. To learn more, see [the tutorial](/tutorials/model_analysis/tfma_basic)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TEotnkxEswUb"
+ },
+ "source": [
+ "Since we added thresholds to our config, validation output is also available. The presence of a `blessing` artifact indicates that our model passed validation. Since this is the first validation being performed the candidate is automatically blessed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FZmiRtg6TKtR"
+ },
+ "outputs": [],
+ "source": [
+ "blessing_uri = evaluator.outputs['blessing'].get()[0].uri\n",
+ "!ls -l {blessing_uri}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hM1tFkOVSBa0"
+ },
+ "source": [
+ "Now can also verify the success by loading the validation result record:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "lxa5G08bSJ8a"
+ },
+ "outputs": [],
+ "source": [
+ "PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri\n",
+ "print(tfma.load_validation_result(PATH_TO_RESULT))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T8DYekCZlHfj"
+ },
+ "source": [
+ "### Pusher\n",
+ "The `Pusher` component is usually at the end of a TFX pipeline. It checks whether a model has passed validation, and if so, exports the model to `_serving_model_dir`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "r45nQ69eikc9"
+ },
+ "outputs": [],
+ "source": [
+ "pusher = tfx.components.Pusher(\n",
+ " model=trainer.outputs['model'],\n",
+ " model_blessing=evaluator.outputs['blessing'],\n",
+ " push_destination=tfx.proto.PushDestination(\n",
+ " filesystem=tfx.proto.PushDestination.Filesystem(\n",
+ " base_directory=_serving_model_dir)))\n",
+ "context.run(pusher, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ctUErBYoTO9I"
+ },
+ "source": [
+ "Let's examine the output artifacts of `Pusher`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pRkWo-MzTSss"
+ },
+ "outputs": [],
+ "source": [
+ "pusher.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "peH2PPS3VgkL"
+ },
+ "source": [
+ "In particular, the Pusher will export your model in the SavedModel format, which looks like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4zyIqWl9TSdG"
+ },
+ "outputs": [],
+ "source": [
+ "push_uri = pusher.outputs['pushed_model'].get()[0].uri\n",
+ "model = tf.saved_model.load(push_uri)\n",
+ "\n",
+ "for item in model.signatures.items():\n",
+ " pp.pprint(item)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3-YPNUuHANtj"
+ },
+ "source": [
+ "We're finished our tour of built-in TFX components!"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "collapsed_sections": [
+ "wdeKOEkv1Fe8"
+ ],
+ "name": "components_keras.ipynb",
+ "private_outputs": true,
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}
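Editor's note between the two notebook diffs: the per-feature slicing that `tfma.SlicingSpec(feature_keys=['trip_start_hour'])` configures in the Evaluator cell above can be illustrated with a plain-Python sketch. The helper and toy examples below are hypothetical stand-ins for what TFMA computes internally; they are not part of the TFMA API.

```python
# Toy illustration of per-slice metrics, mimicking what
# tfma.SlicingSpec(feature_keys=['trip_start_hour']) produces:
# one metric value per distinct value of the slicing feature.
from collections import defaultdict


def slice_accuracy(examples, slice_key):
    """Group examples by a feature value and compute accuracy per slice."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        key = ex[slice_key]
        total[key] += 1
        if ex['label'] == ex['prediction']:
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}


# Hypothetical examples, not drawn from the Chicago taxi dataset.
examples = [
    {'trip_start_hour': 8, 'label': 1, 'prediction': 1},
    {'trip_start_hour': 8, 'label': 0, 'prediction': 1},
    {'trip_start_hour': 20, 'label': 1, 'prediction': 1},
    {'trip_start_hour': 20, 'label': 0, 'prediction': 0},
]
print(slice_accuracy(examples, 'trip_start_hour'))
# 8am trips: 0.5 accuracy; 8pm trips: 1.0 accuracy
```

An empty `tfma.SlicingSpec()` corresponds to computing the same metric over all examples at once, i.e. the "overall" slice shown in the default visualization.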
diff --git a/docs/tutorials/tfx/gcp/vertex_pipelines_vertex_training.ipynb b/docs/tutorials/tfx/gcp/vertex_pipelines_vertex_training.ipynb
index 9773b9f317..55154e1d6e 100644
--- a/docs/tutorials/tfx/gcp/vertex_pipelines_vertex_training.ipynb
+++ b/docs/tutorials/tfx/gcp/vertex_pipelines_vertex_training.ipynb
@@ -1,50 +1,50 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "pknVo1kM2wI2"
- },
- "source": [
- "##### Copyright 2021 The TensorFlow Authors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "SoFqANDE222Y"
- },
- "outputs": [],
- "source": [
- "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6x1ypzczQCwy"
- },
- "source": [
- "# Vertex AI Training and Serving with TFX and Vertex Pipelines\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_445qeKq8e3-"
- },
- "source": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pknVo1kM2wI2"
+ },
+ "source": [
+ "##### Copyright 2021 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SoFqANDE222Y"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6x1ypzczQCwy"
+ },
+ "source": [
+ "# Vertex AI Training and Serving with TFX and Vertex Pipelines\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_445qeKq8e3-"
+ },
+ "source": [
"Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n",
"\n",
"\n",
@@ -81,965 +81,965 @@
" \n",
"
"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_VuwrlnvQJ5k"
- },
- "source": [
- "This notebook-based tutorial will create and run a TFX pipeline which trains an\n",
- "ML model using Vertex AI Training service and publishes it to Vertex AI for serving.\n",
- "\n",
- "This notebook is based on the TFX pipeline we built in\n",
- "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple).\n",
- "If you have not read that tutorial yet, you should read it before proceeding\n",
- "with this notebook.\n",
- "\n",
- "You can train models on Vertex AI using AutoML, or use custom training. In\n",
- "custom training, you can select many different machine types to power your\n",
- "training jobs, enable distributed training, use hyperparameter tuning, and\n",
- "accelerate with GPUs.\n",
- "\n",
- "You can also serve prediction requests by deploying the trained model to Vertex AI\n",
- "Models and creating an endpoint.\n",
- "\n",
- "In this tutorial, we will use Vertex AI Training with custom jobs to train\n",
- "a model in a TFX pipeline.\n",
- "We will also deploy the model to serve prediction request using Vertex AI.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "S5Pv2qm3wfpL"
- },
- "source": [
- "This notebook is intended to be run on\n",
- "[Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) or on\n",
- "[AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks). If you\n",
- "are not using one of these, you can simply click \"Run in Google Colab\" button\n",
- "above.\n",
- "\n",
- "## Set up\n",
- "If you have completed\n",
- "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple),\n",
- "you will have a working GCP project and a GCS bucket and that is all we need\n",
- "for this tutorial. Please read the preliminary tutorial first if you missed it."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fwZ0aXisoBFW"
- },
- "source": [
- "### Install python packages"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "WC9W_S-bONgl"
- },
- "source": [
- "We will install required Python packages including TFX and KFP to author ML\n",
- "pipelines and submit jobs to Vertex Pipelines."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "iyQtljP-qPHY"
- },
- "outputs": [],
- "source": [
- "# Use the latest version of pip.\n",
- "!pip install --upgrade pip\n",
- "!pip install --upgrade \"tfx[kfp]\u003c2\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "EwT0nov5QO1M"
- },
- "source": [
- "#### Did you restart the runtime?\n",
- "\n",
- "If you are using Google Colab, the first time that you run\n",
- "the cell above, you must restart the runtime by clicking\n",
- "above \"RESTART RUNTIME\" button or using \"Runtime \u003e Restart\n",
- "runtime ...\" menu. This is because of the way that Colab\n",
- "loads packages."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-CRyIL4LVDlQ"
- },
- "source": [
- "If you are not on Colab, you can restart runtime with following cell."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "KHTSzMygoBF6"
- },
- "outputs": [],
- "source": [
- "# docs_infra: no_execute\n",
- "import sys\n",
- "if not 'google.colab' in sys.modules:\n",
- " # Automatically restart kernel after installs\n",
- " import IPython\n",
- " app = IPython.Application.instance()\n",
- " app.kernel.do_shutdown(True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "gckGHdW9iPrq"
- },
- "source": [
- "### Login in to Google for this notebook\n",
- "If you are running this notebook on Colab, authenticate with your user account:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "kZQA0KrfXCvU"
- },
- "outputs": [],
- "source": [
- "import sys\n",
- "if 'google.colab' in sys.modules:\n",
- " from google.colab import auth\n",
- " auth.authenticate_user()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "aaqJjbmk6o0o"
- },
- "source": [
- "**If you are on AI Platform Notebooks**, authenticate with Google Cloud before\n",
- "running the next section, by running\n",
- "```sh\n",
- "gcloud auth login\n",
- "```\n",
- "**in the Terminal window** (which you can open via **File** \u003e **New** in the\n",
- "menu). You only need to do this once per notebook instance."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3_SveIKxaENu"
- },
- "source": [
- "Check the package versions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Xd-iP9wEaENu"
- },
- "outputs": [],
- "source": [
- "import tensorflow as tf\n",
- "print('TensorFlow version: {}'.format(tf.__version__))\n",
- "from tfx import v1 as tfx\n",
- "print('TFX version: {}'.format(tfx.__version__))\n",
- "import kfp\n",
- "print('KFP version: {}'.format(kfp.__version__))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "aDtLdSkvqPHe"
- },
- "source": [
- "### Set up variables\n",
- "\n",
- "We will set up some variables used to customize the pipelines below. Following\n",
- "information is required:\n",
- "\n",
- "* GCP Project id. See\n",
- "[Identifying your project id](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).\n",
- "* GCP Region to run pipelines. For more information about the regions that\n",
- "Vertex Pipelines is available in, see the\n",
- "[Vertex AI locations guide](https://cloud.google.com/vertex-ai/docs/general/locations#feature-availability).\n",
- "* Google Cloud Storage Bucket to store pipeline outputs.\n",
- "\n",
- "**Enter required values in the cell below before running it**.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "EcUseqJaE2XN"
- },
- "outputs": [],
- "source": [
- "GOOGLE_CLOUD_PROJECT = '' # \u003c--- ENTER THIS\n",
- "GOOGLE_CLOUD_REGION = '' # \u003c--- ENTER THIS\n",
- "GCS_BUCKET_NAME = '' # \u003c--- ENTER THIS\n",
- "\n",
- "if not (GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION and GCS_BUCKET_NAME):\n",
- " from absl import logging\n",
- " logging.error('Please set all required parameters.')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GAaCPLjgiJrO"
- },
- "source": [
- "Set `gcloud` to use your project."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "VkWdxe4TXRHk"
- },
- "outputs": [],
- "source": [
- "!gcloud config set project {GOOGLE_CLOUD_PROJECT}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "CPN6UL5CazNy"
- },
- "outputs": [],
- "source": [
- "PIPELINE_NAME = 'penguin-vertex-training'\n",
- "\n",
- "# Path to various pipeline artifact.\n",
- "PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)\n",
- "\n",
- "# Paths for users' Python module.\n",
- "MODULE_ROOT = 'gs://{}/pipeline_module/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)\n",
- "\n",
- "# Paths for users' data.\n",
- "DATA_ROOT = 'gs://{}/data/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)\n",
- "\n",
- "# Name of Vertex AI Endpoint.\n",
- "ENDPOINT_NAME = 'prediction-' + PIPELINE_NAME\n",
- "\n",
- "print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8F2SRwRLSYGa"
- },
- "source": [
- "### Prepare example data\n",
- "We will use the same\n",
- "[Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)\n",
- "as\n",
- "[Simple TFX Pipeline Tutorial](/tutorials/tfx/penguin_simple).\n",
- "\n",
- "There are four numeric features in this dataset which were already normalized\n",
- "to have range [0,1]. We will build a classification model which predicts the\n",
- "`species` of penguins."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "11J7XiCq6AFP"
- },
- "source": [
- "We need to make our own copy of the dataset. Because TFX ExampleGen reads\n",
- "inputs from a directory, we need to create a directory and copy dataset to it\n",
- "on GCS."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4fxMs6u86acP"
- },
- "outputs": [],
- "source": [
- "!gsutil cp gs://download.tensorflow.org/data/palmer_penguins/penguins_processed.csv {DATA_ROOT}/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ASpoNmxKSQjI"
- },
- "source": [
- "Take a quick look at the CSV file."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "-eSz28UDSnlG"
- },
- "outputs": [],
- "source": [
- "!gsutil cat {DATA_ROOT}/penguins_processed.csv | head"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nH6gizcpSwWV"
- },
- "source": [
- "## Create a pipeline\n",
- "\n",
- "Our pipeline will be very similar to the pipeline we created in\n",
- "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple).\n",
- "The pipeline will consists of three components, CsvExampleGen, Trainer and\n",
- "Pusher. But we will use a special Trainer and Pusher component. The Trainer component will move\n",
- "training workloads to Vertex AI, and the Pusher component will publish the\n",
- "trained ML model to Vertex AI instead of a filesystem.\n",
- "\n",
- "TFX provides a special `Trainer` to submit training jobs to Vertex AI Training\n",
- "service. All we have to do is use `Trainer` in the extension module\n",
- "instead of the standard `Trainer` component along with some required GCP\n",
- "parameters.\n",
- "\n",
- "In this tutorial, we will run Vertex AI Training jobs only using CPUs first\n",
- "and then with a GPU.\n",
- "\n",
- "TFX also provides a special `Pusher` to upload the model to *Vertex AI Models*.\n",
- "`Pusher` will create *Vertex AI Endpoint* resource to serve online\n",
- "perdictions, too. See\n",
- "[Vertex AI documentation](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions)\n",
- "to learn more about online predictions provided by Vertex AI."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "lOjDv93eS5xV"
- },
- "source": [
- "### Write model code.\n",
- "\n",
- "The model itself is almost similar to the model in\n",
- "[Simple TFX Pipeline Tutorial](/tutorials/tfx/penguin_simple).\n",
- "\n",
- "We will add `_get_distribution_strategy()` function which creates a\n",
- "[TensorFlow distribution strategy](https://www.tensorflow.org/guide/distributed_training)\n",
- "and it is used in `run_fn` to use MirroredStrategy if GPU is available."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "aES7Hv5QTDK3"
- },
- "outputs": [],
- "source": [
- "_trainer_module_file = 'penguin_trainer.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Gnc67uQNTDfW"
- },
- "outputs": [],
- "source": [
- "%%writefile {_trainer_module_file}\n",
- "\n",
- "# Copied from https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple and\n",
- "# slightly modified run_fn() to add distribution_strategy.\n",
- "\n",
- "from typing import List\n",
- "from absl import logging\n",
- "import tensorflow as tf\n",
- "from tensorflow import keras\n",
- "from tensorflow_metadata.proto.v0 import schema_pb2\n",
- "from tensorflow_transform.tf_metadata import schema_utils\n",
- "\n",
- "from tfx import v1 as tfx\n",
- "from tfx_bsl.public import tfxio\n",
- "\n",
- "_FEATURE_KEYS = [\n",
- " 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'\n",
- "]\n",
- "_LABEL_KEY = 'species'\n",
- "\n",
- "_TRAIN_BATCH_SIZE = 20\n",
- "_EVAL_BATCH_SIZE = 10\n",
- "\n",
- "# Since we're not generating or creating a schema, we will instead create\n",
- "# a feature spec. Since there are a fairly small number of features this is\n",
- "# manageable for this dataset.\n",
- "_FEATURE_SPEC = {\n",
- " **{\n",
- " feature: tf.io.FixedLenFeature(shape=[1], dtype=tf.float32)\n",
- " for feature in _FEATURE_KEYS\n",
- " }, _LABEL_KEY: tf.io.FixedLenFeature(shape=[1], dtype=tf.int64)\n",
- "}\n",
- "\n",
- "\n",
- "def _input_fn(file_pattern: List[str],\n",
- " data_accessor: tfx.components.DataAccessor,\n",
- " schema: schema_pb2.Schema,\n",
- " batch_size: int) -\u003e tf.data.Dataset:\n",
- " \"\"\"Generates features and label for training.\n",
- "\n",
- " Args:\n",
- " file_pattern: List of paths or patterns of input tfrecord files.\n",
- " data_accessor: DataAccessor for converting input to RecordBatch.\n",
- " schema: schema of the input data.\n",
- " batch_size: representing the number of consecutive elements of returned\n",
- " dataset to combine in a single batch\n",
- "\n",
- " Returns:\n",
- " A dataset that contains (features, indices) tuple where features is a\n",
- " dictionary of Tensors, and indices is a single Tensor of label indices.\n",
- " \"\"\"\n",
- " return data_accessor.tf_dataset_factory(\n",
- " file_pattern,\n",
- " tfxio.TensorFlowDatasetOptions(\n",
- " batch_size=batch_size, label_key=_LABEL_KEY),\n",
- " schema=schema).repeat()\n",
- "\n",
- "\n",
- "def _make_keras_model() -\u003e tf.keras.Model:\n",
- " \"\"\"Creates a DNN Keras model for classifying penguin data.\n",
- "\n",
- " Returns:\n",
- " A Keras Model.\n",
- " \"\"\"\n",
- " # The model below is built with Functional API, please refer to\n",
- " # https://www.tensorflow.org/guide/keras/overview for all API options.\n",
- " inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]\n",
- " d = keras.layers.concatenate(inputs)\n",
- " for _ in range(2):\n",
- " d = keras.layers.Dense(8, activation='relu')(d)\n",
- " outputs = keras.layers.Dense(3)(d)\n",
- "\n",
- " model = keras.Model(inputs=inputs, outputs=outputs)\n",
- " model.compile(\n",
- " optimizer=keras.optimizers.Adam(1e-2),\n",
- " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n",
- " metrics=[keras.metrics.SparseCategoricalAccuracy()])\n",
- "\n",
- " model.summary(print_fn=logging.info)\n",
- " return model\n",
- "\n",
- "\n",
- "# NEW: Read `use_gpu` from the custom_config of the Trainer.\n",
- "# if it uses GPU, enable MirroredStrategy.\n",
- "def _get_distribution_strategy(fn_args: tfx.components.FnArgs):\n",
- " if fn_args.custom_config.get('use_gpu', False):\n",
- " logging.info('Using MirroredStrategy with one GPU.')\n",
- " return tf.distribute.MirroredStrategy(devices=['device:GPU:0'])\n",
- " return None\n",
- "\n",
- "\n",
- "# TFX Trainer will call this function.\n",
- "def run_fn(fn_args: tfx.components.FnArgs):\n",
- " \"\"\"Train the model based on given args.\n",
- "\n",
- " Args:\n",
- " fn_args: Holds args used to train the model as name/value pairs.\n",
- " \"\"\"\n",
- "\n",
- " # This schema is usually either an output of SchemaGen or a manually-curated\n",
- " # version provided by pipeline author. A schema can also derived from TFT\n",
- " # graph if a Transform component is used. In the case when either is missing,\n",
- " # `schema_from_feature_spec` could be used to generate schema from very simple\n",
- " # feature_spec, but the schema returned would be very primitive.\n",
- " schema = schema_utils.schema_from_feature_spec(_FEATURE_SPEC)\n",
- "\n",
- " train_dataset = _input_fn(\n",
- " fn_args.train_files,\n",
- " fn_args.data_accessor,\n",
- " schema,\n",
- " batch_size=_TRAIN_BATCH_SIZE)\n",
- " eval_dataset = _input_fn(\n",
- " fn_args.eval_files,\n",
- " fn_args.data_accessor,\n",
- " schema,\n",
- " batch_size=_EVAL_BATCH_SIZE)\n",
- "\n",
- " # NEW: If we have a distribution strategy, build a model in a strategy scope.\n",
- " strategy = _get_distribution_strategy(fn_args)\n",
- " if strategy is None:\n",
- " model = _make_keras_model()\n",
- " else:\n",
- " with strategy.scope():\n",
- " model = _make_keras_model()\n",
- "\n",
- " model.fit(\n",
- " train_dataset,\n",
- " steps_per_epoch=fn_args.train_steps,\n",
- " validation_data=eval_dataset,\n",
- " validation_steps=fn_args.eval_steps)\n",
- "\n",
- " # The result of the training should be saved in `fn_args.serving_model_dir`\n",
- " # directory.\n",
- " model.save(fn_args.serving_model_dir, save_format='tf')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-LsYx8MpYvPv"
- },
- "source": [
- "Copy the module file to GCS which can be accessed from the pipeline components.\n",
- "\n",
- "Otherwise, you might want to build a container image including the module file\n",
- "and use the image to run the pipeline and AI Platform Training jobs."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "rMMs5wuNYAbc"
- },
- "outputs": [],
- "source": [
- "!gsutil cp {_trainer_module_file} {MODULE_ROOT}/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "w3OkNz3gTLwM"
- },
- "source": [
- "### Write a pipeline definition\n",
- "\n",
- "We will define a function to create a TFX pipeline. It has the same three\n",
- "Components as in\n",
- "[Simple TFX Pipeline Tutorial](/tutorials/tfx/penguin_simple),\n",
- "but we use a `Trainer` and `Pusher` component in the GCP extension module.\n",
- "\n",
- "`tfx.extensions.google_cloud_ai_platform.Trainer` behaves like a regular\n",
- "`Trainer`, but it just moves the computation for the model training to cloud.\n",
- "It launches a custom job in Vertex AI Training service and the trainer\n",
- "component in the orchestration system will just wait until the Vertex AI\n",
- "Training job completes.\n",
- "\n",
- "`tfx.extensions.google_cloud_ai_platform.Pusher` creates a Vertex AI Model and a Vertex AI Endpoint using the\n",
- "trained model.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "M49yYVNBTPd4"
- },
- "outputs": [],
- "source": [
- "def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,\n",
- " module_file: str, endpoint_name: str, project_id: str,\n",
- " region: str, use_gpu: bool) -\u003e tfx.dsl.Pipeline:\n",
- " \"\"\"Implements the penguin pipeline with TFX.\"\"\"\n",
- " # Brings data into the pipeline or otherwise joins/converts training data.\n",
- " example_gen = tfx.components.CsvExampleGen(input_base=data_root)\n",
- "\n",
- " # NEW: Configuration for Vertex AI Training.\n",
- " # This dictionary will be passed as `CustomJobSpec`.\n",
- " vertex_job_spec = {\n",
- " 'project': project_id,\n",
- " 'worker_pool_specs': [{\n",
- " 'machine_spec': {\n",
- " 'machine_type': 'n1-standard-4',\n",
- " },\n",
- " 'replica_count': 1,\n",
- " 'container_spec': {\n",
- " 'image_uri': 'gcr.io/tfx-oss-public/tfx:{}'.format(tfx.__version__),\n",
- " },\n",
- " }],\n",
- " }\n",
- " if use_gpu:\n",
- " # See https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec#acceleratortype\n",
- " # for available machine types.\n",
- " vertex_job_spec['worker_pool_specs'][0]['machine_spec'].update({\n",
- " 'accelerator_type': 'NVIDIA_TESLA_K80',\n",
- " 'accelerator_count': 1\n",
- " })\n",
- "\n",
- " # Trains a model using Vertex AI Training.\n",
- " # NEW: We need to specify a Trainer for GCP with related configs.\n",
- " trainer = tfx.extensions.google_cloud_ai_platform.Trainer(\n",
- " module_file=module_file,\n",
- " examples=example_gen.outputs['examples'],\n",
- " train_args=tfx.proto.TrainArgs(num_steps=100),\n",
- " eval_args=tfx.proto.EvalArgs(num_steps=5),\n",
- " custom_config={\n",
- " tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY:\n",
- " True,\n",
- " tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY:\n",
- " region,\n",
- " tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY:\n",
- " vertex_job_spec,\n",
- " 'use_gpu':\n",
- " use_gpu,\n",
- " })\n",
- "\n",
- " # NEW: Configuration for pusher.\n",
- " vertex_serving_spec = {\n",
- " 'project_id': project_id,\n",
- " 'endpoint_name': endpoint_name,\n",
- " # Remaining argument is passed to aiplatform.Model.deploy()\n",
- " # See https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api#deploy_the_model\n",
- " # for the detail.\n",
- " #\n",
- " # Machine type is the compute resource to serve prediction requests.\n",
- " # See https://cloud.google.com/vertex-ai/docs/predictions/configure-compute#machine-types\n",
- " # for available machine types and acccerators.\n",
- " 'machine_type': 'n1-standard-4',\n",
- " }\n",
- "\n",
- " # Vertex AI provides pre-built containers with various configurations for\n",
- " # serving.\n",
- " # See https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers\n",
- " # for available container images.\n",
- " serving_image = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-6:latest'\n",
- " if use_gpu:\n",
- " vertex_serving_spec.update({\n",
- " 'accelerator_type': 'NVIDIA_TESLA_K80',\n",
- " 'accelerator_count': 1\n",
- " })\n",
- " serving_image = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-6:latest'\n",
- "\n",
- " # NEW: Pushes the model to Vertex AI.\n",
- " pusher = tfx.extensions.google_cloud_ai_platform.Pusher(\n",
- " model=trainer.outputs['model'],\n",
- " custom_config={\n",
- " tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY:\n",
- " True,\n",
- " tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY:\n",
- " region,\n",
- " tfx.extensions.google_cloud_ai_platform.VERTEX_CONTAINER_IMAGE_URI_KEY:\n",
- " serving_image,\n",
- " tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY:\n",
- " vertex_serving_spec,\n",
- " })\n",
- "\n",
- " components = [\n",
- " example_gen,\n",
- " trainer,\n",
- " pusher,\n",
- " ]\n",
- "\n",
- " return tfx.dsl.Pipeline(\n",
- " pipeline_name=pipeline_name,\n",
- " pipeline_root=pipeline_root,\n",
- " components=components)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "mJbq07THU2GV"
- },
- "source": [
- "## Run the pipeline on Vertex Pipelines.\n",
- "\n",
- "We will use Vertex Pipelines to run the pipeline as we did in\n",
- "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "fAtfOZTYWJu-"
- },
- "outputs": [],
- "source": [
- "# docs_infra: no_execute\n",
- "import os\n",
- "\n",
- "PIPELINE_DEFINITION_FILE = PIPELINE_NAME + '_pipeline.json'\n",
- "\n",
- "runner = tfx.orchestration.experimental.KubeflowV2DagRunner(\n",
- " config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),\n",
- " output_filename=PIPELINE_DEFINITION_FILE)\n",
- "_ = runner.run(\n",
- " _create_pipeline(\n",
- " pipeline_name=PIPELINE_NAME,\n",
- " pipeline_root=PIPELINE_ROOT,\n",
- " data_root=DATA_ROOT,\n",
- " module_file=os.path.join(MODULE_ROOT, _trainer_module_file),\n",
- " endpoint_name=ENDPOINT_NAME,\n",
- " project_id=GOOGLE_CLOUD_PROJECT,\n",
- " region=GOOGLE_CLOUD_REGION,\n",
- " # We will use CPUs only for now.\n",
- " use_gpu=False))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fWyITYSDd8w4"
- },
- "source": [
- "The generated definition file can be submitted using Google Cloud aiplatform\n",
- "client in `google-cloud-aiplatform` package."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tI71jlEvWMV7"
- },
- "outputs": [],
- "source": [
- "# docs_infra: no_execute\n",
- "from google.cloud import aiplatform\n",
- "from google.cloud.aiplatform import pipeline_jobs\n",
- "import logging\n",
- "logging.getLogger().setLevel(logging.INFO)\n",
- "\n",
- "aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=GOOGLE_CLOUD_REGION)\n",
- "\n",
- "job = pipeline_jobs.PipelineJob(template_path=PIPELINE_DEFINITION_FILE,\n",
- " display_name=PIPELINE_NAME)\n",
- "job.submit()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "L3k9f5IVQXcQ"
- },
- "source": [
- "Now you can visit the link in the output above or visit 'Vertex AI \u003e Pipelines'\n",
- "in [Google Cloud Console](https://console.cloud.google.com/) to see the\n",
- "progress."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "JyN4bM8GOHHt"
- },
- "source": [
- "## Test with a prediction request\n",
- "\n",
- "Once the pipeline completes, you will find a *deployed* model at the one of the\n",
- "endpoints in 'Vertex AI \u003e Endpoints'. We need to know the id of the endpoint to\n",
- "send a prediction request to the new endpoint. This is different from the\n",
- "*endpoint name* we entered above. You can find the id at the [Endpoints page](https://console.cloud.google.com/vertex-ai/endpoints) in\n",
- "`Google Cloud Console`, it looks like a very long number.\n",
- "\n",
- "**Set ENDPOINT_ID below before running it.**\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "51EWzkj8Wdly"
- },
- "outputs": [],
- "source": [
- "ENDPOINT_ID='' # \u003c--- ENTER THIS\n",
- "if not ENDPOINT_ID:\n",
- " from absl import logging\n",
- " logging.error('Please set the endpoint id.')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "x9maWD7pK-yf"
- },
- "source": [
- "We use the same aiplatform client to send a request to the endpoint. We will\n",
- "send a prediction request for Penguin species classification. The input is the four features that we used, and the model will return three values, because our\n",
- "model outputs one value for each species.\n",
- "\n",
- "For example, the following specific example has the largest value at index '2'\n",
- "and will print '2'.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Gdzxst2_OoXH"
- },
- "outputs": [],
- "source": [
- "# docs_infra: no_execute\n",
- "import numpy as np\n",
- "\n",
- "# The AI Platform services require regional API endpoints.\n",
- "client_options = {\n",
- " 'api_endpoint': GOOGLE_CLOUD_REGION + '-aiplatform.googleapis.com'\n",
- " }\n",
- "# Initialize client that will be used to create and send requests.\n",
- "client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)\n",
- "\n",
- "# Set data values for the prediction request.\n",
- "# Our model expects 4 feature inputs and produces 3 output values for each\n",
- "# species. Note that the output is logit value rather than probabilities.\n",
- "# See the model code to understand input / output structure.\n",
- "instances = [{\n",
- " 'culmen_length_mm':[0.71],\n",
- " 'culmen_depth_mm':[0.38],\n",
- " 'flipper_length_mm':[0.98],\n",
- " 'body_mass_g': [0.78],\n",
- "}]\n",
- "\n",
- "endpoint = client.endpoint_path(\n",
- " project=GOOGLE_CLOUD_PROJECT,\n",
- " location=GOOGLE_CLOUD_REGION,\n",
- " endpoint=ENDPOINT_ID,\n",
- ")\n",
- "# Send a prediction request and get response.\n",
- "response = client.predict(endpoint=endpoint, instances=instances)\n",
- "\n",
- "# Uses argmax to find the index of the maximum value.\n",
- "print('species:', np.argmax(response.predictions[0]))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "y5OBJLMLOowD"
- },
- "source": [
- "For detailed information about online prediction, please visit the\n",
- "[Endpoints page](https://console.cloud.google.com/vertex-ai/endpoints) in\n",
- "`Google Cloud Console`. you can find a guide on sending sample requests and\n",
- "links to more resources."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "DgVvdYPzzW6k"
- },
- "source": [
- "## Run the pipeline using a GPU\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Ht0Zpgx3L82g"
- },
- "source": [
- "Vertex AI supports training using various machine types including support for\n",
- "GPUs. See\n",
- "[Machine spec reference](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec#acceleratortype)\n",
- "for available options.\n",
- "\n",
- "We already defined our pipeline to support GPU training. All we need to do is\n",
- "setting `use_gpu` flag to True. Then a pipeline will be created with a machine\n",
- "spec including one NVIDIA_TESLA_K80 and our model training code will use\n",
- "`tf.distribute.MirroredStrategy`.\n",
- "\n",
- "Note that `use_gpu` flag is not a part of the Vertex or TFX API. It is just\n",
- "used to control the training code in this tutorial."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "1TwX6bcsLo_g"
- },
- "outputs": [],
- "source": [
- "# docs_infra: no_execute\n",
- "runner.run(\n",
- " _create_pipeline(\n",
- " pipeline_name=PIPELINE_NAME,\n",
- " pipeline_root=PIPELINE_ROOT,\n",
- " data_root=DATA_ROOT,\n",
- " module_file=os.path.join(MODULE_ROOT, _trainer_module_file),\n",
- " endpoint_name=ENDPOINT_NAME,\n",
- " project_id=GOOGLE_CLOUD_PROJECT,\n",
- " region=GOOGLE_CLOUD_REGION,\n",
- " # Updated: Use GPUs. We will use a NVIDIA_TESLA_K80 and \n",
- " # the model code will use tf.distribute.MirroredStrategy.\n",
- " use_gpu=True))\n",
- "\n",
- "job = pipeline_jobs.PipelineJob(template_path=PIPELINE_DEFINITION_FILE,\n",
- " display_name=PIPELINE_NAME)\n",
- "job.submit()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Xc9XsjlyKoZe"
- },
- "source": [
- "Now you can visit the link in the output above or visit 'Vertex AI \u003e Pipelines'\n",
- "in [Google Cloud Console](https://console.cloud.google.com/) to see the\n",
- "progress."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "M_coFG3sqSJQ"
- },
- "source": [
- "## Cleaning up\n",
- "\n",
- "You have created a Vertex AI Model and Endpoint in this tutorial.\n",
- "Please delete these resources to avoid any unwanted charges by going\n",
- "to [Endpoints](https://console.cloud.google.com/vertex-ai/endpoints) and\n",
- "*undeploying* the model from the endpoint first. Then you can delete the\n",
- "endpoint and the model separately."
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [
- "pknVo1kM2wI2",
- "8F2SRwRLSYGa"
- ],
- "name": "Vertex AI Training and Serving with TFX and Vertex Pipelines",
- "provenance": [],
- "toc_visible": true
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
},
- "nbformat": 4,
- "nbformat_minor": 0
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_VuwrlnvQJ5k"
+ },
+ "source": [
+ "This notebook-based tutorial will create and run a TFX pipeline which trains an\n",
+ "ML model using Vertex AI Training service and publishes it to Vertex AI for serving.\n",
+ "\n",
+ "This notebook is based on the TFX pipeline we built in\n",
+ "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple).\n",
+ "If you have not read that tutorial yet, you should read it before proceeding\n",
+ "with this notebook.\n",
+ "\n",
+ "You can train models on Vertex AI using AutoML, or use custom training. In\n",
+ "custom training, you can select many different machine types to power your\n",
+ "training jobs, enable distributed training, use hyperparameter tuning, and\n",
+ "accelerate with GPUs.\n",
+ "\n",
+ "You can also serve prediction requests by deploying the trained model to Vertex AI\n",
+ "Models and creating an endpoint.\n",
+ "\n",
+ "In this tutorial, we will use Vertex AI Training with custom jobs to train\n",
+ "a model in a TFX pipeline.\n",
+ "We will also deploy the model to serve prediction request using Vertex AI.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "S5Pv2qm3wfpL"
+ },
+ "source": [
+ "This notebook is intended to be run on\n",
+ "[Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) or on\n",
+ "[AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks). If you\n",
+ "are not using one of these, you can simply click \"Run in Google Colab\" button\n",
+ "above.\n",
+ "\n",
+ "## Set up\n",
+ "If you have completed\n",
+ "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple),\n",
+ "you will have a working GCP project and a GCS bucket and that is all we need\n",
+ "for this tutorial. Please read the preliminary tutorial first if you missed it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fwZ0aXisoBFW"
+ },
+ "source": [
+ "### Install python packages"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WC9W_S-bONgl"
+ },
+ "source": [
+ "We will install required Python packages including TFX and KFP to author ML\n",
+ "pipelines and submit jobs to Vertex Pipelines."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "iyQtljP-qPHY"
+ },
+ "outputs": [],
+ "source": [
+ "# Use the latest version of pip.\n",
+ "!pip install --upgrade pip\n",
+ "!pip install --upgrade \"tfx[kfp]<2\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EwT0nov5QO1M"
+ },
+ "source": [
+ "#### Did you restart the runtime?\n",
+ "\n",
+ "If you are using Google Colab, the first time that you run\n",
+ "the cell above, you must restart the runtime by clicking\n",
+ "above \"RESTART RUNTIME\" button or using \"Runtime > Restart\n",
+ "runtime ...\" menu. This is because of the way that Colab\n",
+ "loads packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-CRyIL4LVDlQ"
+ },
+ "source": [
+ "If you are not on Colab, you can restart runtime with following cell."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "KHTSzMygoBF6"
+ },
+ "outputs": [],
+ "source": [
+ "# docs_infra: no_execute\n",
+ "import sys\n",
+ "if not 'google.colab' in sys.modules:\n",
+ " # Automatically restart kernel after installs\n",
+ " import IPython\n",
+ " app = IPython.Application.instance()\n",
+ " app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gckGHdW9iPrq"
+ },
+ "source": [
+ "### Login in to Google for this notebook\n",
+ "If you are running this notebook on Colab, authenticate with your user account:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "kZQA0KrfXCvU"
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "if 'google.colab' in sys.modules:\n",
+ " from google.colab import auth\n",
+ " auth.authenticate_user()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aaqJjbmk6o0o"
+ },
+ "source": [
+ "**If you are on AI Platform Notebooks**, authenticate with Google Cloud before\n",
+ "running the next section, by running\n",
+ "```sh\n",
+ "gcloud auth login\n",
+ "```\n",
+ "**in the Terminal window** (which you can open via **File** > **New** in the\n",
+ "menu). You only need to do this once per notebook instance."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3_SveIKxaENu"
+ },
+ "source": [
+ "Check the package versions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Xd-iP9wEaENu"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "print('TensorFlow version: {}'.format(tf.__version__))\n",
+ "from tfx import v1 as tfx\n",
+ "print('TFX version: {}'.format(tfx.__version__))\n",
+ "import kfp\n",
+ "print('KFP version: {}'.format(kfp.__version__))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aDtLdSkvqPHe"
+ },
+ "source": [
+ "### Set up variables\n",
+ "\n",
+ "We will set up some variables used to customize the pipelines below. Following\n",
+ "information is required:\n",
+ "\n",
+ "* GCP Project id. See\n",
+ "[Identifying your project id](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).\n",
+ "* GCP Region to run pipelines. For more information about the regions that\n",
+ "Vertex Pipelines is available in, see the\n",
+ "[Vertex AI locations guide](https://cloud.google.com/vertex-ai/docs/general/locations#feature-availability).\n",
+ "* Google Cloud Storage Bucket to store pipeline outputs.\n",
+ "\n",
+ "**Enter required values in the cell below before running it**.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "EcUseqJaE2XN"
+ },
+ "outputs": [],
+ "source": [
+ "GOOGLE_CLOUD_PROJECT = '' # <--- ENTER THIS\n",
+ "GOOGLE_CLOUD_REGION = '' # <--- ENTER THIS\n",
+ "GCS_BUCKET_NAME = '' # <--- ENTER THIS\n",
+ "\n",
+ "if not (GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION and GCS_BUCKET_NAME):\n",
+ " from absl import logging\n",
+ " logging.error('Please set all required parameters.')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GAaCPLjgiJrO"
+ },
+ "source": [
+ "Set `gcloud` to use your project."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VkWdxe4TXRHk"
+ },
+ "outputs": [],
+ "source": [
+ "!gcloud config set project {GOOGLE_CLOUD_PROJECT}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "CPN6UL5CazNy"
+ },
+ "outputs": [],
+ "source": [
+ "PIPELINE_NAME = 'penguin-vertex-training'\n",
+ "\n",
+ "# Path to various pipeline artifact.\n",
+ "PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)\n",
+ "\n",
+ "# Paths for users' Python module.\n",
+ "MODULE_ROOT = 'gs://{}/pipeline_module/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)\n",
+ "\n",
+ "# Paths for users' data.\n",
+ "DATA_ROOT = 'gs://{}/data/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)\n",
+ "\n",
+ "# Name of Vertex AI Endpoint.\n",
+ "ENDPOINT_NAME = 'prediction-' + PIPELINE_NAME\n",
+ "\n",
+ "print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8F2SRwRLSYGa"
+ },
+ "source": [
+ "### Prepare example data\n",
+ "We will use the same\n",
+ "[Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)\n",
+ "as\n",
+ "[Simple TFX Pipeline Tutorial](/tutorials/tfx/penguin_simple).\n",
+ "\n",
+ "There are four numeric features in this dataset which were already normalized\n",
+ "to have range [0,1]. We will build a classification model which predicts the\n",
+ "`species` of penguins."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "11J7XiCq6AFP"
+ },
+ "source": [
+ "We need to make our own copy of the dataset. Because TFX ExampleGen reads\n",
+ "inputs from a directory, we need to create a directory and copy dataset to it\n",
+ "on GCS."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4fxMs6u86acP"
+ },
+ "outputs": [],
+ "source": [
+ "!gsutil cp gs://download.tensorflow.org/data/palmer_penguins/penguins_processed.csv {DATA_ROOT}/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ASpoNmxKSQjI"
+ },
+ "source": [
+ "Take a quick look at the CSV file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "-eSz28UDSnlG"
+ },
+ "outputs": [],
+ "source": [
+ "!gsutil cat {DATA_ROOT}/penguins_processed.csv | head"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nH6gizcpSwWV"
+ },
+ "source": [
+ "## Create a pipeline\n",
+ "\n",
+ "Our pipeline will be very similar to the pipeline we created in\n",
+ "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple).\n",
+ "The pipeline will consists of three components, CsvExampleGen, Trainer and\n",
+ "Pusher. But we will use a special Trainer and Pusher component. The Trainer component will move\n",
+ "training workloads to Vertex AI, and the Pusher component will publish the\n",
+ "trained ML model to Vertex AI instead of a filesystem.\n",
+ "\n",
+ "TFX provides a special `Trainer` to submit training jobs to Vertex AI Training\n",
+ "service. All we have to do is use `Trainer` in the extension module\n",
+ "instead of the standard `Trainer` component along with some required GCP\n",
+ "parameters.\n",
+ "\n",
+ "In this tutorial, we will run Vertex AI Training jobs only using CPUs first\n",
+ "and then with a GPU.\n",
+ "\n",
+ "TFX also provides a special `Pusher` to upload the model to *Vertex AI Models*.\n",
+ "`Pusher` will create *Vertex AI Endpoint* resource to serve online\n",
+ "predictions, too. See\n",
+ "[Vertex AI documentation](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions)\n",
+ "to learn more about online predictions provided by Vertex AI."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lOjDv93eS5xV"
+ },
+ "source": [
+ "### Write model code.\n",
+ "\n",
+ "The model itself is almost similar to the model in\n",
+ "[Simple TFX Pipeline Tutorial](/tutorials/tfx/penguin_simple).\n",
+ "\n",
+ "We will add `_get_distribution_strategy()` function which creates a\n",
+ "[TensorFlow distribution strategy](https://www.tensorflow.org/guide/distributed_training)\n",
+ "and it is used in `run_fn` to use MirroredStrategy if GPU is available."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "aES7Hv5QTDK3"
+ },
+ "outputs": [],
+ "source": [
+ "_trainer_module_file = 'penguin_trainer.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Gnc67uQNTDfW"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_trainer_module_file}\n",
+ "\n",
+ "# Copied from https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple and\n",
+ "# slightly modified run_fn() to add distribution_strategy.\n",
+ "\n",
+ "from typing import List\n",
+ "from absl import logging\n",
+ "import tensorflow as tf\n",
+ "from tensorflow import keras\n",
+ "from tensorflow_metadata.proto.v0 import schema_pb2\n",
+ "from tensorflow_transform.tf_metadata import schema_utils\n",
+ "\n",
+ "from tfx import v1 as tfx\n",
+ "from tfx_bsl.public import tfxio\n",
+ "\n",
+ "_FEATURE_KEYS = [\n",
+ " 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'\n",
+ "]\n",
+ "_LABEL_KEY = 'species'\n",
+ "\n",
+ "_TRAIN_BATCH_SIZE = 20\n",
+ "_EVAL_BATCH_SIZE = 10\n",
+ "\n",
+ "# Since we're not generating or creating a schema, we will instead create\n",
+ "# a feature spec. Since there are a fairly small number of features this is\n",
+ "# manageable for this dataset.\n",
+ "_FEATURE_SPEC = {\n",
+ " **{\n",
+ " feature: tf.io.FixedLenFeature(shape=[1], dtype=tf.float32)\n",
+ " for feature in _FEATURE_KEYS\n",
+ " }, _LABEL_KEY: tf.io.FixedLenFeature(shape=[1], dtype=tf.int64)\n",
+ "}\n",
+ "\n",
+ "\n",
+ "def _input_fn(file_pattern: List[str],\n",
+ " data_accessor: tfx.components.DataAccessor,\n",
+ " schema: schema_pb2.Schema,\n",
+ " batch_size: int) -> tf.data.Dataset:\n",
+ " \"\"\"Generates features and label for training.\n",
+ "\n",
+ " Args:\n",
+ " file_pattern: List of paths or patterns of input tfrecord files.\n",
+ " data_accessor: DataAccessor for converting input to RecordBatch.\n",
+ " schema: schema of the input data.\n",
+ " batch_size: representing the number of consecutive elements of returned\n",
+ " dataset to combine in a single batch\n",
+ "\n",
+ " Returns:\n",
+ " A dataset that contains (features, indices) tuple where features is a\n",
+ " dictionary of Tensors, and indices is a single Tensor of label indices.\n",
+ " \"\"\"\n",
+ " return data_accessor.tf_dataset_factory(\n",
+ " file_pattern,\n",
+ " tfxio.TensorFlowDatasetOptions(\n",
+ " batch_size=batch_size, label_key=_LABEL_KEY),\n",
+ " schema=schema).repeat()\n",
+ "\n",
+ "\n",
+ "def _make_keras_model() -> tf.keras.Model:\n",
+ " \"\"\"Creates a DNN Keras model for classifying penguin data.\n",
+ "\n",
+ " Returns:\n",
+ " A Keras Model.\n",
+ " \"\"\"\n",
+ " # The model below is built with Functional API, please refer to\n",
+ " # https://www.tensorflow.org/guide/keras/overview for all API options.\n",
+ " inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]\n",
+ " d = keras.layers.concatenate(inputs)\n",
+ " for _ in range(2):\n",
+ " d = keras.layers.Dense(8, activation='relu')(d)\n",
+ " outputs = keras.layers.Dense(3)(d)\n",
+ "\n",
+ " model = keras.Model(inputs=inputs, outputs=outputs)\n",
+ " model.compile(\n",
+ " optimizer=keras.optimizers.Adam(1e-2),\n",
+ " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n",
+ " metrics=[keras.metrics.SparseCategoricalAccuracy()])\n",
+ "\n",
+ " model.summary(print_fn=logging.info)\n",
+ " return model\n",
+ "\n",
+ "\n",
+ "# NEW: Read `use_gpu` from the custom_config of the Trainer.\n",
+ "# if it uses GPU, enable MirroredStrategy.\n",
+ "def _get_distribution_strategy(fn_args: tfx.components.FnArgs):\n",
+ " if fn_args.custom_config.get('use_gpu', False):\n",
+ " logging.info('Using MirroredStrategy with one GPU.')\n",
+ " return tf.distribute.MirroredStrategy(devices=['device:GPU:0'])\n",
+ " return None\n",
+ "\n",
+ "\n",
+ "# TFX Trainer will call this function.\n",
+ "def run_fn(fn_args: tfx.components.FnArgs):\n",
+ " \"\"\"Train the model based on given args.\n",
+ "\n",
+ " Args:\n",
+ " fn_args: Holds args used to train the model as name/value pairs.\n",
+ " \"\"\"\n",
+ "\n",
+ " # This schema is usually either an output of SchemaGen or a manually-curated\n",
+ " # version provided by pipeline author. A schema can also derived from TFT\n",
+ " # graph if a Transform component is used. In the case when either is missing,\n",
+ " # `schema_from_feature_spec` could be used to generate schema from very simple\n",
+ " # feature_spec, but the schema returned would be very primitive.\n",
+ " schema = schema_utils.schema_from_feature_spec(_FEATURE_SPEC)\n",
+ "\n",
+ " train_dataset = _input_fn(\n",
+ " fn_args.train_files,\n",
+ " fn_args.data_accessor,\n",
+ " schema,\n",
+ " batch_size=_TRAIN_BATCH_SIZE)\n",
+ " eval_dataset = _input_fn(\n",
+ " fn_args.eval_files,\n",
+ " fn_args.data_accessor,\n",
+ " schema,\n",
+ " batch_size=_EVAL_BATCH_SIZE)\n",
+ "\n",
+ " # NEW: If we have a distribution strategy, build a model in a strategy scope.\n",
+ " strategy = _get_distribution_strategy(fn_args)\n",
+ " if strategy is None:\n",
+ " model = _make_keras_model()\n",
+ " else:\n",
+ " with strategy.scope():\n",
+ " model = _make_keras_model()\n",
+ "\n",
+ " model.fit(\n",
+ " train_dataset,\n",
+ " steps_per_epoch=fn_args.train_steps,\n",
+ " validation_data=eval_dataset,\n",
+ " validation_steps=fn_args.eval_steps)\n",
+ "\n",
+ " # The result of the training should be saved in `fn_args.serving_model_dir`\n",
+ " # directory.\n",
+ " model.save(fn_args.serving_model_dir, save_format='tf')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-LsYx8MpYvPv"
+ },
+ "source": [
+ "Copy the module file to GCS which can be accessed from the pipeline components.\n",
+ "\n",
+ "Otherwise, you might want to build a container image including the module file\n",
+ "and use the image to run the pipeline and AI Platform Training jobs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rMMs5wuNYAbc"
+ },
+ "outputs": [],
+ "source": [
+ "!gsutil cp {_trainer_module_file} {MODULE_ROOT}/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "w3OkNz3gTLwM"
+ },
+ "source": [
+ "### Write a pipeline definition\n",
+ "\n",
+ "We will define a function to create a TFX pipeline. It has the same three\n",
+ "Components as in\n",
+ "[Simple TFX Pipeline Tutorial](/tutorials/tfx/penguin_simple),\n",
+ "but we use a `Trainer` and `Pusher` component in the GCP extension module.\n",
+ "\n",
+ "`tfx.extensions.google_cloud_ai_platform.Trainer` behaves like a regular\n",
+ "`Trainer`, but it just moves the computation for the model training to cloud.\n",
+ "It launches a custom job in Vertex AI Training service and the trainer\n",
+ "component in the orchestration system will just wait until the Vertex AI\n",
+ "Training job completes.\n",
+ "\n",
+ "`tfx.extensions.google_cloud_ai_platform.Pusher` creates a Vertex AI Model and a Vertex AI Endpoint using the\n",
+ "trained model.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "M49yYVNBTPd4"
+ },
+ "outputs": [],
+ "source": [
+ "def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,\n",
+ " module_file: str, endpoint_name: str, project_id: str,\n",
+ " region: str, use_gpu: bool) -> tfx.dsl.Pipeline:\n",
+ " \"\"\"Implements the penguin pipeline with TFX.\"\"\"\n",
+ " # Brings data into the pipeline or otherwise joins/converts training data.\n",
+ " example_gen = tfx.components.CsvExampleGen(input_base=data_root)\n",
+ "\n",
+ " # NEW: Configuration for Vertex AI Training.\n",
+ " # This dictionary will be passed as `CustomJobSpec`.\n",
+ " vertex_job_spec = {\n",
+ " 'project': project_id,\n",
+ " 'worker_pool_specs': [{\n",
+ " 'machine_spec': {\n",
+ " 'machine_type': 'n1-standard-4',\n",
+ " },\n",
+ " 'replica_count': 1,\n",
+ " 'container_spec': {\n",
+ " 'image_uri': 'gcr.io/tfx-oss-public/tfx:{}'.format(tfx.__version__),\n",
+ " },\n",
+ " }],\n",
+ " }\n",
+ " if use_gpu:\n",
+ " # See https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec#acceleratortype\n",
+ " # for available machine types.\n",
+ " vertex_job_spec['worker_pool_specs'][0]['machine_spec'].update({\n",
+ " 'accelerator_type': 'NVIDIA_TESLA_K80',\n",
+ " 'accelerator_count': 1\n",
+ " })\n",
+ "\n",
+ " # Trains a model using Vertex AI Training.\n",
+ " # NEW: We need to specify a Trainer for GCP with related configs.\n",
+ " trainer = tfx.extensions.google_cloud_ai_platform.Trainer(\n",
+ " module_file=module_file,\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " train_args=tfx.proto.TrainArgs(num_steps=100),\n",
+ " eval_args=tfx.proto.EvalArgs(num_steps=5),\n",
+ " custom_config={\n",
+ " tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY:\n",
+ " True,\n",
+ " tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY:\n",
+ " region,\n",
+ " tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY:\n",
+ " vertex_job_spec,\n",
+ " 'use_gpu':\n",
+ " use_gpu,\n",
+ " })\n",
+ "\n",
+ " # NEW: Configuration for pusher.\n",
+ " vertex_serving_spec = {\n",
+ " 'project_id': project_id,\n",
+ " 'endpoint_name': endpoint_name,\n",
+ " # Remaining argument is passed to aiplatform.Model.deploy()\n",
+ " # See https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api#deploy_the_model\n",
+ " # for the detail.\n",
+ " #\n",
+ " # Machine type is the compute resource to serve prediction requests.\n",
+ " # See https://cloud.google.com/vertex-ai/docs/predictions/configure-compute#machine-types\n",
+ " # for available machine types and acccerators.\n",
+ " 'machine_type': 'n1-standard-4',\n",
+ " }\n",
+ "\n",
+ " # Vertex AI provides pre-built containers with various configurations for\n",
+ " # serving.\n",
+ " # See https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers\n",
+ " # for available container images.\n",
+ " serving_image = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-6:latest'\n",
+ " if use_gpu:\n",
+ " vertex_serving_spec.update({\n",
+ " 'accelerator_type': 'NVIDIA_TESLA_K80',\n",
+ " 'accelerator_count': 1\n",
+ " })\n",
+ " serving_image = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-6:latest'\n",
+ "\n",
+ " # NEW: Pushes the model to Vertex AI.\n",
+ " pusher = tfx.extensions.google_cloud_ai_platform.Pusher(\n",
+ " model=trainer.outputs['model'],\n",
+ " custom_config={\n",
+ " tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY:\n",
+ " True,\n",
+ " tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY:\n",
+ " region,\n",
+ " tfx.extensions.google_cloud_ai_platform.VERTEX_CONTAINER_IMAGE_URI_KEY:\n",
+ " serving_image,\n",
+ " tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY:\n",
+ " vertex_serving_spec,\n",
+ " })\n",
+ "\n",
+ " components = [\n",
+ " example_gen,\n",
+ " trainer,\n",
+ " pusher,\n",
+ " ]\n",
+ "\n",
+ " return tfx.dsl.Pipeline(\n",
+ " pipeline_name=pipeline_name,\n",
+ " pipeline_root=pipeline_root,\n",
+ " components=components)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mJbq07THU2GV"
+ },
+ "source": [
+ "## Run the pipeline on Vertex Pipelines.\n",
+ "\n",
+ "We will use Vertex Pipelines to run the pipeline as we did in\n",
+ "[Simple TFX Pipeline for Vertex Pipelines Tutorial](/tutorials/tfx/gcp/vertex_pipelines_simple)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "fAtfOZTYWJu-"
+ },
+ "outputs": [],
+ "source": [
+ "# docs_infra: no_execute\n",
+ "import os\n",
+ "\n",
+ "PIPELINE_DEFINITION_FILE = PIPELINE_NAME + '_pipeline.json'\n",
+ "\n",
+ "runner = tfx.orchestration.experimental.KubeflowV2DagRunner(\n",
+ " config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),\n",
+ " output_filename=PIPELINE_DEFINITION_FILE)\n",
+ "_ = runner.run(\n",
+ " _create_pipeline(\n",
+ " pipeline_name=PIPELINE_NAME,\n",
+ " pipeline_root=PIPELINE_ROOT,\n",
+ " data_root=DATA_ROOT,\n",
+ " module_file=os.path.join(MODULE_ROOT, _trainer_module_file),\n",
+ " endpoint_name=ENDPOINT_NAME,\n",
+ " project_id=GOOGLE_CLOUD_PROJECT,\n",
+ " region=GOOGLE_CLOUD_REGION,\n",
+ " # We will use CPUs only for now.\n",
+ " use_gpu=False))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fWyITYSDd8w4"
+ },
+ "source": [
+ "The generated definition file can be submitted using Google Cloud aiplatform\n",
+ "client in `google-cloud-aiplatform` package."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tI71jlEvWMV7"
+ },
+ "outputs": [],
+ "source": [
+ "# docs_infra: no_execute\n",
+ "from google.cloud import aiplatform\n",
+ "from google.cloud.aiplatform import pipeline_jobs\n",
+ "import logging\n",
+ "logging.getLogger().setLevel(logging.INFO)\n",
+ "\n",
+ "aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=GOOGLE_CLOUD_REGION)\n",
+ "\n",
+ "job = pipeline_jobs.PipelineJob(template_path=PIPELINE_DEFINITION_FILE,\n",
+ " display_name=PIPELINE_NAME)\n",
+ "job.submit()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L3k9f5IVQXcQ"
+ },
+ "source": [
+ "Now you can visit the link in the output above or visit 'Vertex AI > Pipelines'\n",
+ "in [Google Cloud Console](https://console.cloud.google.com/) to see the\n",
+ "progress."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JyN4bM8GOHHt"
+ },
+ "source": [
+ "## Test with a prediction request\n",
+ "\n",
+ "Once the pipeline completes, you will find a *deployed* model at the one of the\n",
+ "endpoints in 'Vertex AI > Endpoints'. We need to know the id of the endpoint to\n",
+ "send a prediction request to the new endpoint. This is different from the\n",
+ "*endpoint name* we entered above. You can find the id at the [Endpoints page](https://console.cloud.google.com/vertex-ai/endpoints) in\n",
+ "`Google Cloud Console`, it looks like a very long number.\n",
+ "\n",
+ "**Set ENDPOINT_ID below before running it.**\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "51EWzkj8Wdly"
+ },
+ "outputs": [],
+ "source": [
+ "ENDPOINT_ID='' # <--- ENTER THIS\n",
+ "if not ENDPOINT_ID:\n",
+ " from absl import logging\n",
+ " logging.error('Please set the endpoint id.')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x9maWD7pK-yf"
+ },
+ "source": [
+ "We use the same aiplatform client to send a request to the endpoint. We will\n",
+ "send a prediction request for Penguin species classification. The input is the four features that we used, and the model will return three values, because our\n",
+ "model outputs one value for each species.\n",
+ "\n",
+ "For example, the following specific example has the largest value at index '2'\n",
+ "and will print '2'.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Gdzxst2_OoXH"
+ },
+ "outputs": [],
+ "source": [
+ "# docs_infra: no_execute\n",
+ "import numpy as np\n",
+ "\n",
+ "# The AI Platform services require regional API endpoints.\n",
+ "client_options = {\n",
+ " 'api_endpoint': GOOGLE_CLOUD_REGION + '-aiplatform.googleapis.com'\n",
+ " }\n",
+ "# Initialize client that will be used to create and send requests.\n",
+ "client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)\n",
+ "\n",
+ "# Set data values for the prediction request.\n",
+ "# Our model expects 4 feature inputs and produces 3 output values for each\n",
+ "# species. Note that the output is logit value rather than probabilities.\n",
+ "# See the model code to understand input / output structure.\n",
+ "instances = [{\n",
+ " 'culmen_length_mm':[0.71],\n",
+ " 'culmen_depth_mm':[0.38],\n",
+ " 'flipper_length_mm':[0.98],\n",
+ " 'body_mass_g': [0.78],\n",
+ "}]\n",
+ "\n",
+ "endpoint = client.endpoint_path(\n",
+ " project=GOOGLE_CLOUD_PROJECT,\n",
+ " location=GOOGLE_CLOUD_REGION,\n",
+ " endpoint=ENDPOINT_ID,\n",
+ ")\n",
+ "# Send a prediction request and get response.\n",
+ "response = client.predict(endpoint=endpoint, instances=instances)\n",
+ "\n",
+ "# Uses argmax to find the index of the maximum value.\n",
+ "print('species:', np.argmax(response.predictions[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "y5OBJLMLOowD"
+ },
+ "source": [
+ "For detailed information about online prediction, please visit the\n",
+ "[Endpoints page](https://console.cloud.google.com/vertex-ai/endpoints) in\n",
+ "`Google Cloud Console`. you can find a guide on sending sample requests and\n",
+ "links to more resources."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DgVvdYPzzW6k"
+ },
+ "source": [
+ "## Run the pipeline using a GPU\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ht0Zpgx3L82g"
+ },
+ "source": [
+ "Vertex AI supports training using various machine types including support for\n",
+ "GPUs. See\n",
+ "[Machine spec reference](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec#acceleratortype)\n",
+ "for available options.\n",
+ "\n",
+ "We already defined our pipeline to support GPU training. All we need to do is\n",
+ "setting `use_gpu` flag to True. Then a pipeline will be created with a machine\n",
+ "spec including one NVIDIA_TESLA_K80 and our model training code will use\n",
+ "`tf.distribute.MirroredStrategy`.\n",
+ "\n",
+ "Note that `use_gpu` flag is not a part of the Vertex or TFX API. It is just\n",
+ "used to control the training code in this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "1TwX6bcsLo_g"
+ },
+ "outputs": [],
+ "source": [
+ "# docs_infra: no_execute\n",
+ "runner.run(\n",
+ " _create_pipeline(\n",
+ " pipeline_name=PIPELINE_NAME,\n",
+ " pipeline_root=PIPELINE_ROOT,\n",
+ " data_root=DATA_ROOT,\n",
+ " module_file=os.path.join(MODULE_ROOT, _trainer_module_file),\n",
+ " endpoint_name=ENDPOINT_NAME,\n",
+ " project_id=GOOGLE_CLOUD_PROJECT,\n",
+ " region=GOOGLE_CLOUD_REGION,\n",
+ " # Updated: Use GPUs. We will use a NVIDIA_TESLA_K80 and \n",
+ " # the model code will use tf.distribute.MirroredStrategy.\n",
+ " use_gpu=True))\n",
+ "\n",
+ "job = pipeline_jobs.PipelineJob(template_path=PIPELINE_DEFINITION_FILE,\n",
+ " display_name=PIPELINE_NAME)\n",
+ "job.submit()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Xc9XsjlyKoZe"
+ },
+ "source": [
+ "Now you can visit the link in the output above or visit 'Vertex AI > Pipelines'\n",
+ "in [Google Cloud Console](https://console.cloud.google.com/) to see the\n",
+ "progress."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "M_coFG3sqSJQ"
+ },
+ "source": [
+ "## Cleaning up\n",
+ "\n",
+ "You have created a Vertex AI Model and Endpoint in this tutorial.\n",
+ "Please delete these resources to avoid any unwanted charges by going\n",
+ "to [Endpoints](https://console.cloud.google.com/vertex-ai/endpoints) and\n",
+ "*undeploying* the model from the endpoint first. Then you can delete the\n",
+ "endpoint and the model separately."
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [
+ "pknVo1kM2wI2",
+ "8F2SRwRLSYGa"
+ ],
+ "name": "Vertex AI Training and Serving with TFX and Vertex Pipelines",
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}
diff --git a/docs/tutorials/tfx/gpt2_finetuning_and_conversion.ipynb b/docs/tutorials/tfx/gpt2_finetuning_and_conversion.ipynb
index 688268512f..cb4eaa717a 100644
--- a/docs/tutorials/tfx/gpt2_finetuning_and_conversion.ipynb
+++ b/docs/tutorials/tfx/gpt2_finetuning_and_conversion.ipynb
@@ -1,69 +1,49 @@
{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "provenance": [],
- "collapsed_sections": [
- "iwgnKVaUuozP"
- ],
- "gpuType": "T4",
- "toc_visible": true
- },
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3"
- },
- "language_info": {
- "name": "python"
- },
- "accelerator": "GPU"
- },
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "YtDTm6wbIbpy"
- },
- "source": [
- "##### Copyright 2024 The TensorFlow Authors."
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Licensed under the Apache License, Version 2.0 (the \"License\");"
- ],
- "metadata": {
- "id": "iwgnKVaUuozP"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "kBFkQLk1In7I"
- },
- "outputs": [],
- "source": [
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "uf3QpfdiIl7O"
- },
- "source": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YtDTm6wbIbpy"
+ },
+ "source": [
+ "##### Copyright 2024 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iwgnKVaUuozP"
+ },
+ "source": [
+ "# Licensed under the Apache License, Version 2.0 (the \"License\");"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "kBFkQLk1In7I"
+ },
+ "outputs": [],
+ "source": [
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uf3QpfdiIl7O"
+ },
+ "source": [
"Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n",
"\n",
"\n",
@@ -100,1446 +80,1466 @@
" \n",
"
"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HU9YYythm0dx"
- },
- "source": [
- "### Why is this pipeline useful?\n",
- "\n",
- "TFX pipelines provide a powerful and structured approach to building and managing machine learning workflows, particularly those involving large language models. They offer significant advantages over traditional Python code, including:\n",
- "\n",
- "1. Enhanced Reproducibility: TFX pipelines ensure consistent results by capturing all steps and dependencies, eliminating the inconsistencies often associated with manual workflows.\n",
- "\n",
- "2. Scalability and Modularity: TFX allows for breaking down complex workflows into manageable, reusable components, promoting code organization.\n",
- "\n",
- "3. Streamlined Fine-Tuning and Conversion: The pipeline structure streamlines the fine-tuning and conversion processes of large language models, significantly reducing manual effort and time.\n",
- "\n",
- "4. Comprehensive Lineage Tracking: Through metadata tracking, TFX pipelines provide a clear understanding of data and model provenance, making debugging, auditing, and performance analysis much easier and more efficient.\n",
- "\n",
- "By leveraging the benefits of TFX pipelines, organizations can effectively manage the complexity of large language model development and deployment, achieving greater efficiency and control over their machine learning processes.\n",
- "\n",
- "### Note\n",
- "*GPT-2 is used here only to demonstrate the end-to-end process; the techniques and tooling introduced in this codelab are potentially transferrable to other generative language models such as Google T5.*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2WgJ8Z8gJB0s"
- },
- "source": [
- "## Before You Begin\n",
- "\n",
- "Colab offers different kinds of runtimes. Make sure to go to **Runtime -\u003e Change runtime type** and choose the GPU Hardware Accelerator runtime since you will finetune the GPT-2 model.\n",
- "\n",
- "**This tutorial's interactive pipeline is designed to function seamlessly with free Colab GPUs. However, for users opting to run the pipeline using the LocalDagRunner orchestrator (code provided at the end of this tutorial), a more substantial amount of GPU memory is required. Therefore, Colab Pro or a local machine equipped with a higher-capacity GPU is recommended for this approach.**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-sj3HvNcJEgC"
- },
- "source": [
- "## Set Up\n",
- "\n",
- "We first install required python packages."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "73c9sPckJFSi"
- },
- "source": [
- "### Upgrade Pip\n",
- "To avoid upgrading Pip in a system when running locally, check to make sure that we are running in Colab. Local systems can of course be upgraded separately."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "45pIxa6afWOf",
- "tags": []
- },
- "outputs": [],
- "source": [
- "try:\n",
- " import colab\n",
- " !pip install --upgrade pip\n",
- "\n",
- "except:\n",
- " pass"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "yIf40NdqJLAH"
- },
- "source": [
- "### Install TFX, Keras 3, KerasNLP and required Libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "A6mBN4dzfct7",
- "tags": []
- },
- "outputs": [],
- "source": [
- "!pip install -q tfx tensorflow-text more_itertools tensorflow_datasets\n",
- "!pip install -q --upgrade keras-nlp\n",
- "!pip install -q --upgrade keras"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "*Note: pip's dependency resolver errors can be ignored. The required packages for this tutorial works as expected.*"
- ],
- "metadata": {
- "id": "KnyILJ-k3NAy"
- }
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "V0tnFDm6JRq_",
- "tags": []
- },
- "source": [
- "### Did you restart the runtime?\n",
- "\n",
- "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime by clicking above \"RESTART SESSION\" button or using `\"Runtime \u003e Restart session\"` menu. This is because of the way that Colab loads packages.\n",
- "\n",
- "Let's check the TensorFlow, Keras, Keras-nlp and TFX library versions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Hf5FbRzcfpMg",
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "os.environ[\"KERAS_BACKEND\"] = \"tensorflow\"\n",
- "\n",
- "import tensorflow as tf\n",
- "print('TensorFlow version: {}'.format(tf.__version__))\n",
- "from tfx import v1 as tfx\n",
- "print('TFX version: {}'.format(tfx.__version__))\n",
- "import keras\n",
- "print('Keras version: {}'.format(keras.__version__))\n",
- "import keras_nlp\n",
- "print('Keras NLP version: {}'.format(keras_nlp.__version__))\n",
- "\n",
- "keras.mixed_precision.set_global_policy(\"mixed_float16\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ng1a9cCAtepl"
- },
- "source": [
- "### Using TFX Interactive Context"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "k7ikXCc7v7Rh"
- },
- "source": [
- "An interactive context is used to provide global context when running a TFX pipeline in a notebook without using a runner or orchestrator such as Apache Airflow or Kubeflow. This style of development is only useful when developing the code for a pipeline, and cannot currently be used to deploy a working pipeline to production."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "TEge2nYDfwaM",
- "tags": []
- },
- "outputs": [],
- "source": [
- "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
- "context = InteractiveContext()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GF6Kk3MLxxCC"
- },
- "source": [
- "## Pipeline Overview\n",
- "\n",
- "Below are the components that this pipeline follows.\n",
- "\n",
- "* Custom Artifacts are artifacts that we have created for this pipeline. **Artifacts** are data that is produced by a component or consumed by a component. Artifacts are stored in a system for managing the storage and versioning of artifacts called MLMD.\n",
- "\n",
- "* **Components** are defined as the implementation of an ML task that you can use as a step in your pipeline\n",
- "* Aside from artifacts, **Parameters** are passed into the components to specify an argument.\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "BIBO-ueGVVHa"
- },
- "source": [
- "## ExampleGen\n",
- "We create a custom ExampleGen component which we use to load a TensorFlow Datasets (TFDS) dataset. This uses a custom executor in a FileBasedExampleGen.\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "pgvIaoAmXFVp",
- "tags": []
- },
- "outputs": [],
- "source": [
- "from typing import Any, Dict, List, Text\n",
- "import tensorflow_datasets as tfds\n",
- "import apache_beam as beam\n",
- "import json\n",
- "from tfx.components.example_gen.base_example_gen_executor import BaseExampleGenExecutor\n",
- "from tfx.components.example_gen.component import FileBasedExampleGen\n",
- "from tfx.components.example_gen import utils\n",
- "from tfx.dsl.components.base import executor_spec\n",
- "import os\n",
- "import pprint\n",
- "pp = pprint.PrettyPrinter()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Cjd9Z6SpVRCE",
- "tags": []
- },
- "outputs": [],
- "source": [
- "@beam.ptransform_fn\n",
- "@beam.typehints.with_input_types(beam.Pipeline)\n",
- "@beam.typehints.with_output_types(tf.train.Example)\n",
- "def _TFDatasetToExample(\n",
- " pipeline: beam.Pipeline,\n",
- " exec_properties: Dict[str, Any],\n",
- " split_pattern: str\n",
- " ) -\u003e beam.pvalue.PCollection:\n",
- " \"\"\"Read a TensorFlow Dataset and create tf.Examples\"\"\"\n",
- " custom_config = json.loads(exec_properties['custom_config'])\n",
- " dataset_name = custom_config['dataset']\n",
- " split_name = custom_config['split']\n",
- "\n",
- " builder = tfds.builder(dataset_name)\n",
- " builder.download_and_prepare()\n",
- "\n",
- " return (pipeline\n",
- " | 'MakeExamples' \u003e\u003e tfds.beam.ReadFromTFDS(builder, split=split_name)\n",
- " | 'AsNumpy' \u003e\u003e beam.Map(tfds.as_numpy)\n",
- " | 'ToDict' \u003e\u003e beam.Map(dict)\n",
- " | 'ToTFExample' \u003e\u003e beam.Map(utils.dict_to_example)\n",
- " )\n",
- "\n",
- "class TFDSExecutor(BaseExampleGenExecutor):\n",
- " def GetInputSourceToExamplePTransform(self) -\u003e beam.PTransform:\n",
- " \"\"\"Returns PTransform for TF Dataset to TF examples.\"\"\"\n",
- " return _TFDatasetToExample"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2D159hAzJgK2"
- },
- "source": [
- "For this demonstration, we're using a subset of the IMDb reviews dataset, representing 20% of the total data. This allows for a more manageable training process. You can modify the \"custom_config\" settings to experiment with larger amounts of data, up to the full dataset, depending on your computational resources."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "nNDu1ECBXuvI",
- "tags": []
- },
- "outputs": [],
- "source": [
- "example_gen = FileBasedExampleGen(\n",
- " input_base='dummy',\n",
- " custom_config={'dataset':'imdb_reviews', 'split':'train[:20%]'},\n",
- " custom_executor_spec=executor_spec.BeamExecutorSpec(TFDSExecutor))\n",
- "context.run(example_gen, enable_cache=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "74JGpvIgJgK2"
- },
- "source": [
- "We've developed a handy utility for examining datasets composed of TFExamples. When used with the reviews dataset, this tool returns a clear dictionary containing both the text and the corresponding label."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "GA8VMXKogXxB",
- "tags": []
- },
- "outputs": [],
- "source": [
- "def inspect_examples(component,\n",
- " channel_name='examples',\n",
- " split_name='train',\n",
- " num_examples=1):\n",
- " # Get the URI of the output artifact, which is a directory\n",
- " full_split_name = 'Split-{}'.format(split_name)\n",
- " print('channel_name: {}, split_name: {} (\\\"{}\\\"), num_examples: {}\\n'.format(\n",
- " channel_name, split_name, full_split_name, num_examples))\n",
- " train_uri = os.path.join(\n",
- " component.outputs[channel_name].get()[0].uri, full_split_name)\n",
- " print('train_uri: {}'.format(train_uri))\n",
- "\n",
- " # Get the list of files in this directory (all compressed TFRecord files)\n",
- " tfrecord_filenames = [os.path.join(train_uri, name)\n",
- " for name in os.listdir(train_uri)]\n",
- "\n",
- " # Create a `TFRecordDataset` to read these files\n",
- " dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- "\n",
- " # Iterate over the records and print them\n",
- " print()\n",
- " for tfrecord in dataset.take(num_examples):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "rcUvtz5egaIy",
- "tags": []
- },
- "outputs": [],
- "source": [
- "inspect_examples(example_gen, num_examples=1, split_name='eval')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "gVmx7JHK8RkO"
- },
- "source": [
- "## StatisticsGen\n",
- "\n",
- "`StatisticsGen` component computes statistics over your dataset for data analysis, such as the number of examples, the number of features, and the data types of the features. It uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library. `StatisticsGen` takes as input the dataset we just ingested using `ExampleGen`.\n",
- "\n",
- "*Note that the statistics generator is appropriate for tabular data, and therefore, text dataset for this LLM tutorial may not be the optimal dataset for the analysis with statistics generator.*"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "TzeNGNEnyq_d",
- "tags": []
- },
- "outputs": [],
- "source": [
- "from tfx.components import StatisticsGen"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "xWWl7LeRKsXA",
- "tags": []
- },
- "outputs": [],
- "source": [
- "statistics_gen = tfx.components.StatisticsGen(\n",
- " examples=example_gen.outputs['examples'], exclude_splits=['eval']\n",
- ")\n",
- "context.run(statistics_gen, enable_cache=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "LnWKjMyIVVB7"
- },
- "outputs": [],
- "source": [
- "context.show(statistics_gen.outputs['statistics'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oqXFJyoO9O8-"
- },
- "source": [
- "## SchemaGen\n",
- "\n",
- "The `SchemaGen` component generates a schema based on your data statistics. (A schema defines the expected bounds, types, and properties of the features in your dataset.) It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "Note: The generated schema is best-effort and only tries to infer basic properties of the data. It is expected that you review and modify it as needed.\n",
- "\n",
- "`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "PpPFaV6tX5wQ",
- "tags": []
- },
- "outputs": [],
- "source": [
- "schema_gen = tfx.components.SchemaGen(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " infer_feature_shape=False,\n",
- " exclude_splits=['eval'],\n",
- ")\n",
- "context.run(schema_gen, enable_cache=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "H6DNNUi3YAmo",
- "tags": []
- },
- "outputs": [],
- "source": [
- "context.show(schema_gen.outputs['schema'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GDdpADUb9VJR"
- },
- "source": [
- "## ExampleValidator\n",
- "\n",
- "The `ExampleValidator` component detects anomalies in your data, based on the expectations defined by the schema. It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
- "\n",
- "`ExampleValidator` will take as input the statistics from `StatisticsGen`, and the schema from `SchemaGen`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "S_F5pLZ7YdZg"
- },
- "outputs": [],
- "source": [
- "example_validator = tfx.components.ExampleValidator(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " exclude_splits=['eval'],\n",
- ")\n",
- "context.run(example_validator, enable_cache=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "After `ExampleValidator` finishes running, we can visualize the anomalies as a table."
- ],
- "metadata": {
- "id": "DgiXSTRawolF"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "3eAHpc2UYfk_"
- },
- "outputs": [],
- "source": [
- "context.show(example_validator.outputs['anomalies'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "7H6fecGTiFmN"
- },
- "source": [
- "## Transform\n",
- "\n",
- "For a structured and repeatable design of a TFX pipeline we will need a scalable approach to feature engineering. The `Transform` component performs feature engineering for both training and serving. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library.\n",
- "\n",
- "\n",
- "The Transform component uses a module file to supply user code for the feature engineering what we want to do, so our first step is to create that module file. We will only be working with the summary field.\n",
- "\n",
- "**Note:**\n",
- "*The %%writefile {_movies_transform_module_file} cell magic below creates and writes the contents of that cell to a file on the notebook server where this notebook is running (for example, the Colab VM). When doing this outside of a notebook you would just create a Python file.*"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "22TBUtG9ME9N"
- },
- "outputs": [],
- "source": [
- "import os\n",
- "if not os.path.exists(\"modules\"):\n",
- " os.mkdir(\"modules\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "teaCGLgfnjw_"
- },
- "outputs": [],
- "source": [
- "_transform_module_file = 'modules/_transform_module.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "rN6nRx3KnkpM"
- },
- "outputs": [],
- "source": [
- "%%writefile {_transform_module_file}\n",
- "\n",
- "import tensorflow as tf\n",
- "\n",
- "def _fill_in_missing(x, default_value):\n",
- " \"\"\"Replace missing values in a SparseTensor.\n",
- "\n",
- " Fills in missing values of `x` with the default_value.\n",
- "\n",
- " Args:\n",
- " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
- " in the second dimension.\n",
- " default_value: the value with which to replace the missing values.\n",
- "\n",
- " Returns:\n",
- " A rank 1 tensor where missing values of `x` have been filled in.\n",
- " \"\"\"\n",
- " if not isinstance(x, tf.sparse.SparseTensor):\n",
- " return x\n",
- " return tf.squeeze(\n",
- " tf.sparse.to_dense(\n",
- " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
- " default_value),\n",
- " axis=1)\n",
- "\n",
- "def preprocessing_fn(inputs):\n",
- " outputs = {}\n",
- " # outputs[\"summary\"] = _fill_in_missing(inputs[\"summary\"],\"\")\n",
- " outputs[\"summary\"] = _fill_in_missing(inputs[\"text\"],\"\")\n",
- " return outputs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "v-f5NaLTiFmO"
- },
- "outputs": [],
- "source": [
- "preprocessor = tfx.components.Transform(\n",
- " examples=example_gen.outputs['examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " module_file=os.path.abspath(_transform_module_file))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MkjIuwHeiFmO"
- },
- "outputs": [],
- "source": [
- "context.run(preprocessor, enable_cache=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OH8OkaCwJgLF"
- },
- "source": [
- "Let's take a look at some of the transformed examples and check that they are indeed processed as intended."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "bt70Z16zJHy7"
- },
- "outputs": [],
- "source": [
- "def pprint_examples(artifact, n_examples=2):\n",
- " print(\"artifact:\", artifact, \"\\n\")\n",
- " uri = os.path.join(artifact.uri, \"Split-eval\")\n",
- " print(\"uri:\", uri, \"\\n\")\n",
- " tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
- " print(\"tfrecord_filenames:\", tfrecord_filenames, \"\\n\")\n",
- " dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- " for tfrecord in dataset.take(n_examples):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example.FromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Tg4I-TvXJIuO"
- },
- "outputs": [],
- "source": [
- "pprint_examples(preprocessor.outputs['transformed_examples'].get()[0])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "mJll-vDn_eJP"
- },
- "source": [
- "## Trainer\n",
- "\n",
- "Trainer component trains an ML model, and it requires a model definition code from users.\n",
- "\n",
- "The `run_fn` function in TFX's Trainer component is the entry point for training a machine learning model. It is a user-supplied function that takes in a set of arguments and returns a model artifact.\n",
- "\n",
- "The `run_fn` function is responsible for:\n",
- "\n",
- "* Building the machine learning model.\n",
- "* Training the model on the training data.\n",
- "* Saving the trained model to the serving model directory.\n",
- "\n",
- "\n",
- "### Write model training code\n",
- "We will create a very simple fine-tuned model, with the preprocessing GPT-2 model. First, we need to create a module that contains the `run_fn` function for TFX Trainer because TFX Trainer expects the `run_fn` function to be defined in a module. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "OQPtqKG5pmpn"
- },
- "outputs": [],
- "source": [
- "model_file = \"modules/model.py\"\n",
- "model_fn = \"modules.model.run_fn\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6drMNHJMAk7g"
- },
- "source": [
- "Now, we write the run_fn function:\n",
- "\n",
- "This run_fn function first gets the training data from the `fn_args.examples` argument. It then gets the schema of the training data from the `fn_args.schema` argument. Next, it loads finetuned GPT-2 model along with its preprocessor. The model is then trained on the training data using the model.train() method.\n",
- "Finally, the trained model weights are saved to the `fn_args.serving_model_dir` argument.\n",
- "\n",
- "\n",
- "Now, we are going to work with Keras NLP's GPT-2 Model! You can learn about the full GPT-2 model implementation in KerasNLP on [GitHub](https://github.com/keras-team/keras-nlp/tree/r0.5/keras_nlp/models/gpt2) or can read and interactively test the model on [Google IO2023 colab notebook](https://colab.research.google.com/github/tensorflow/codelabs/blob/main/KerasNLP/io2023_workshop.ipynb#scrollTo=81EZQ0D1R8LL ).\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "B-ME_d8i2sTB"
- },
- "outputs": [],
- "source": [
- "import keras_nlp\n",
- "import keras\n",
- "import tensorflow as tf"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "*Note: To accommodate the limited resources of a free Colab GPU, we've adjusted the GPT-2 model's `sequence_length` parameter to `128` from its default `256`. This optimization enables efficient model training on the T4 GPU, facilitating faster fine-tuning while adhering to resource constraints.*"
- ],
- "metadata": {
- "id": "NnvkSqd6AB0q"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "N9yjLDqHoFb-"
- },
- "outputs": [],
- "source": [
- "%%writefile {model_file}\n",
- "\n",
- "import os\n",
- "import time\n",
- "from absl import logging\n",
- "import keras_nlp\n",
- "import more_itertools\n",
- "import pandas as pd\n",
- "import tensorflow as tf\n",
- "import keras\n",
- "import tfx\n",
- "import tfx.components.trainer.fn_args_utils\n",
- "import gc\n",
- "\n",
- "\n",
- "_EPOCH = 1\n",
- "_BATCH_SIZE = 20\n",
- "_INITIAL_LEARNING_RATE = 5e-5\n",
- "_END_LEARNING_RATE = 0.0\n",
- "_SEQUENCE_LENGTH = 128 # default value is 256\n",
- "\n",
- "def _input_fn(file_pattern: str) -\u003e list:\n",
- " \"\"\"Retrieves training data and returns a list of articles for training.\n",
- "\n",
- " For each row in the TFRecordDataset, generated in the previous ExampleGen\n",
- " component, create a new tf.train.Example object and parse the TFRecord into\n",
- " the example object. Articles, which are initially in bytes objects, are\n",
- " decoded into a string.\n",
- "\n",
- " Args:\n",
- " file_pattern: Path to the TFRecord file of the training dataset.\n",
- "\n",
- " Returns:\n",
- " A list of training articles.\n",
- "\n",
- " Raises:\n",
- " FileNotFoundError: If TFRecord dataset is not found in the file_pattern\n",
- " directory.\n",
- " \"\"\"\n",
- "\n",
- " if os.path.basename(file_pattern) == '*':\n",
- " file_loc = os.path.dirname(file_pattern)\n",
- "\n",
- " else:\n",
- " raise FileNotFoundError(\n",
- " f\"There is no file in the current directory: '{file_pattern}.\"\n",
- " )\n",
- "\n",
- " file_paths = [os.path.join(file_loc, name) for name in os.listdir(file_loc)]\n",
- " train_articles = []\n",
- " parsed_dataset = tf.data.TFRecordDataset(file_paths, compression_type=\"GZIP\")\n",
- " for raw_record in parsed_dataset:\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(raw_record.numpy())\n",
- " train_articles.append(\n",
- " example.features.feature[\"summary\"].bytes_list.value[0].decode('utf-8')\n",
- " )\n",
- " return train_articles\n",
- "\n",
- "def run_fn(fn_args: tfx.components.trainer.fn_args_utils.FnArgs) -\u003e None:\n",
- " \"\"\"Trains the model and outputs the trained model to a the desired location given by FnArgs.\n",
- "\n",
- " Args:\n",
- " FnArgs : Args to pass to user defined training/tuning function(s)\n",
- " \"\"\"\n",
- "\n",
- " train_articles = pd.Series(_input_fn(\n",
- " fn_args.train_files[0],\n",
- " ))\n",
- " tf_train_ds = tf.data.Dataset.from_tensor_slices(train_articles)\n",
- "\n",
- " gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(\n",
- " 'gpt2_base_en',\n",
- " sequence_length=_SEQUENCE_LENGTH,\n",
- " add_end_token=True,\n",
- " )\n",
- " gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(\n",
- " 'gpt2_base_en', preprocessor=gpt2_preprocessor\n",
- " )\n",
- "\n",
- " processed_ds = (\n",
- " tf_train_ds\n",
- " .batch(_BATCH_SIZE)\n",
- " .cache()\n",
- " .prefetch(tf.data.AUTOTUNE)\n",
- " )\n",
- "\n",
- " gpt2_lm.include_preprocessing = False\n",
- "\n",
- " lr = tf.keras.optimizers.schedules.PolynomialDecay(\n",
- " 5e-5,\n",
- " decay_steps=processed_ds.cardinality() * _EPOCH,\n",
- " end_learning_rate=0.0,\n",
- " )\n",
- " loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n",
- "\n",
- " gpt2_lm.compile(\n",
- " optimizer=keras.optimizers.Adam(lr),\n",
- " loss=loss,\n",
- " weighted_metrics=['accuracy'],\n",
- " )\n",
- "\n",
- " gpt2_lm.fit(processed_ds, epochs=_EPOCH)\n",
- " if os.path.exists(fn_args.serving_model_dir):\n",
- " os.rmdir(fn_args.serving_model_dir)\n",
- " os.mkdir(fn_args.serving_model_dir)\n",
- " gpt2_lm.save_weights(\n",
- " filepath=os.path.join(fn_args.serving_model_dir, \"model_weights.weights.h5\")\n",
- " )\n",
- " del gpt2_lm, gpt2_preprocessor, processed_ds, tf_train_ds\n",
- " gc.collect()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "bnbMFKqc5gfK"
- },
- "outputs": [],
- "source": [
- "trainer = tfx.components.Trainer(\n",
- " run_fn=model_fn,\n",
- " examples=preprocessor.outputs['transformed_examples'],\n",
- " train_args=tfx.proto.TrainArgs(splits=['train']),\n",
- " eval_args=tfx.proto.EvalArgs(splits=['train']),\n",
- " schema=schema_gen.outputs['schema'],\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "COCqeu-8CyHN"
- },
- "outputs": [],
- "source": [
- "context.run(trainer, enable_cache=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "btljwhMwWeQ9"
- },
- "source": [
- "## Inference and Evaluation\n",
- "\n",
- "With our model fine-tuned, let's evaluate its performance by generating inferences. To capture and preserve these results, we'll create an EvaluationMetric artifact.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "S79afpeeVkwc"
- },
- "outputs": [],
- "source": [
- "from tfx.types import artifact\n",
- "from tfx import types\n",
- "\n",
- "Property = artifact.Property\n",
- "PropertyType = artifact.PropertyType\n",
- "\n",
- "DURATION_PROPERTY = Property(type=PropertyType.FLOAT)\n",
- "EVAL_OUTPUT_PROPERTY = Property(type=PropertyType.STRING)\n",
- "\n",
- "class EvaluationMetric(types.Artifact):\n",
- " \"\"\"Artifact that contains metrics for a model.\n",
- "\n",
- " * Properties:\n",
- "\n",
- " - 'model_prediction_time' : time it took for the model to make predictions\n",
- " based on the input text.\n",
- " - 'model_evaluation_output_path' : saves the path to the CSV file that\n",
- " contains the model's prediction based on the testing inputs.\n",
- " \"\"\"\n",
- " TYPE_NAME = 'Evaluation_Metric'\n",
- " PROPERTIES = {\n",
- " 'model_prediction_time': DURATION_PROPERTY,\n",
- " 'model_evaluation_output_path': EVAL_OUTPUT_PROPERTY,\n",
- " }"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GQ3Wq2Ylb6JF"
- },
- "source": [
- "These helper functions contribute to the evaluation of a language model (LLM) by providing tools for calculating perplexity, a key metric reflecting the model's ability to predict the next word in a sequence, and by facilitating the extraction, preparation, and processing of evaluation data. The `input_fn` function retrieves training data from a specified TFRecord file, while the `trim_sentence` function ensures consistency by limiting sentence length. A lower perplexity score indicates higher prediction confidence and generally better model performance, making these functions essential for comprehensive evaluation within the LLM pipeline.\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tkXaZlsg38jI"
- },
- "outputs": [],
- "source": [
- "\"\"\"This is an evaluation component for the LLM pipeline takes in a\n",
- "standard trainer artifact and outputs a custom evaluation artifact.\n",
- "It displays the evaluation output in the colab notebook.\n",
- "\"\"\"\n",
- "import os\n",
- "import time\n",
- "import keras_nlp\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import tensorflow as tf\n",
- "import tfx.v1 as tfx\n",
- "\n",
- "def input_fn(file_pattern: str) -\u003e list:\n",
- " \"\"\"Retrieves training data and returns a list of articles for training.\n",
- "\n",
- " Args:\n",
- " file_pattern: Path to the TFRecord file of the training dataset.\n",
- "\n",
- " Returns:\n",
- " A list of test articles\n",
- "\n",
- " Raises:\n",
- " FileNotFoundError: If the file path does not exist.\n",
- " \"\"\"\n",
- " if os.path.exists(file_pattern):\n",
- " file_paths = [os.path.join(file_pattern, name) for name in os.listdir(file_pattern)]\n",
- " test_articles = []\n",
- " parsed_dataset = tf.data.TFRecordDataset(file_paths, compression_type=\"GZIP\")\n",
- " for raw_record in parsed_dataset:\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(raw_record.numpy())\n",
- " test_articles.append(\n",
- " example.features.feature[\"summary\"].bytes_list.value[0].decode('utf-8')\n",
- " )\n",
- " return test_articles\n",
- " else:\n",
- " raise FileNotFoundError(f'File path \"{file_pattern}\" does not exist.')\n",
- "\n",
- "def trim_sentence(sentence: str, max_words: int = 20):\n",
- " \"\"\"Trims the sentence to include up to the given number of words.\n",
- "\n",
- " Args:\n",
- " sentence: The sentence to trim.\n",
- " max_words: The maximum number of words to include in the trimmed sentence.\n",
- "\n",
- " Returns:\n",
- " The trimmed sentence.\n",
- " \"\"\"\n",
- " words = sentence.split(' ')\n",
- " if len(words) \u003c= max_words:\n",
- " return sentence\n",
- " return ' '.join(words[:max_words])"
- ]
- },
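`trim_sentence` simply keeps the first `max_words` whitespace-separated words, returning the sentence unchanged when it is already short enough. Restating the definition here (it matches the one above) to show its behavior:

```python
def trim_sentence(sentence: str, max_words: int = 20):
    """Trims the sentence to include up to the given number of words."""
    words = sentence.split(' ')
    if len(words) <= max_words:
        return sentence
    return ' '.join(words[:max_words])

print(trim_sentence("one two three four five", max_words=3))  # → 'one two three'
print(trim_sentence("short sentence"))                        # → 'short sentence'
```

The Evaluator below uses this to build short prompts (`_INPUT_LENGTH` words) and length-matched reference texts (`max_length` words) from the same test articles.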
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ypRrAQMpfEFd"
- },
- "source": [
- "![perplexity.png](images/gpt2_fine_tuning_and_conversion/perplexity.png)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "yo5fvOa9GmzL"
- },
- "source": [
- "One of the useful metrics for evaluating a Large Language Model is **Perplexity**. Perplexity is a measure of how well a language model predicts the next token in a sequence. It is calculated by taking the exponentiation of the average negative log-likelihood of the next token. A lower perplexity score indicates that the language model is better at predicting the next token.\n",
- "\n",
- "This is the *formula* for calculating perplexity.\n",
- "\n",
- " $\\text{Perplexity} = \\exp(-1 * $ Average Negative Log Likelihood $) =\n",
- " \\exp\\left(-\\frac{1}{T} \\sum_{t=1}^T \\log p(w_t | w_{\u003ct})\\right)$.\n",
- "\n",
- "\n",
- "In this colab notebook, we calculate perplexity using [keras_nlp's perplexity](https://keras.io/api/keras_nlp/metrics/perplexity/)."
- ]
- },
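The formula can be checked numerically: given per-token next-token probabilities, perplexity is the exponential of the average negative log-likelihood. A stdlib-only sketch (the probabilities are made up for illustration):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of next-token probabilities."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every next token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25]))  # → 4.0 (up to floating point)
```

`keras_nlp.metrics.Perplexity` computes the same quantity from logits and targets, which is what `calculate_perplexity` below relies on.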
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kNfs9ZplgPAH"
- },
- "source": [
- "**Computing Perplexity for Base GPT-2 Model and Finetuned Model**\n",
- "\n",
- "The code below is the function which will be used later in the notebook for computing perplexity for the base GPT-2 model and the finetuned model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "27iA8w6-GlSz"
- },
- "outputs": [],
- "source": [
- "def calculate_perplexity(gpt2_model, gpt2_tokenizer, sentence) -\u003e int:\n",
- " \"\"\"Calculates perplexity of a model given a sentence.\n",
- "\n",
- " Args:\n",
- " gpt2_model: GPT-2 Language Model\n",
- " gpt2_tokenizer: A GPT-2 tokenizer using Byte-Pair Encoding subword segmentation.\n",
- " sentence: Sentence that the model's perplexity is calculated upon.\n",
- "\n",
- " Returns:\n",
- " A perplexity score.\n",
- " \"\"\"\n",
- " # gpt2_tokenizer([sentence])[0] produces a tensor containing an array of tokens that form the sentence.\n",
- " tokens = gpt2_tokenizer([sentence])[0].numpy()\n",
- " # decoded_sentences is an array containing sentences that increase by one token in size.\n",
- " # e.g. if tokens for a sentence \"I love dogs\" are [\"I\", \"love\", \"dogs\"], then decoded_sentences = [\"I love\", \"I love dogs\"]\n",
- " decoded_sentences = [gpt2_tokenizer.detokenize([tokens[:i]])[0].numpy() for i in range(1, len(tokens))]\n",
- " predictions = gpt2_model.predict(decoded_sentences)\n",
- " logits = [predictions[i - 1][i] for i in range(1, len(tokens))]\n",
- " target = tokens[1:].reshape(len(tokens) - 1, 1)\n",
- " perplexity = keras_nlp.metrics.Perplexity(from_logits=True)\n",
- " perplexity.update_state(target, logits)\n",
- " result = perplexity.result()\n",
- " return result.numpy()\n",
- "\n",
- "def average_perplexity(gpt2_model, gpt2_tokenizer, sentences):\n",
- " perplexity_lst = [calculate_perplexity(gpt2_model, gpt2_tokenizer, sent) for sent in sentences]\n",
- " return np.mean(perplexity_lst)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ELmkaY-ygbog"
- },
- "source": [
- "## Evaluator\n",
- "\n",
- "Having established the necessary helper functions for evaluation, we proceed to define the Evaluator component. This component facilitates model inference using both base and fine-tuned models, computes perplexity scores for all models, and measures inference time. The Evaluator's output provides comprehensive insights for a thorough comparison and assessment of each model's performance."
- ]
- },
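The Evaluator measures wall-clock inference time by bracketing the generation loop with `time.time()` calls and dividing by the number of test inputs. The pattern in isolation (`slow_generate` is a hypothetical stand-in for `gpt2_lm.generate`):

```python
import time

def slow_generate(prompt):
    """Hypothetical stand-in for gpt2_lm.generate(): sleeps to simulate work."""
    time.sleep(0.01)
    return prompt + " ..."

inputs = ["a", "b", "c", "d"]

start = time.time()
outputs = [slow_generate(p) for p in inputs]
end = time.time()

# Average seconds per generated output, as reported in the evaluation table.
average_inference_time = (end - start) / len(inputs)
print(f"average inference time: {average_inference_time:.4f} sec per input")
```

Note this is a coarse measurement: it includes any per-call overhead and, for the base model in the component below, the first call also pays one-time warm-up costs.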
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Eb5fD5vzEQJ0"
- },
- "outputs": [],
- "source": [
- "@tfx.dsl.components.component\n",
- "def Evaluator(\n",
- " examples: tfx.dsl.components.InputArtifact[\n",
- " tfx.types.standard_artifacts.Examples\n",
- " ],\n",
- " trained_model: tfx.dsl.components.InputArtifact[\n",
- " tfx.types.standard_artifacts.Model\n",
- " ],\n",
- " max_length: tfx.dsl.components.Parameter[int],\n",
- " evaluation: tfx.dsl.components.OutputArtifact[EvaluationMetric],\n",
- ") -\u003e None:\n",
- " \"\"\"Makes inferences with base model, finetuned model, TFlite model, and quantized model.\n",
- "\n",
- " Args:\n",
- " examples: Standard TFX examples artifacts for retreiving test dataset.\n",
- " trained_model: Standard TFX trained model artifact finetuned with imdb-reviews\n",
- " dataset.\n",
- " tflite_model: Unquantized TFLite model.\n",
- " quantized_model: Quantized TFLite model.\n",
- " max_length: Length of the text that the model generates given custom input\n",
- " statements.\n",
- " evaluation: An evaluation artifact that saves predicted outcomes of custom\n",
- " inputs in a csv document and inference speed of the model.\n",
- " \"\"\"\n",
- " _TEST_SIZE = 10\n",
- " _INPUT_LENGTH = 10\n",
- " _SEQUENCE_LENGTH = 128\n",
- "\n",
- " path = os.path.join(examples.uri, 'Split-eval')\n",
- " test_data = input_fn(path)\n",
- " evaluation_inputs = [\n",
- " trim_sentence(article, max_words=_INPUT_LENGTH)\n",
- " for article in test_data[:_TEST_SIZE]\n",
- " ]\n",
- " true_test = [\n",
- " trim_sentence(article, max_words=max_length)\n",
- " for article in test_data[:_TEST_SIZE]\n",
- " ]\n",
- "\n",
- " # Loading base model, making inference, and calculating perplexity on the base model.\n",
- " gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(\n",
- " 'gpt2_base_en',\n",
- " sequence_length=_SEQUENCE_LENGTH,\n",
- " add_end_token=True,\n",
- " )\n",
- " gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(\n",
- " 'gpt2_base_en', preprocessor=gpt2_preprocessor\n",
- " )\n",
- " gpt2_tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset('gpt2_base_en')\n",
- "\n",
- " base_average_perplexity = average_perplexity(\n",
- " gpt2_lm, gpt2_tokenizer, true_test\n",
- " )\n",
- "\n",
- " start_base_model = time.time()\n",
- " base_evaluation = [\n",
- " gpt2_lm.generate(input, max_length)\n",
- " for input in evaluation_inputs\n",
- " ]\n",
- " end_base_model = time.time()\n",
- "\n",
- " # Loading finetuned model and making inferences with the finetuned model.\n",
- " model_weights_path = os.path.join(\n",
- " trained_model.uri, \"Format-Serving\", \"model_weights.weights.h5\"\n",
- " )\n",
- " gpt2_lm.load_weights(model_weights_path)\n",
- "\n",
- " trained_model_average_perplexity = average_perplexity(\n",
- " gpt2_lm, gpt2_tokenizer, true_test\n",
- " )\n",
- "\n",
- " start_trained = time.time()\n",
- " trained_evaluation = [\n",
- " gpt2_lm.generate(input, max_length)\n",
- " for input in evaluation_inputs\n",
- " ]\n",
- " end_trained = time.time()\n",
- "\n",
- " # Building an inference table.\n",
- " inference_data = {\n",
- " 'input': evaluation_inputs,\n",
- " 'actual_test_output': true_test,\n",
- " 'base_model_prediction': base_evaluation,\n",
- " 'trained_model_prediction': trained_evaluation,\n",
- " }\n",
- "\n",
- " models = [\n",
- " 'Base Model',\n",
- " 'Finetuned Model',\n",
- " ]\n",
- " inference_time = [\n",
- " (end_base_model - start_base_model),\n",
- " (end_trained - start_trained),\n",
- " ]\n",
- " average_inference_time = [time / _TEST_SIZE for time in inference_time]\n",
- " average_perplexity_lst = [\n",
- " base_average_perplexity,\n",
- " trained_model_average_perplexity,\n",
- " ]\n",
- " evaluation_data = {\n",
- " 'Model': models,\n",
- " 'Average Inference Time (sec)': average_inference_time,\n",
- " 'Average Perplexity': average_perplexity_lst,\n",
- " }\n",
- "\n",
- " # creating directory in examples artifact to save metric dataframes\n",
- " metrics_path = os.path.join(evaluation.uri, 'metrics')\n",
- " if not os.path.exists(metrics_path):\n",
- " os.mkdir(metrics_path)\n",
- "\n",
- " evaluation_df = pd.DataFrame(evaluation_data).set_index('Model').transpose()\n",
- " evaluation_path = os.path.join(metrics_path, 'evaluation_output.csv')\n",
- " evaluation_df.to_csv(evaluation_path)\n",
- "\n",
- " inference_df = pd.DataFrame(inference_data)\n",
- " inference_path = os.path.join(metrics_path, 'inference_output.csv')\n",
- " inference_df.to_csv(inference_path)\n",
- " evaluation.model_evaluation_output_path = inference_path"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "UkC0RrleWP9O"
- },
- "outputs": [],
- "source": [
- "evaluator = Evaluator(examples = preprocessor.outputs['transformed_examples'],\n",
- " trained_model = trainer.outputs['model'],\n",
- " max_length = 50)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "KQQvbT96XXDT"
- },
- "outputs": [],
- "source": [
- "context.run(evaluator, enable_cache = False)"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "### Evaluator Results"
- ],
- "metadata": {
- "id": "xVUIimCogdjZ"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Once our evaluation component execution is completed, we will load the evaluation metrics from evaluator URI and display them.\n",
- "\n",
- "\n",
- "*Note:*\n",
- "\n",
- "**Perplexity Calculation:**\n",
- "*Perplexity is only one of many ways to evaluate LLMs. LLM evaluation is an [active research topic](https://arxiv.org/abs/2307.03109) and a comprehensive treatment is beyond the scope of this notebook.*"
- ],
- "metadata": {
- "id": "EPKArU8f3FpD"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "NVv5F_Ok7Jss"
- },
- "outputs": [],
- "source": [
- "evaluation_path = os.path.join(evaluator.outputs['evaluation']._artifacts[0].uri, 'metrics')\n",
- "inference_df = pd.read_csv(os.path.join(evaluation_path, 'inference_output.csv'), index_col=0)\n",
- "evaluation_df = pd.read_csv(os.path.join(evaluation_path, 'evaluation_output.csv'), index_col=0)"
- ]
- },
- {
- "metadata": {
- "id": "qndIFspM9ELf"
- },
- "cell_type": "markdown",
- "source": [
- "The fine-tuned GPT-2 model exhibits a slight improvement in perplexity compared to the baseline model. Further training with more epochs or a larger dataset may yield more substantial perplexity reductions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XvtAnvrm6H-a"
- },
- "outputs": [],
- "source": [
- "from IPython import display\n",
- "display.display(display.HTML(inference_df.to_html()))\n",
- "display.display(display.HTML(evaluation_df.to_html()))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "RiCy6OQ7J3C5"
- },
- "source": [
- "# Running the Entire Pipeline"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "*Note: For running below section, a more substantial amount of GPU memory is required. Therefore, Colab Pro or a local machine equipped with a higher-capacity GPU is recommended for running below pipeline.*"
- ],
- "metadata": {
- "id": "AJmAdbO9AWpx"
- }
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kvYtjmkFHSxu"
- },
- "source": [
- "TFX supports multiple orchestrators to run pipelines. In this tutorial we will use LocalDagRunner which is included in the TFX Python package and runs pipelines on local environment. We often call TFX pipelines \"DAGs\" which stands for directed acyclic graph.\n",
- "\n",
- "LocalDagRunner provides fast iterations for development and debugging. TFX also supports other orchestrators including Kubeflow Pipelines and Apache Airflow which are suitable for production use cases. See [TFX on Cloud AI Platform Pipelines](/tutorials/tfx/cloud-ai-platform-pipelines) or [TFX Airflow](/tutorials/tfx/airflow_workshop) Tutorial to learn more about other orchestration systems.\n",
- "\n",
- "Now we create a LocalDagRunner and pass a Pipeline object created from the function we already defined. The pipeline runs directly and you can see logs for the progress of the pipeline including ML model training."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4FQgyxOQLn22"
- },
- "outputs": [],
- "source": [
- "import urllib.request\n",
- "import tempfile\n",
- "import os\n",
- "\n",
- "PIPELINE_NAME = \"tfx-llm-imdb-reviews\"\n",
- "model_fn = \"modules.model.run_fn\"\n",
- "_transform_module_file = \"modules/_transform_module.py\"\n",
- "\n",
- "# Output directory to store artifacts generated from the pipeline.\n",
- "PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)\n",
- "# Path to a SQLite DB file to use as an MLMD storage.\n",
- "METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')\n",
- "# Output directory where created models from the pipeline will be exported.\n",
- "SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)\n",
- "\n",
- "from absl import logging\n",
- "logging.set_verbosity(logging.INFO) # Set default logging level."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tgTwBpN-pe3_"
- },
- "outputs": [],
- "source": [
- "def _create_pipeline(\n",
- " pipeline_name: str,\n",
- " pipeline_root: str,\n",
- " model_fn: str,\n",
- " serving_model_dir: str,\n",
- " metadata_path: str,\n",
- ") -\u003e tfx.dsl.Pipeline:\n",
- " \"\"\"Creates a Pipeline for Fine-Tuning and Converting an Large Language Model with TFX.\"\"\"\n",
- "\n",
- " example_gen = FileBasedExampleGen(\n",
- " input_base='dummy',\n",
- " custom_config={'dataset':'imdb_reviews', 'split':'train[:5%]'},\n",
- " custom_executor_spec=executor_spec.BeamExecutorSpec(TFDSExecutor))\n",
- "\n",
- " statistics_gen = tfx.components.StatisticsGen(\n",
- " examples=example_gen.outputs['examples'], exclude_splits=['eval']\n",
- " )\n",
- "\n",
- " schema_gen = tfx.components.SchemaGen(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " infer_feature_shape=False,\n",
- " exclude_splits=['eval'],\n",
- " )\n",
- "\n",
- " example_validator = tfx.components.ExampleValidator(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " exclude_splits=['eval'],\n",
- " )\n",
- "\n",
- " preprocessor = tfx.components.Transform(\n",
- " examples=example_gen.outputs['examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " module_file= _transform_module_file,\n",
- " )\n",
- "\n",
- " trainer = tfx.components.Trainer(\n",
- " run_fn=model_fn,\n",
- " examples=preprocessor.outputs['transformed_examples'],\n",
- " train_args=tfx.proto.TrainArgs(splits=['train']),\n",
- " eval_args=tfx.proto.EvalArgs(splits=['train']),\n",
- " schema=schema_gen.outputs['schema'],\n",
- " )\n",
- "\n",
- "\n",
- " evaluator = Evaluator(\n",
- " examples=preprocessor.outputs['transformed_examples'],\n",
- " trained_model=trainer.outputs['model'],\n",
- " max_length=50,\n",
- " )\n",
- "\n",
- " # Following 7 components will be included in the pipeline.\n",
- " components = [\n",
- " example_gen,\n",
- " statistics_gen,\n",
- " schema_gen,\n",
- " example_validator,\n",
- " preprocessor,\n",
- " trainer,\n",
- " evaluator,\n",
- " ]\n",
- "\n",
- " return tfx.dsl.Pipeline(\n",
- " pipeline_name=pipeline_name,\n",
- " pipeline_root=pipeline_root,\n",
- " metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(\n",
- " metadata_path\n",
- " ),\n",
- " components=components,\n",
- " )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "DkgLXyZGJ9CO"
- },
- "outputs": [],
- "source": [
- "tfx.orchestration.LocalDagRunner().run(\n",
- " _create_pipeline(\n",
- " pipeline_name=PIPELINE_NAME,\n",
- " pipeline_root=PIPELINE_ROOT,\n",
- " model_fn=model_fn,\n",
- " serving_model_dir=SERVING_MODEL_DIR,\n",
- " metadata_path=METADATA_PATH,\n",
- " )\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Mo3Z08xzHa4G"
- },
- "source": [
- "You should see INFO:absl:Component Evaluator is finished.\" at the end of the logs if the pipeline finished successfully because evaluator component is the last component of the pipeline."
- ]
- }
- ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HU9YYythm0dx"
+ },
+ "source": [
+ "### Why is this pipeline useful?\n",
+ "\n",
+ "TFX pipelines provide a powerful and structured approach to building and managing machine learning workflows, particularly those involving large language models. They offer significant advantages over traditional Python code, including:\n",
+ "\n",
+ "1. Enhanced Reproducibility: TFX pipelines ensure consistent results by capturing all steps and dependencies, eliminating the inconsistencies often associated with manual workflows.\n",
+ "\n",
+ "2. Scalability and Modularity: TFX allows for breaking down complex workflows into manageable, reusable components, promoting code organization.\n",
+ "\n",
+ "3. Streamlined Fine-Tuning and Conversion: The pipeline structure streamlines the fine-tuning and conversion processes of large language models, significantly reducing manual effort and time.\n",
+ "\n",
+ "4. Comprehensive Lineage Tracking: Through metadata tracking, TFX pipelines provide a clear understanding of data and model provenance, making debugging, auditing, and performance analysis much easier and more efficient.\n",
+ "\n",
+ "By leveraging the benefits of TFX pipelines, organizations can effectively manage the complexity of large language model development and deployment, achieving greater efficiency and control over their machine learning processes.\n",
+ "\n",
+ "### Note\n",
+ "*GPT-2 is used here only to demonstrate the end-to-end process; the techniques and tooling introduced in this codelab are potentially transferrable to other generative language models such as Google T5.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2WgJ8Z8gJB0s"
+ },
+ "source": [
+ "## Before You Begin\n",
+ "\n",
+ "Colab offers different kinds of runtimes. Make sure to go to **Runtime -> Change runtime type** and choose the GPU Hardware Accelerator runtime since you will finetune the GPT-2 model.\n",
+ "\n",
+ "**This tutorial's interactive pipeline is designed to function seamlessly with free Colab GPUs. However, for users opting to run the pipeline using the LocalDagRunner orchestrator (code provided at the end of this tutorial), a more substantial amount of GPU memory is required. Therefore, Colab Pro or a local machine equipped with a higher-capacity GPU is recommended for this approach.**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-sj3HvNcJEgC"
+ },
+ "source": [
+ "## Set Up\n",
+ "\n",
+ "We first install required python packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "73c9sPckJFSi"
+ },
+ "source": [
+ "### Upgrade Pip\n",
+ "To avoid upgrading Pip in a system when running locally, check to make sure that we are running in Colab. Local systems can of course be upgraded separately."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "45pIxa6afWOf",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "try:\n",
+ " import colab\n",
+ " !pip install --upgrade pip\n",
+ "\n",
+ "except:\n",
+ " pass"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yIf40NdqJLAH"
+ },
+ "source": [
+ "### Install TFX, Keras 3, KerasNLP and required Libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "A6mBN4dzfct7",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q tfx tensorflow-text more_itertools tensorflow_datasets\n",
+ "!pip install -q --upgrade keras-nlp\n",
+ "!pip install -q --upgrade keras"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KnyILJ-k3NAy"
+ },
+ "source": [
+ "*Note: pip's dependency resolver errors can be ignored. The required packages for this tutorial works as expected.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "V0tnFDm6JRq_",
+ "tags": []
+ },
+ "source": [
+ "### Did you restart the runtime?\n",
+ "\n",
+ "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime by clicking above \"RESTART SESSION\" button or using `\"Runtime > Restart session\"` menu. This is because of the way that Colab loads packages.\n",
+ "\n",
+ "Let's check the TensorFlow, Keras, Keras-nlp and TFX library versions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Hf5FbRzcfpMg",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.environ[\"KERAS_BACKEND\"] = \"tensorflow\"\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "print('TensorFlow version: {}'.format(tf.__version__))\n",
+ "from tfx import v1 as tfx\n",
+ "print('TFX version: {}'.format(tfx.__version__))\n",
+ "import keras\n",
+ "print('Keras version: {}'.format(keras.__version__))\n",
+ "import keras_nlp\n",
+ "print('Keras NLP version: {}'.format(keras_nlp.__version__))\n",
+ "\n",
+ "keras.mixed_precision.set_global_policy(\"mixed_float16\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ng1a9cCAtepl"
+ },
+ "source": [
+ "### Using TFX Interactive Context"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "k7ikXCc7v7Rh"
+ },
+ "source": [
+ "An interactive context is used to provide global context when running a TFX pipeline in a notebook without using a runner or orchestrator such as Apache Airflow or Kubeflow. This style of development is only useful when developing the code for a pipeline, and cannot currently be used to deploy a working pipeline to production."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "TEge2nYDfwaM",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
+ "context = InteractiveContext()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GF6Kk3MLxxCC"
+ },
+ "source": [
+ "## Pipeline Overview\n",
+ "\n",
+ "Below are the components that this pipeline follows.\n",
+ "\n",
+ "* Custom Artifacts are artifacts that we have created for this pipeline. **Artifacts** are data that is produced by a component or consumed by a component. Artifacts are stored in a system for managing the storage and versioning of artifacts called MLMD.\n",
+ "\n",
+ "* **Components** are defined as the implementation of an ML task that you can use as a step in your pipeline\n",
+ "* Aside from artifacts, **Parameters** are passed into the components to specify an argument.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BIBO-ueGVVHa"
+ },
+ "source": [
+ "## ExampleGen\n",
+ "We create a custom ExampleGen component which we use to load a TensorFlow Datasets (TFDS) dataset. This uses a custom executor in a FileBasedExampleGen.\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pgvIaoAmXFVp",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from typing import Any, Dict, List, Text\n",
+ "import tensorflow_datasets as tfds\n",
+ "import apache_beam as beam\n",
+ "import json\n",
+ "from tfx.components.example_gen.base_example_gen_executor import BaseExampleGenExecutor\n",
+ "from tfx.components.example_gen.component import FileBasedExampleGen\n",
+ "from tfx.components.example_gen import utils\n",
+ "from tfx.dsl.components.base import executor_spec\n",
+ "import os\n",
+ "import pprint\n",
+ "pp = pprint.PrettyPrinter()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Cjd9Z6SpVRCE",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "@beam.ptransform_fn\n",
+ "@beam.typehints.with_input_types(beam.Pipeline)\n",
+ "@beam.typehints.with_output_types(tf.train.Example)\n",
+ "def _TFDatasetToExample(\n",
+ " pipeline: beam.Pipeline,\n",
+ " exec_properties: Dict[str, Any],\n",
+ " split_pattern: str\n",
+ " ) -> beam.pvalue.PCollection:\n",
+ " \"\"\"Read a TensorFlow Dataset and create tf.Examples\"\"\"\n",
+ " custom_config = json.loads(exec_properties['custom_config'])\n",
+ " dataset_name = custom_config['dataset']\n",
+ " split_name = custom_config['split']\n",
+ "\n",
+ " builder = tfds.builder(dataset_name)\n",
+ " builder.download_and_prepare()\n",
+ "\n",
+ " return (pipeline\n",
+ " | 'MakeExamples' >> tfds.beam.ReadFromTFDS(builder, split=split_name)\n",
+ " | 'AsNumpy' >> beam.Map(tfds.as_numpy)\n",
+ " | 'ToDict' >> beam.Map(dict)\n",
+ " | 'ToTFExample' >> beam.Map(utils.dict_to_example)\n",
+ " )\n",
+ "\n",
+ "class TFDSExecutor(BaseExampleGenExecutor):\n",
+ " def GetInputSourceToExamplePTransform(self) -> beam.PTransform:\n",
+ " \"\"\"Returns PTransform for TF Dataset to TF examples.\"\"\"\n",
+ " return _TFDatasetToExample"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2D159hAzJgK2"
+ },
+ "source": [
+ "For this demonstration, we're using a subset of the IMDb reviews dataset, representing 20% of the total data. This allows for a more manageable training process. You can modify the \"custom_config\" settings to experiment with larger amounts of data, up to the full dataset, depending on your computational resources."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "nNDu1ECBXuvI",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "example_gen = FileBasedExampleGen(\n",
+ " input_base='dummy',\n",
+ " custom_config={'dataset':'imdb_reviews', 'split':'train[:20%]'},\n",
+ " custom_executor_spec=executor_spec.BeamExecutorSpec(TFDSExecutor))\n",
+ "context.run(example_gen, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "74JGpvIgJgK2"
+ },
+ "source": [
+ "We've developed a handy utility for examining datasets composed of TFExamples. When used with the reviews dataset, this tool returns a clear dictionary containing both the text and the corresponding label."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "GA8VMXKogXxB",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def inspect_examples(component,\n",
+ " channel_name='examples',\n",
+ " split_name='train',\n",
+ " num_examples=1):\n",
+ " # Get the URI of the output artifact, which is a directory\n",
+ " full_split_name = 'Split-{}'.format(split_name)\n",
+ " print('channel_name: {}, split_name: {} (\\\"{}\\\"), num_examples: {}\\n'.format(\n",
+ " channel_name, split_name, full_split_name, num_examples))\n",
+ " train_uri = os.path.join(\n",
+ " component.outputs[channel_name].get()[0].uri, full_split_name)\n",
+ " print('train_uri: {}'.format(train_uri))\n",
+ "\n",
+ " # Get the list of files in this directory (all compressed TFRecord files)\n",
+ " tfrecord_filenames = [os.path.join(train_uri, name)\n",
+ " for name in os.listdir(train_uri)]\n",
+ "\n",
+ " # Create a `TFRecordDataset` to read these files\n",
+ " dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ "\n",
+ " # Iterate over the records and print them\n",
+ " print()\n",
+ " for tfrecord in dataset.take(num_examples):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rcUvtz5egaIy",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "inspect_examples(example_gen, num_examples=1, split_name='eval')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gVmx7JHK8RkO"
+ },
+ "source": [
+ "## StatisticsGen\n",
+ "\n",
+ "`StatisticsGen` component computes statistics over your dataset for data analysis, such as the number of examples, the number of features, and the data types of the features. It uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library. `StatisticsGen` takes as input the dataset we just ingested using `ExampleGen`.\n",
+ "\n",
+ "*Note that the statistics generator is appropriate for tabular data, and therefore, text dataset for this LLM tutorial may not be the optimal dataset for the analysis with statistics generator.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "TzeNGNEnyq_d",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from tfx.components import StatisticsGen"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "xWWl7LeRKsXA",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "statistics_gen = tfx.components.StatisticsGen(\n",
+ " examples=example_gen.outputs['examples'], exclude_splits=['eval']\n",
+ ")\n",
+ "context.run(statistics_gen, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "LnWKjMyIVVB7"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(statistics_gen.outputs['statistics'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oqXFJyoO9O8-"
+ },
+ "source": [
+ "## SchemaGen\n",
+ "\n",
+ "The `SchemaGen` component generates a schema based on your data statistics. (A schema defines the expected bounds, types, and properties of the features in your dataset.) It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "Note: The generated schema is best-effort and only tries to infer basic properties of the data. It is expected that you review and modify it as needed.\n",
+ "\n",
+ "`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "PpPFaV6tX5wQ",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "schema_gen = tfx.components.SchemaGen(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " infer_feature_shape=False,\n",
+ " exclude_splits=['eval'],\n",
+ ")\n",
+ "context.run(schema_gen, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "H6DNNUi3YAmo",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "context.show(schema_gen.outputs['schema'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GDdpADUb9VJR"
+ },
+ "source": [
+ "## ExampleValidator\n",
+ "\n",
+ "The `ExampleValidator` component detects anomalies in your data, based on the expectations defined by the schema. It also uses the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) library.\n",
+ "\n",
+ "`ExampleValidator` will take as input the statistics from `StatisticsGen`, and the schema from `SchemaGen`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "S_F5pLZ7YdZg"
+ },
+ "outputs": [],
+ "source": [
+ "example_validator = tfx.components.ExampleValidator(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " exclude_splits=['eval'],\n",
+ ")\n",
+ "context.run(example_validator, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DgiXSTRawolF"
+ },
+ "source": [
+ "After `ExampleValidator` finishes running, we can visualize the anomalies as a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "3eAHpc2UYfk_"
+ },
+ "outputs": [],
+ "source": [
+ "context.show(example_validator.outputs['anomalies'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7H6fecGTiFmN"
+ },
+ "source": [
+ "## Transform\n",
+ "\n",
+ "For a structured and repeatable design of a TFX pipeline we will need a scalable approach to feature engineering. The `Transform` component performs feature engineering for both training and serving. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library.\n",
+ "\n",
+ "\n",
+ "The Transform component uses a module file to supply user code for the feature engineering what we want to do, so our first step is to create that module file. We will only be working with the summary field.\n",
+ "\n",
+ "**Note:**\n",
+ "*The %%writefile {_movies_transform_module_file} cell magic below creates and writes the contents of that cell to a file on the notebook server where this notebook is running (for example, the Colab VM). When doing this outside of a notebook you would just create a Python file.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "22TBUtG9ME9N"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "if not os.path.exists(\"modules\"):\n",
+ " os.mkdir(\"modules\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "teaCGLgfnjw_"
+ },
+ "outputs": [],
+ "source": [
+ "_transform_module_file = 'modules/_transform_module.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rN6nRx3KnkpM"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_transform_module_file}\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "\n",
+ "def _fill_in_missing(x, default_value):\n",
+ " \"\"\"Replace missing values in a SparseTensor.\n",
+ "\n",
+ " Fills in missing values of `x` with the default_value.\n",
+ "\n",
+ " Args:\n",
+ " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
+ " in the second dimension.\n",
+ " default_value: the value with which to replace the missing values.\n",
+ "\n",
+ " Returns:\n",
+ " A rank 1 tensor where missing values of `x` have been filled in.\n",
+ " \"\"\"\n",
+ " if not isinstance(x, tf.sparse.SparseTensor):\n",
+ " return x\n",
+ " return tf.squeeze(\n",
+ " tf.sparse.to_dense(\n",
+ " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
+ " default_value),\n",
+ " axis=1)\n",
+ "\n",
+ "def preprocessing_fn(inputs):\n",
+ " outputs = {}\n",
+ " # outputs[\"summary\"] = _fill_in_missing(inputs[\"summary\"],\"\")\n",
+ " outputs[\"summary\"] = _fill_in_missing(inputs[\"text\"],\"\")\n",
+ " return outputs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "v-f5NaLTiFmO"
+ },
+ "outputs": [],
+ "source": [
+ "preprocessor = tfx.components.Transform(\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " module_file=os.path.abspath(_transform_module_file))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MkjIuwHeiFmO"
+ },
+ "outputs": [],
+ "source": [
+ "context.run(preprocessor, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OH8OkaCwJgLF"
+ },
+ "source": [
+ "Let's take a look at some of the transformed examples and check that they are indeed processed as intended."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bt70Z16zJHy7"
+ },
+ "outputs": [],
+ "source": [
+ "def pprint_examples(artifact, n_examples=2):\n",
+ " print(\"artifact:\", artifact, \"\\n\")\n",
+ " uri = os.path.join(artifact.uri, \"Split-eval\")\n",
+ " print(\"uri:\", uri, \"\\n\")\n",
+ " tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
+ " print(\"tfrecord_filenames:\", tfrecord_filenames, \"\\n\")\n",
+ " dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ " for tfrecord in dataset.take(n_examples):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example.FromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Tg4I-TvXJIuO"
+ },
+ "outputs": [],
+ "source": [
+ "pprint_examples(preprocessor.outputs['transformed_examples'].get()[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mJll-vDn_eJP"
+ },
+ "source": [
+ "## Trainer\n",
+ "\n",
+ "Trainer component trains an ML model, and it requires a model definition code from users.\n",
+ "\n",
+ "The `run_fn` function in TFX's Trainer component is the entry point for training a machine learning model. It is a user-supplied function that takes in a set of arguments and returns a model artifact.\n",
+ "\n",
+ "The `run_fn` function is responsible for:\n",
+ "\n",
+ "* Building the machine learning model.\n",
+ "* Training the model on the training data.\n",
+ "* Saving the trained model to the serving model directory.\n",
+ "\n",
+ "\n",
+ "### Write model training code\n",
+ "We will create a very simple fine-tuned model, with the preprocessing GPT-2 model. First, we need to create a module that contains the `run_fn` function for TFX Trainer because TFX Trainer expects the `run_fn` function to be defined in a module. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "OQPtqKG5pmpn"
+ },
+ "outputs": [],
+ "source": [
+ "model_file = \"modules/model.py\"\n",
+ "model_fn = \"modules.model.run_fn\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6drMNHJMAk7g"
+ },
+ "source": [
+ "Now, we write the run_fn function:\n",
+ "\n",
+ "This run_fn function first gets the training data from the `fn_args.examples` argument. It then gets the schema of the training data from the `fn_args.schema` argument. Next, it loads finetuned GPT-2 model along with its preprocessor. The model is then trained on the training data using the model.train() method.\n",
+ "Finally, the trained model weights are saved to the `fn_args.serving_model_dir` argument.\n",
+ "\n",
+ "\n",
+ "Now, we are going to work with Keras NLP's GPT-2 Model! You can learn about the full GPT-2 model implementation in KerasNLP on [GitHub](https://github.com/keras-team/keras-nlp/tree/r0.5/keras_nlp/models/gpt2) or can read and interactively test the model on [Google IO2023 colab notebook](https://colab.research.google.com/github/tensorflow/codelabs/blob/main/KerasNLP/io2023_workshop.ipynb#scrollTo=81EZQ0D1R8LL ).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "B-ME_d8i2sTB"
+ },
+ "outputs": [],
+ "source": [
+ "import keras_nlp\n",
+ "import keras\n",
+ "import tensorflow as tf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NnvkSqd6AB0q"
+ },
+ "source": [
+ "*Note: To accommodate the limited resources of a free Colab GPU, we've adjusted the GPT-2 model's `sequence_length` parameter to `128` from its default `256`. This optimization enables efficient model training on the T4 GPU, facilitating faster fine-tuning while adhering to resource constraints.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "N9yjLDqHoFb-"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {model_file}\n",
+ "\n",
+ "import os\n",
+ "import time\n",
+ "from absl import logging\n",
+ "import keras_nlp\n",
+ "import more_itertools\n",
+ "import pandas as pd\n",
+ "import tensorflow as tf\n",
+ "import keras\n",
+ "import tfx\n",
+ "import tfx.components.trainer.fn_args_utils\n",
+ "import gc\n",
+ "\n",
+ "\n",
+ "_EPOCH = 1\n",
+ "_BATCH_SIZE = 20\n",
+ "_INITIAL_LEARNING_RATE = 5e-5\n",
+ "_END_LEARNING_RATE = 0.0\n",
+ "_SEQUENCE_LENGTH = 128 # default value is 256\n",
+ "\n",
+ "def _input_fn(file_pattern: str) -> list:\n",
+ " \"\"\"Retrieves training data and returns a list of articles for training.\n",
+ "\n",
+ " For each row in the TFRecordDataset, generated in the previous ExampleGen\n",
+ " component, create a new tf.train.Example object and parse the TFRecord into\n",
+ " the example object. Articles, which are initially in bytes objects, are\n",
+ " decoded into a string.\n",
+ "\n",
+ " Args:\n",
+ " file_pattern: Path to the TFRecord file of the training dataset.\n",
+ "\n",
+ " Returns:\n",
+ " A list of training articles.\n",
+ "\n",
+ " Raises:\n",
+ " FileNotFoundError: If TFRecord dataset is not found in the file_pattern\n",
+ " directory.\n",
+ " \"\"\"\n",
+ "\n",
+ " if os.path.basename(file_pattern) == '*':\n",
+ " file_loc = os.path.dirname(file_pattern)\n",
+ "\n",
+ " else:\n",
+ " raise FileNotFoundError(\n",
+ " f\"There is no file in the current directory: '{file_pattern}.\"\n",
+ " )\n",
+ "\n",
+ " file_paths = [os.path.join(file_loc, name) for name in os.listdir(file_loc)]\n",
+ " train_articles = []\n",
+ " parsed_dataset = tf.data.TFRecordDataset(file_paths, compression_type=\"GZIP\")\n",
+ " for raw_record in parsed_dataset:\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(raw_record.numpy())\n",
+ " train_articles.append(\n",
+ " example.features.feature[\"summary\"].bytes_list.value[0].decode('utf-8')\n",
+ " )\n",
+ " return train_articles\n",
+ "\n",
+ "def run_fn(fn_args: tfx.components.trainer.fn_args_utils.FnArgs) -> None:\n",
+ " \"\"\"Trains the model and outputs the trained model to a the desired location given by FnArgs.\n",
+ "\n",
+ " Args:\n",
+ " FnArgs : Args to pass to user defined training/tuning function(s)\n",
+ " \"\"\"\n",
+ "\n",
+ " train_articles = pd.Series(_input_fn(\n",
+ " fn_args.train_files[0],\n",
+ " ))\n",
+ " tf_train_ds = tf.data.Dataset.from_tensor_slices(train_articles)\n",
+ "\n",
+ " gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(\n",
+ " 'gpt2_base_en',\n",
+ " sequence_length=_SEQUENCE_LENGTH,\n",
+ " add_end_token=True,\n",
+ " )\n",
+ " gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(\n",
+ " 'gpt2_base_en', preprocessor=gpt2_preprocessor\n",
+ " )\n",
+ "\n",
+ " processed_ds = (\n",
+ " tf_train_ds\n",
+ " .batch(_BATCH_SIZE)\n",
+ " .cache()\n",
+ " .prefetch(tf.data.AUTOTUNE)\n",
+ " )\n",
+ "\n",
+ " gpt2_lm.include_preprocessing = False\n",
+ "\n",
+ " lr = tf.keras.optimizers.schedules.PolynomialDecay(\n",
+ " 5e-5,\n",
+ " decay_steps=processed_ds.cardinality() * _EPOCH,\n",
+ " end_learning_rate=0.0,\n",
+ " )\n",
+ " loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n",
+ "\n",
+ " gpt2_lm.compile(\n",
+ " optimizer=keras.optimizers.Adam(lr),\n",
+ " loss=loss,\n",
+ " weighted_metrics=['accuracy'],\n",
+ " )\n",
+ "\n",
+ " gpt2_lm.fit(processed_ds, epochs=_EPOCH)\n",
+ " if os.path.exists(fn_args.serving_model_dir):\n",
+ " os.rmdir(fn_args.serving_model_dir)\n",
+ " os.mkdir(fn_args.serving_model_dir)\n",
+ " gpt2_lm.save_weights(\n",
+ " filepath=os.path.join(fn_args.serving_model_dir, \"model_weights.weights.h5\")\n",
+ " )\n",
+ " del gpt2_lm, gpt2_preprocessor, processed_ds, tf_train_ds\n",
+ " gc.collect()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bnbMFKqc5gfK"
+ },
+ "outputs": [],
+ "source": [
+ "trainer = tfx.components.Trainer(\n",
+ " run_fn=model_fn,\n",
+ " examples=preprocessor.outputs['transformed_examples'],\n",
+ " train_args=tfx.proto.TrainArgs(splits=['train']),\n",
+ " eval_args=tfx.proto.EvalArgs(splits=['train']),\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "COCqeu-8CyHN"
+ },
+ "outputs": [],
+ "source": [
+ "context.run(trainer, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "btljwhMwWeQ9"
+ },
+ "source": [
+ "## Inference and Evaluation\n",
+ "\n",
+ "With our model fine-tuned, let's evaluate its performance by generating inferences. To capture and preserve these results, we'll create an EvaluationMetric artifact.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "S79afpeeVkwc"
+ },
+ "outputs": [],
+ "source": [
+ "from tfx.types import artifact\n",
+ "from tfx import types\n",
+ "\n",
+ "Property = artifact.Property\n",
+ "PropertyType = artifact.PropertyType\n",
+ "\n",
+ "DURATION_PROPERTY = Property(type=PropertyType.FLOAT)\n",
+ "EVAL_OUTPUT_PROPERTY = Property(type=PropertyType.STRING)\n",
+ "\n",
+ "class EvaluationMetric(types.Artifact):\n",
+ " \"\"\"Artifact that contains metrics for a model.\n",
+ "\n",
+ " * Properties:\n",
+ "\n",
+ " - 'model_prediction_time' : time it took for the model to make predictions\n",
+ " based on the input text.\n",
+ " - 'model_evaluation_output_path' : saves the path to the CSV file that\n",
+ " contains the model's prediction based on the testing inputs.\n",
+ " \"\"\"\n",
+ " TYPE_NAME = 'Evaluation_Metric'\n",
+ " PROPERTIES = {\n",
+ " 'model_prediction_time': DURATION_PROPERTY,\n",
+ " 'model_evaluation_output_path': EVAL_OUTPUT_PROPERTY,\n",
+ " }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GQ3Wq2Ylb6JF"
+ },
+ "source": [
+ "These helper functions contribute to the evaluation of a language model (LLM) by providing tools for calculating perplexity, a key metric reflecting the model's ability to predict the next word in a sequence, and by facilitating the extraction, preparation, and processing of evaluation data. The `input_fn` function retrieves training data from a specified TFRecord file, while the `trim_sentence` function ensures consistency by limiting sentence length. A lower perplexity score indicates higher prediction confidence and generally better model performance, making these functions essential for comprehensive evaluation within the LLM pipeline.\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tkXaZlsg38jI"
+ },
+ "outputs": [],
+ "source": [
+ "\"\"\"This is an evaluation component for the LLM pipeline takes in a\n",
+ "standard trainer artifact and outputs a custom evaluation artifact.\n",
+ "It displays the evaluation output in the colab notebook.\n",
+ "\"\"\"\n",
+ "import os\n",
+ "import time\n",
+ "import keras_nlp\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import tensorflow as tf\n",
+ "import tfx.v1 as tfx\n",
+ "\n",
+ "def input_fn(file_pattern: str) -> list:\n",
+ " \"\"\"Retrieves training data and returns a list of articles for training.\n",
+ "\n",
+ " Args:\n",
+ " file_pattern: Path to the TFRecord file of the training dataset.\n",
+ "\n",
+ " Returns:\n",
+ " A list of test articles\n",
+ "\n",
+ " Raises:\n",
+ " FileNotFoundError: If the file path does not exist.\n",
+ " \"\"\"\n",
+ " if os.path.exists(file_pattern):\n",
+ " file_paths = [os.path.join(file_pattern, name) for name in os.listdir(file_pattern)]\n",
+ " test_articles = []\n",
+ " parsed_dataset = tf.data.TFRecordDataset(file_paths, compression_type=\"GZIP\")\n",
+ " for raw_record in parsed_dataset:\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(raw_record.numpy())\n",
+ " test_articles.append(\n",
+ " example.features.feature[\"summary\"].bytes_list.value[0].decode('utf-8')\n",
+ " )\n",
+ " return test_articles\n",
+ " else:\n",
+ " raise FileNotFoundError(f'File path \"{file_pattern}\" does not exist.')\n",
+ "\n",
+ "def trim_sentence(sentence: str, max_words: int = 20):\n",
+ " \"\"\"Trims the sentence to include up to the given number of words.\n",
+ "\n",
+ " Args:\n",
+ " sentence: The sentence to trim.\n",
+ " max_words: The maximum number of words to include in the trimmed sentence.\n",
+ "\n",
+ " Returns:\n",
+ " The trimmed sentence.\n",
+ " \"\"\"\n",
+ " words = sentence.split(' ')\n",
+ " if len(words) <= max_words:\n",
+ " return sentence\n",
+ " return ' '.join(words[:max_words])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ypRrAQMpfEFd"
+ },
+ "source": [
+ "![perplexity.png](images/gpt2_fine_tuning_and_conversion/perplexity.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yo5fvOa9GmzL"
+ },
+ "source": [
+ "One of the useful metrics for evaluating a Large Language Model is **Perplexity**. Perplexity is a measure of how well a language model predicts the next token in a sequence. It is calculated by taking the exponentiation of the average negative log-likelihood of the next token. A lower perplexity score indicates that the language model is better at predicting the next token.\n",
+ "\n",
+ "This is the *formula* for calculating perplexity.\n",
+ "\n",
+ " $\\text{Perplexity} = \\exp(-1 * $ Average Negative Log Likelihood $) =\n",
+ " \\exp\\left(-\\frac{1}{T} \\sum_{t=1}^T \\log p(w_t | w_{ int:\n",
+ " \"\"\"Calculates perplexity of a model given a sentence.\n",
+ "\n",
+ " Args:\n",
+ " gpt2_model: GPT-2 Language Model\n",
+ " gpt2_tokenizer: A GPT-2 tokenizer using Byte-Pair Encoding subword segmentation.\n",
+ " sentence: Sentence that the model's perplexity is calculated upon.\n",
+ "\n",
+ " Returns:\n",
+ " A perplexity score.\n",
+ " \"\"\"\n",
+ " # gpt2_tokenizer([sentence])[0] produces a tensor containing an array of tokens that form the sentence.\n",
+ " tokens = gpt2_tokenizer([sentence])[0].numpy()\n",
+ " # decoded_sentences is an array containing sentences that increase by one token in size.\n",
+ " # e.g. if tokens for a sentence \"I love dogs\" are [\"I\", \"love\", \"dogs\"], then decoded_sentences = [\"I love\", \"I love dogs\"]\n",
+ " decoded_sentences = [gpt2_tokenizer.detokenize([tokens[:i]])[0].numpy() for i in range(1, len(tokens))]\n",
+ " predictions = gpt2_model.predict(decoded_sentences)\n",
+ " logits = [predictions[i - 1][i] for i in range(1, len(tokens))]\n",
+ " target = tokens[1:].reshape(len(tokens) - 1, 1)\n",
+ " perplexity = keras_nlp.metrics.Perplexity(from_logits=True)\n",
+ " perplexity.update_state(target, logits)\n",
+ " result = perplexity.result()\n",
+ " return result.numpy()\n",
+ "\n",
+ "def average_perplexity(gpt2_model, gpt2_tokenizer, sentences):\n",
+ " perplexity_lst = [calculate_perplexity(gpt2_model, gpt2_tokenizer, sent) for sent in sentences]\n",
+ " return np.mean(perplexity_lst)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ELmkaY-ygbog"
+ },
+ "source": [
+ "## Evaluator\n",
+ "\n",
+ "Having established the necessary helper functions for evaluation, we proceed to define the Evaluator component. This component facilitates model inference using both base and fine-tuned models, computes perplexity scores for all models, and measures inference time. The Evaluator's output provides comprehensive insights for a thorough comparison and assessment of each model's performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Eb5fD5vzEQJ0"
+ },
+ "outputs": [],
+ "source": [
+ "@tfx.dsl.components.component\n",
+ "def Evaluator(\n",
+ " examples: tfx.dsl.components.InputArtifact[\n",
+ " tfx.types.standard_artifacts.Examples\n",
+ " ],\n",
+ " trained_model: tfx.dsl.components.InputArtifact[\n",
+ " tfx.types.standard_artifacts.Model\n",
+ " ],\n",
+ " max_length: tfx.dsl.components.Parameter[int],\n",
+ " evaluation: tfx.dsl.components.OutputArtifact[EvaluationMetric],\n",
+ ") -> None:\n",
+ " \"\"\"Makes inferences with the base model and the finetuned model.\n",
+ "\n",
+ " Args:\n",
+ " examples: Standard TFX examples artifact for retrieving the test dataset.\n",
+ " trained_model: Standard TFX model artifact finetuned on the imdb-reviews\n",
+ " dataset.\n",
+ " max_length: Length of the text that the model generates given custom input\n",
+ " statements.\n",
+ " evaluation: An evaluation artifact that saves the predicted outcomes of\n",
+ " custom inputs in a csv document and the inference speed of the model.\n",
+ " \"\"\"\n",
+ " _TEST_SIZE = 10\n",
+ " _INPUT_LENGTH = 10\n",
+ " _SEQUENCE_LENGTH = 128\n",
+ "\n",
+ " path = os.path.join(examples.uri, 'Split-eval')\n",
+ " test_data = input_fn(path)\n",
+ " evaluation_inputs = [\n",
+ " trim_sentence(article, max_words=_INPUT_LENGTH)\n",
+ " for article in test_data[:_TEST_SIZE]\n",
+ " ]\n",
+ " true_test = [\n",
+ " trim_sentence(article, max_words=max_length)\n",
+ " for article in test_data[:_TEST_SIZE]\n",
+ " ]\n",
+ "\n",
+ " # Loading base model, making inference, and calculating perplexity on the base model.\n",
+ " gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(\n",
+ " 'gpt2_base_en',\n",
+ " sequence_length=_SEQUENCE_LENGTH,\n",
+ " add_end_token=True,\n",
+ " )\n",
+ " gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(\n",
+ " 'gpt2_base_en', preprocessor=gpt2_preprocessor\n",
+ " )\n",
+ " gpt2_tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset('gpt2_base_en')\n",
+ "\n",
+ " base_average_perplexity = average_perplexity(\n",
+ " gpt2_lm, gpt2_tokenizer, true_test\n",
+ " )\n",
+ "\n",
+ " start_base_model = time.time()\n",
+ " base_evaluation = [\n",
+ " gpt2_lm.generate(prompt, max_length)\n",
+ " for prompt in evaluation_inputs\n",
+ " ]\n",
+ " end_base_model = time.time()\n",
+ "\n",
+ " # Loading finetuned model and making inferences with the finetuned model.\n",
+ " model_weights_path = os.path.join(\n",
+ " trained_model.uri, \"Format-Serving\", \"model_weights.weights.h5\"\n",
+ " )\n",
+ " gpt2_lm.load_weights(model_weights_path)\n",
+ "\n",
+ " trained_model_average_perplexity = average_perplexity(\n",
+ " gpt2_lm, gpt2_tokenizer, true_test\n",
+ " )\n",
+ "\n",
+ " start_trained = time.time()\n",
+ " trained_evaluation = [\n",
+ " gpt2_lm.generate(prompt, max_length)\n",
+ " for prompt in evaluation_inputs\n",
+ " ]\n",
+ " end_trained = time.time()\n",
+ "\n",
+ " # Building an inference table.\n",
+ " inference_data = {\n",
+ " 'input': evaluation_inputs,\n",
+ " 'actual_test_output': true_test,\n",
+ " 'base_model_prediction': base_evaluation,\n",
+ " 'trained_model_prediction': trained_evaluation,\n",
+ " }\n",
+ "\n",
+ " models = [\n",
+ " 'Base Model',\n",
+ " 'Finetuned Model',\n",
+ " ]\n",
+ " inference_time = [\n",
+ " (end_base_model - start_base_model),\n",
+ " (end_trained - start_trained),\n",
+ " ]\n",
+ " average_inference_time = [t / _TEST_SIZE for t in inference_time]\n",
+ " average_perplexity_lst = [\n",
+ " base_average_perplexity,\n",
+ " trained_model_average_perplexity,\n",
+ " ]\n",
+ " evaluation_data = {\n",
+ " 'Model': models,\n",
+ " 'Average Inference Time (sec)': average_inference_time,\n",
+ " 'Average Perplexity': average_perplexity_lst,\n",
+ " }\n",
+ "\n",
+ " # Creating a directory in the evaluation artifact to save metric dataframes.\n",
+ " metrics_path = os.path.join(evaluation.uri, 'metrics')\n",
+ " if not os.path.exists(metrics_path):\n",
+ " os.mkdir(metrics_path)\n",
+ "\n",
+ " evaluation_df = pd.DataFrame(evaluation_data).set_index('Model').transpose()\n",
+ " evaluation_path = os.path.join(metrics_path, 'evaluation_output.csv')\n",
+ " evaluation_df.to_csv(evaluation_path)\n",
+ "\n",
+ " inference_df = pd.DataFrame(inference_data)\n",
+ " inference_path = os.path.join(metrics_path, 'inference_output.csv')\n",
+ " inference_df.to_csv(inference_path)\n",
+ " evaluation.model_evaluation_output_path = inference_path"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "UkC0RrleWP9O"
+ },
+ "outputs": [],
+ "source": [
+ "evaluator = Evaluator(examples=preprocessor.outputs['transformed_examples'],\n",
+ " trained_model=trainer.outputs['model'],\n",
+ " max_length=50)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "KQQvbT96XXDT"
+ },
+ "outputs": [],
+ "source": [
+ "context.run(evaluator, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xVUIimCogdjZ"
+ },
+ "source": [
+ "### Evaluator Results"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EPKArU8f3FpD"
+ },
+ "source": [
+ "Once the Evaluator component has finished executing, we load the evaluation metrics from the evaluator's output URI and display them.\n",
+ "\n",
+ "\n",
+ "*Note:*\n",
+ "\n",
+ "**Perplexity Calculation:**\n",
+ "*Perplexity is only one of many ways to evaluate LLMs. LLM evaluation is an [active research topic](https://arxiv.org/abs/2307.03109) and a comprehensive treatment is beyond the scope of this notebook.*"
+ ]
+ },
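As a quick intuition for what `keras_nlp.metrics.Perplexity` reports, perplexity is the exponential of the mean per-token cross-entropy. Below is a minimal NumPy sketch of that definition (a hypothetical helper for illustration only; it ignores the padding masks and batching that the real Keras metric supports):

```python
import numpy as np

def perplexity_from_logits(logits, targets):
    """Perplexity = exp(mean per-token cross-entropy).

    logits: (seq_len, vocab) array of unnormalized scores.
    targets: (seq_len,) array of integer token ids.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Log-softmax computed in a numerically stable way.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token under the model.
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))
```

For uniform logits over a vocabulary of size V, every token has probability 1/V, so the perplexity is exactly V; lower values mean the model assigns higher probability to the observed tokens.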
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "NVv5F_Ok7Jss"
+ },
+ "outputs": [],
+ "source": [
+ "evaluation_path = os.path.join(evaluator.outputs['evaluation']._artifacts[0].uri, 'metrics')\n",
+ "inference_df = pd.read_csv(os.path.join(evaluation_path, 'inference_output.csv'), index_col=0)\n",
+ "evaluation_df = pd.read_csv(os.path.join(evaluation_path, 'evaluation_output.csv'), index_col=0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qndIFspM9ELf"
+ },
+ "source": [
+ "The fine-tuned GPT-2 model exhibits a slight improvement in perplexity compared to the baseline model. Further training with more epochs or a larger dataset may yield more substantial perplexity reductions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XvtAnvrm6H-a"
+ },
+ "outputs": [],
+ "source": [
+ "from IPython import display\n",
+ "display.display(display.HTML(inference_df.to_html()))\n",
+ "display.display(display.HTML(evaluation_df.to_html()))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RiCy6OQ7J3C5"
+ },
+ "source": [
+ "# Running the Entire Pipeline"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AJmAdbO9AWpx"
+ },
+ "source": [
+ "*Note: Running the section below requires a more substantial amount of GPU memory. Therefore, Colab Pro or a local machine equipped with a higher-capacity GPU is recommended for running the pipeline below.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kvYtjmkFHSxu"
+ },
+ "source": [
+ "TFX supports multiple orchestrators to run pipelines. In this tutorial we will use LocalDagRunner, which is included in the TFX Python package and runs pipelines in a local environment. TFX pipelines are often called \"DAGs\", short for directed acyclic graphs.\n",
+ "\n",
+ "LocalDagRunner provides fast iteration for development and debugging. TFX also supports other orchestrators, including Kubeflow Pipelines and Apache Airflow, which are suitable for production use cases. See the [TFX on Cloud AI Platform Pipelines](/tutorials/tfx/cloud-ai-platform-pipelines) or [TFX Airflow](/tutorials/tfx/airflow_workshop) tutorial to learn more about other orchestration systems.\n",
+ "\n",
+ "Now we create a LocalDagRunner and pass it a Pipeline object created by the function we defined above. The pipeline runs directly, and you can see logs showing its progress, including ML model training."
+ ]
+ },
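To make the "DAG" framing concrete, the execution order an orchestrator derives from artifact dependencies can be sketched with a topological sort. The component names and dependency edges below are a hand-written mirror of the pipeline defined later in this tutorial (the real LocalDagRunner resolves ordering from each component's input/output channels, not from a table like this):

```python
from collections import deque

# Assumed dependency edges, mirroring _create_pipeline below:
# each component lists the components whose outputs it consumes.
deps = {
    'example_gen': [],
    'statistics_gen': ['example_gen'],
    'schema_gen': ['statistics_gen'],
    'example_validator': ['statistics_gen', 'schema_gen'],
    'preprocessor': ['example_gen', 'schema_gen'],
    'trainer': ['preprocessor', 'schema_gen'],
    'evaluator': ['preprocessor', 'trainer'],
}

def topological_order(deps):
    """Kahn's algorithm: repeatedly emit nodes whose dependencies are done."""
    remaining = {node: set(parents) for node, parents in deps.items()}
    order = []
    ready = deque(sorted(n for n, p in remaining.items() if not p))
    while ready:
        node = ready.popleft()
        order.append(node)
        # A node becomes ready once its last unfinished dependency completes.
        for other, parents in remaining.items():
            if node in parents:
                parents.remove(node)
                if not parents and other not in order and other not in ready:
                    ready.append(other)
    return order
```

Running `topological_order(deps)` always starts with `example_gen` and ends with `evaluator`, which is why the final log line of a successful run comes from the Evaluator component.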
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4FQgyxOQLn22"
+ },
+ "outputs": [],
+ "source": [
+ "import urllib.request\n",
+ "import tempfile\n",
+ "import os\n",
+ "\n",
+ "PIPELINE_NAME = \"tfx-llm-imdb-reviews\"\n",
+ "model_fn = \"modules.model.run_fn\"\n",
+ "_transform_module_file = \"modules/_transform_module.py\"\n",
+ "\n",
+ "# Output directory to store artifacts generated from the pipeline.\n",
+ "PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)\n",
+ "# Path to a SQLite DB file to use as an MLMD storage.\n",
+ "METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')\n",
+ "# Output directory where created models from the pipeline will be exported.\n",
+ "SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)\n",
+ "\n",
+ "from absl import logging\n",
+ "logging.set_verbosity(logging.INFO) # Set default logging level."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tgTwBpN-pe3_"
+ },
+ "outputs": [],
+ "source": [
+ "def _create_pipeline(\n",
+ " pipeline_name: str,\n",
+ " pipeline_root: str,\n",
+ " model_fn: str,\n",
+ " serving_model_dir: str,\n",
+ " metadata_path: str,\n",
+ ") -> tfx.dsl.Pipeline:\n",
+ " \"\"\"Creates a Pipeline for Fine-Tuning and Converting a Large Language Model with TFX.\"\"\"\n",
+ "\n",
+ " example_gen = FileBasedExampleGen(\n",
+ " input_base='dummy',\n",
+ " custom_config={'dataset':'imdb_reviews', 'split':'train[:5%]'},\n",
+ " custom_executor_spec=executor_spec.BeamExecutorSpec(TFDSExecutor))\n",
+ "\n",
+ " statistics_gen = tfx.components.StatisticsGen(\n",
+ " examples=example_gen.outputs['examples'], exclude_splits=['eval']\n",
+ " )\n",
+ "\n",
+ " schema_gen = tfx.components.SchemaGen(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " infer_feature_shape=False,\n",
+ " exclude_splits=['eval'],\n",
+ " )\n",
+ "\n",
+ " example_validator = tfx.components.ExampleValidator(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " exclude_splits=['eval'],\n",
+ " )\n",
+ "\n",
+ " preprocessor = tfx.components.Transform(\n",
+ " examples=example_gen.outputs['examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " module_file=_transform_module_file,\n",
+ " )\n",
+ "\n",
+ " trainer = tfx.components.Trainer(\n",
+ " run_fn=model_fn,\n",
+ " examples=preprocessor.outputs['transformed_examples'],\n",
+ " train_args=tfx.proto.TrainArgs(splits=['train']),\n",
+ " eval_args=tfx.proto.EvalArgs(splits=['train']),\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " )\n",
+ "\n",
+ "\n",
+ " evaluator = Evaluator(\n",
+ " examples=preprocessor.outputs['transformed_examples'],\n",
+ " trained_model=trainer.outputs['model'],\n",
+ " max_length=50,\n",
+ " )\n",
+ "\n",
+ " # The following seven components will be included in the pipeline.\n",
+ " components = [\n",
+ " example_gen,\n",
+ " statistics_gen,\n",
+ " schema_gen,\n",
+ " example_validator,\n",
+ " preprocessor,\n",
+ " trainer,\n",
+ " evaluator,\n",
+ " ]\n",
+ "\n",
+ " return tfx.dsl.Pipeline(\n",
+ " pipeline_name=pipeline_name,\n",
+ " pipeline_root=pipeline_root,\n",
+ " metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(\n",
+ " metadata_path\n",
+ " ),\n",
+ " components=components,\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "DkgLXyZGJ9CO"
+ },
+ "outputs": [],
+ "source": [
+ "tfx.orchestration.LocalDagRunner().run(\n",
+ " _create_pipeline(\n",
+ " pipeline_name=PIPELINE_NAME,\n",
+ " pipeline_root=PIPELINE_ROOT,\n",
+ " model_fn=model_fn,\n",
+ " serving_model_dir=SERVING_MODEL_DIR,\n",
+ " metadata_path=METADATA_PATH,\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Mo3Z08xzHa4G"
+ },
+ "source": [
+ "You should see \"INFO:absl:Component Evaluator is finished.\" at the end of the logs if the pipeline finished successfully, because the Evaluator component is the last component in the pipeline."
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "collapsed_sections": [
+ "iwgnKVaUuozP"
+ ],
+ "gpuType": "T4",
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}
diff --git a/docs/tutorials/tfx/neural_structured_learning.ipynb b/docs/tutorials/tfx/neural_structured_learning.ipynb
index 6011f258c3..89a8b01be1 100644
--- a/docs/tutorials/tfx/neural_structured_learning.ipynb
+++ b/docs/tutorials/tfx/neural_structured_learning.ipynb
@@ -1,55 +1,55 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "24gYiJcWNlpA"
- },
- "source": [
- "##### Copyright 2020 The TensorFlow Authors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ioaprt5q5US7"
- },
- "outputs": [],
- "source": [
- "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ItXfxkxvosLH"
- },
- "source": [
- "# Graph-based Neural Structured Learning in TFX\n",
- "\n",
- "This tutorial describes graph regularization from the\n",
- "[Neural Structured Learning](https://www.tensorflow.org/neural_structured_learning/)\n",
- "framework and demonstrates an end-to-end workflow for sentiment classification\n",
- "in a TFX pipeline."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "vyAF26z9IDoq"
- },
- "source": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "24gYiJcWNlpA"
+ },
+ "source": [
+ "##### Copyright 2020 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ioaprt5q5US7"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ItXfxkxvosLH"
+ },
+ "source": [
+ "# Graph-based Neural Structured Learning in TFX\n",
+ "\n",
+ "This tutorial describes graph regularization from the\n",
+ "[Neural Structured Learning](https://www.tensorflow.org/neural_structured_learning/)\n",
+ "framework and demonstrates an end-to-end workflow for sentiment classification\n",
+ "in a TFX pipeline."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vyAF26z9IDoq"
+ },
+ "source": [
"Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n",
"\n",
"\n",
@@ -86,1857 +86,1857 @@
" \n",
"
"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-niht8EPmUUl"
- },
- "source": [
- "\u003e Warning: Estimators are not recommended for new code. Estimators run \u003ca href=\\\"https://www.tensorflow.org/api_docs/python/tf/compat/v1/Session\\\"\u003e\u003ccode\u003ev1.Session\u003c/code\u003e\u003c/a\u003e-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our [compatibility guarantees](https://tensorflow.org/guide/versions), but will receive no fixes other than security vulnerabilities. See the [migration guide](https://tensorflow.org/guide/migrate) for details."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "z3otbdCMmJiJ"
- },
- "source": [
- "## Overview"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ApxPtg2DiTtd"
- },
- "source": [
- "This notebook classifies movie reviews as *positive* or *negative* using the\n",
- "text of the review. This is an example of *binary* classification, an important\n",
- "and widely applicable kind of machine learning problem.\n",
- "\n",
- "We will demonstrate the use of graph regularization in this notebook by building\n",
- "a graph from the given input. The general recipe for building a\n",
- "graph-regularized model using the Neural Structured Learning (NSL) framework\n",
- "when the input does not contain an explicit graph is as follows:\n",
- "\n",
- "1. Create embeddings for each text sample in the input. This can be done using\n",
- " pre-trained models such as [word2vec](https://arxiv.org/pdf/1310.4546.pdf),\n",
- " [Swivel](https://arxiv.org/abs/1602.02215),\n",
- " [BERT](https://arxiv.org/abs/1810.04805) etc.\n",
- "2. Build a graph based on these embeddings by using a similarity metric such as\n",
- " the 'L2' distance, 'cosine' distance, etc. Nodes in the graph correspond to\n",
- " samples and edges in the graph correspond to similarity between pairs of\n",
- " samples.\n",
- "3. Generate training data from the above synthesized graph and sample features.\n",
- " The resulting training data will contain neighbor features in addition to\n",
- " the original node features.\n",
- "4. Create a neural network as a base model using Estimators.\n",
- "5. Wrap the base model with the `add_graph_regularization` wrapper function,\n",
- " which is provided by the NSL framework, to create a new graph Estimator\n",
- " model. This new model will include a graph regularization loss as the\n",
- " regularization term in its training objective.\n",
- "6. Train and evaluate the graph Estimator model.\n",
- "\n",
- "In this tutorial, we integrate the above workflow in a TFX pipeline using\n",
- "several custom TFX components as well as a custom graph-regularized trainer\n",
- "component.\n",
- "\n",
- "Below is the schematic for our TFX pipeline. Orange boxes represent\n",
- "off-the-shelf TFX components and pink boxes represent custom TFX components.\n",
- "\n",
- "![TFX Pipeline](images/nsl/nsl-tfx.svg)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "EIx0r9-TeVQQ"
- },
- "source": [
- "## Upgrade Pip\n",
- "\n",
- "To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "-UmVrHUfkUA2"
- },
- "outputs": [],
- "source": [
- "import sys\n",
- "if 'google.colab' in sys.modules:\n",
- " !pip install --upgrade pip"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nDOFbB34KY1R"
- },
- "source": [
- "## Install Required Packages"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "yDUe7gk_ztZ-"
- },
- "outputs": [],
- "source": [
- "# TFX has a constraint of 1.16 due to the removal of tf.estimator support.\n",
- "!pip install -q \\\n",
- " \"tfx\u003c1.16\" \\\n",
- " neural-structured-learning \\\n",
- " tensorflow-hub \\\n",
- " tensorflow-datasets"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "1CeGS8G_eueJ"
- },
- "source": [
- "## Did you restart the runtime?\n",
- "\n",
- "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime \u003e Restart runtime ...). This is because of the way that Colab loads packages."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "x6FJ64qMNLez"
- },
- "source": [
- "## Dependencies and imports"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2ew7HTbPpCJH"
- },
- "outputs": [],
- "source": [
- "import apache_beam as beam\n",
- "import gzip as gzip_lib\n",
- "import numpy as np\n",
- "import os\n",
- "import pprint\n",
- "import shutil\n",
- "import tempfile\n",
- "import urllib\n",
- "import uuid\n",
- "pp = pprint.PrettyPrinter()\n",
- "\n",
- "import tensorflow as tf\n",
- "import neural_structured_learning as nsl\n",
- "\n",
- "import tfx\n",
- "from tfx.components.evaluator.component import Evaluator\n",
- "from tfx.components.example_gen.import_example_gen.component import ImportExampleGen\n",
- "from tfx.components.example_validator.component import ExampleValidator\n",
- "from tfx.components.model_validator.component import ModelValidator\n",
- "from tfx.components.pusher.component import Pusher\n",
- "from tfx.components.schema_gen.component import SchemaGen\n",
- "from tfx.components.statistics_gen.component import StatisticsGen\n",
- "from tfx.components.trainer import executor as trainer_executor\n",
- "from tfx.components.trainer.component import Trainer\n",
- "from tfx.components.transform.component import Transform\n",
- "from tfx.dsl.components.base import executor_spec\n",
- "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
- "from tfx.proto import evaluator_pb2\n",
- "from tfx.proto import example_gen_pb2\n",
- "from tfx.proto import pusher_pb2\n",
- "from tfx.proto import trainer_pb2\n",
- "\n",
- "from tfx.types import artifact\n",
- "from tfx.types import artifact_utils\n",
- "from tfx.types import channel\n",
- "from tfx.types import standard_artifacts\n",
- "from tfx.types.standard_artifacts import Examples\n",
- "\n",
- "from tfx.dsl.component.experimental.annotations import InputArtifact\n",
- "from tfx.dsl.component.experimental.annotations import OutputArtifact\n",
- "from tfx.dsl.component.experimental.annotations import Parameter\n",
- "from tfx.dsl.component.experimental.decorators import component\n",
- "\n",
- "from tensorflow_metadata.proto.v0 import anomalies_pb2\n",
- "from tensorflow_metadata.proto.v0 import schema_pb2\n",
- "from tensorflow_metadata.proto.v0 import statistics_pb2\n",
- "\n",
- "import tensorflow_data_validation as tfdv\n",
- "import tensorflow_transform as tft\n",
- "import tensorflow_model_analysis as tfma\n",
- "import tensorflow_hub as hub\n",
- "import tensorflow_datasets as tfds\n",
- "\n",
- "print(\"TF Version: \", tf.__version__)\n",
- "print(\"Eager mode: \", tf.executing_eagerly())\n",
- "print(\n",
- " \"GPU is\",\n",
- " \"available\" if tf.config.list_physical_devices(\"GPU\") else \"NOT AVAILABLE\")\n",
- "print(\"NSL Version: \", nsl.__version__)\n",
- "print(\"TFX Version: \", tfx.__version__)\n",
- "print(\"TFDV version: \", tfdv.__version__)\n",
- "print(\"TFT version: \", tft.__version__)\n",
- "print(\"TFMA version: \", tfma.__version__)\n",
- "print(\"Hub version: \", hub.__version__)\n",
- "print(\"Beam version: \", beam.__version__)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nGwwFd99n42P"
- },
- "source": [
- "## IMDB dataset\n",
- "\n",
- "The\n",
- "[IMDB dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews)\n",
- "contains the text of 50,000 movie reviews from the\n",
- "[Internet Movie Database](https://www.imdb.com/). These are split into 25,000\n",
- "reviews for training and 25,000 reviews for testing. The training and testing\n",
- "sets are *balanced*, meaning they contain an equal number of positive and\n",
- "negative reviews.\n",
- "Moreover, there are 50,000 additional unlabeled movie reviews."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "iAsKG535pHep"
- },
- "source": [
- "### Download preprocessed IMDB dataset\n",
- "\n",
- "The following code downloads the IMDB dataset (or uses a cached copy if it has already been downloaded) using TFDS. To speed up this notebook we will use only 10,000 labeled reviews and 10,000 unlabeled reviews for training, and 10,000 test reviews for evaluation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "__cZi2Ic48KL"
- },
- "outputs": [],
- "source": [
- "train_set, eval_set = tfds.load(\n",
- " \"imdb_reviews:1.0.0\",\n",
- " split=[\"train[:10000]+unsupervised[:10000]\", \"test[:10000]\"],\n",
- " shuffle_files=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nE9tNh-67Y3W"
- },
- "source": [
- "Let's look at a few reviews from the training set:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "LsnHde8T67Jz"
- },
- "outputs": [],
- "source": [
- "for tfrecord in train_set.take(4):\n",
- " print(\"Review: {}\".format(tfrecord[\"text\"].numpy().decode(\"utf-8\")[:300]))\n",
- " print(\"Label: {}\\n\".format(tfrecord[\"label\"].numpy()))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "0wG7v3rk-Cwo"
- },
- "outputs": [],
- "source": [
- "def _dict_to_example(instance):\n",
- " \"\"\"Decoded CSV to tf example.\"\"\"\n",
- " feature = {}\n",
- " for key, value in instance.items():\n",
- " if value is None:\n",
- " feature[key] = tf.train.Feature()\n",
- " elif value.dtype == np.integer:\n",
- " feature[key] = tf.train.Feature(\n",
- " int64_list=tf.train.Int64List(value=value.tolist()))\n",
- " elif value.dtype == np.float32:\n",
- " feature[key] = tf.train.Feature(\n",
- " float_list=tf.train.FloatList(value=value.tolist()))\n",
- " else:\n",
- " feature[key] = tf.train.Feature(\n",
- " bytes_list=tf.train.BytesList(value=value.tolist()))\n",
- " return tf.train.Example(features=tf.train.Features(feature=feature))\n",
- "\n",
- "\n",
- "examples_path = tempfile.mkdtemp(prefix=\"tfx-data\")\n",
- "train_path = os.path.join(examples_path, \"train.tfrecord\")\n",
- "eval_path = os.path.join(examples_path, \"eval.tfrecord\")\n",
- "\n",
- "for path, dataset in [(train_path, train_set), (eval_path, eval_set)]:\n",
- " with tf.io.TFRecordWriter(path) as writer:\n",
- " for example in dataset:\n",
- " writer.write(\n",
- " _dict_to_example({\n",
- " \"label\": np.array([example[\"label\"].numpy()]),\n",
- " \"text\": np.array([example[\"text\"].numpy()]),\n",
- " }).SerializeToString())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HdQWxfsVkzdJ"
- },
- "source": [
- "## Run TFX Components Interactively\n",
- "\n",
- "In the cells that follow you will construct TFX components and run each one interactively within the InteractiveContext to obtain `ExecutionResult` objects. This mirrors the process of an orchestrator running components in a TFX DAG based on when the dependencies for each component are met."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4aVuXUil7hil"
- },
- "outputs": [],
- "source": [
- "context = InteractiveContext()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "L9fwt9gQk3BR"
- },
- "source": [
- "### The ExampleGen Component\n",
- "In any ML development process the first step when starting code development is to ingest the training and test datasets. The `ExampleGen` component brings data into the TFX pipeline.\n",
- "\n",
- "Create an ExampleGen component and run it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "WdH4ql3Y7pT4"
- },
- "outputs": [],
- "source": [
- "input_config = example_gen_pb2.Input(splits=[\n",
- " example_gen_pb2.Input.Split(name='train', pattern='train.tfrecord'),\n",
- " example_gen_pb2.Input.Split(name='eval', pattern='eval.tfrecord')\n",
- "])\n",
- "\n",
- "example_gen = ImportExampleGen(input_base=examples_path, input_config=input_config)\n",
- "\n",
- "context.run(example_gen, enable_cache=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "IeUp6xCCrxsS"
- },
- "outputs": [],
- "source": [
- "for artifact in example_gen.outputs['examples'].get():\n",
- " print(artifact)\n",
- "\n",
- "print('\\nexample_gen.outputs is a {}'.format(type(example_gen.outputs)))\n",
- "print(example_gen.outputs)\n",
- "\n",
- "print(example_gen.outputs['examples'].get()[0].split_names)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "0SXc2OGnDWz5"
- },
- "source": [
- "The component's outputs include 2 artifacts:\n",
- "* the training examples (10,000 labeled reviews + 10,000 unlabeled reviews)\n",
- "* the eval examples (10,000 labeled reviews)\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "pcPppPASQzFa"
- },
- "source": [
- "### The IdentifyExamples Custom Component\n",
- "To use NSL, we will need each instance to have a unique ID. We create a custom\n",
- "component that adds such a unique ID to all instances across all splits. We\n",
- "leverage [Apache Beam](https://beam.apache.org) to be able to easily scale to\n",
- "large datasets if needed."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XHCUzXA5qeWe"
- },
- "outputs": [],
- "source": [
- "def make_example_with_unique_id(example, id_feature_name):\n",
- " \"\"\"Adds a unique ID to the given `tf.train.Example` proto.\n",
- "\n",
- " This function uses Python's 'uuid' module to generate a universally unique\n",
- " identifier for each example.\n",
- "\n",
- " Args:\n",
- " example: An instance of a `tf.train.Example` proto.\n",
- " id_feature_name: The name of the feature in the resulting `tf.train.Example`\n",
- " that will contain the unique identifier.\n",
- "\n",
- " Returns:\n",
- " A new `tf.train.Example` proto that includes a unique identifier as an\n",
- " additional feature.\n",
- " \"\"\"\n",
- " result = tf.train.Example()\n",
- " result.CopyFrom(example)\n",
- " unique_id = uuid.uuid4()\n",
- " result.features.feature.get_or_create(\n",
- " id_feature_name).bytes_list.MergeFrom(\n",
- " tf.train.BytesList(value=[str(unique_id).encode('utf-8')]))\n",
- " return result\n",
- "\n",
- "\n",
- "@component\n",
- "def IdentifyExamples(orig_examples: InputArtifact[Examples],\n",
- " identified_examples: OutputArtifact[Examples],\n",
- " id_feature_name: Parameter[str],\n",
- " component_name: Parameter[str]) -\u003e None:\n",
- "\n",
- " # Get a list of the splits in input_data\n",
- " splits_list = artifact_utils.decode_split_names(\n",
- " split_names=orig_examples.split_names)\n",
- " # For completeness, encode the splits names and payload_format.\n",
- " # We could also just use input_data.split_names.\n",
- " identified_examples.split_names = artifact_utils.encode_split_names(\n",
- " splits=splits_list)\n",
- " # TODO(b/168616829): Remove populating payload_format after tfx 0.25.0.\n",
- " identified_examples.set_string_custom_property(\n",
- " \"payload_format\",\n",
- " orig_examples.get_string_custom_property(\"payload_format\"))\n",
- "\n",
- "\n",
- " for split in splits_list:\n",
- " input_dir = artifact_utils.get_split_uri([orig_examples], split)\n",
- " output_dir = artifact_utils.get_split_uri([identified_examples], split)\n",
- " os.mkdir(output_dir)\n",
- " with beam.Pipeline() as pipeline:\n",
- " (pipeline\n",
- " | 'ReadExamples' \u003e\u003e beam.io.ReadFromTFRecord(\n",
- " os.path.join(input_dir, '*'),\n",
- " coder=beam.coders.coders.ProtoCoder(tf.train.Example))\n",
- " | 'AddUniqueId' \u003e\u003e beam.Map(make_example_with_unique_id, id_feature_name)\n",
- " | 'WriteIdentifiedExamples' \u003e\u003e beam.io.WriteToTFRecord(\n",
- " file_path_prefix=os.path.join(output_dir, 'data_tfrecord'),\n",
- " coder=beam.coders.coders.ProtoCoder(tf.train.Example),\n",
- " file_name_suffix='.gz'))\n",
- "\n",
- " return"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ZtLxNWHPO0je"
- },
- "outputs": [],
- "source": [
- "identify_examples = IdentifyExamples(\n",
- " orig_examples=example_gen.outputs['examples'],\n",
- " component_name=u'IdentifyExamples',\n",
- " id_feature_name=u'id')\n",
- "context.run(identify_examples, enable_cache=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "csM6BFhtk5Aa"
- },
- "source": [
- "### The StatisticsGen Component\n",
- "\n",
- "The `StatisticsGen` component computes descriptive statistics for your dataset. The statistics that it generates can be visualized for review, and are used for example validation and to infer a schema.\n",
- "\n",
- "Create a StatisticsGen component and run it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MAscCCYWgA-9"
- },
- "outputs": [],
- "source": [
- "# Computes statistics over data for visualization and example validation.\n",
- "statistics_gen = StatisticsGen(\n",
- " examples=identify_examples.outputs[\"identified_examples\"])\n",
- "context.run(statistics_gen, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HLKLTO9Nk60p"
- },
- "source": [
- "### The SchemaGen Component\n",
- "\n",
- "The `SchemaGen` component generates a schema for your data based on the statistics from StatisticsGen. It tries to infer the data types of each of your features, and the ranges of legal values for categorical features.\n",
- "\n",
- "Create a SchemaGen component and run it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ygQvZ6hsiQ_J"
- },
- "outputs": [],
- "source": [
- "# Generates schema based on statistics files.\n",
- "schema_gen = SchemaGen(\n",
- " statistics=statistics_gen.outputs['statistics'], infer_feature_shape=False)\n",
- "context.run(schema_gen, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kdtU3u01FR-2"
- },
- "source": [
- "The generated artifact is just a `schema.pbtxt` containing a text representation of a `schema_pb2.Schema` protobuf:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "L6-tgKi6A_gK"
- },
- "outputs": [],
- "source": [
- "train_uri = schema_gen.outputs['schema'].get()[0].uri\n",
- "schema_filename = os.path.join(train_uri, 'schema.pbtxt')\n",
- "schema = tfx.utils.io_utils.parse_pbtxt_file(\n",
- " file_name=schema_filename, message=schema_pb2.Schema())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FaSgx5qIFelw"
- },
- "source": [
- "It can be visualized using `tfdv.display_schema()` (we will look at this in more detail in a subsequent lab):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "gycOsJIQFhi3"
- },
- "outputs": [],
- "source": [
- "tfdv.display_schema(schema)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "V1qcUuO9k9f8"
- },
- "source": [
- "### The ExampleValidator Component\n",
- "\n",
- "The `ExampleValidator` performs anomaly detection, based on the statistics from StatisticsGen and the schema from SchemaGen. It looks for problems such as missing values, values of the wrong type, or categorical values outside of the domain of acceptable values.\n",
- "\n",
- "Create an ExampleValidator component and run it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XRlRUuGgiXks"
- },
- "outputs": [],
- "source": [
- "# Performs anomaly detection based on statistics and data schema.\n",
- "validate_stats = ExampleValidator(\n",
- " statistics=statistics_gen.outputs['statistics'],\n",
- " schema=schema_gen.outputs['schema'])\n",
- "context.run(validate_stats, enable_cache=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "g3f2vmrF_e9b"
- },
- "source": [
- "### The SynthesizeGraph Component"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3oCuXo4BPfGr"
- },
- "source": [
- "Graph construction involves creating embeddings for text samples and then using\n",
- "a similarity function to compare the embeddings."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Gf8B3KxcinZ0"
- },
- "source": [
- "We will use pretrained Swivel embeddings to create embeddings in the\n",
- "`tf.train.Example` format for each sample in the input. We will store the\n",
- "resulting embeddings in the `TFRecord` format along with the sample's ID.\n",
- "This is important and will allow us match sample embeddings with corresponding\n",
- "nodes in the graph later."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_hSzZNdbPa4X"
- },
- "source": [
- "Once we have the sample embeddings, we will use them to build a similarity\n",
- "graph, i.e, nodes in this graph will correspond to samples and edges in this\n",
- "graph will correspond to similarity between pairs of nodes.\n",
- "\n",
- "Neural Structured Learning provides a graph building library to build a graph\n",
- "based on sample embeddings. It uses **cosine similarity** as the similarity\n",
- "measure to compare embeddings and build edges between them. It also allows us to specify a similarity threshold, which can be used to discard dissimilar edges from the final graph. In the following example, using 0.99 as the similarity threshold, we end up with a graph that has 111,066 bi-directional edges."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nERXNfSWPa4Z"
- },
- "source": [
- "**Note:** Graph quality and by extension, embedding quality, are very important\n",
- "for graph regularization. While we use Swivel embeddings in this notebook, using BERT embeddings for instance, will likely capture review semantics more\n",
- "accurately. We encourage users to use embeddings of their choice and as appropriate to their needs."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2bAttbhgPa4V"
- },
- "outputs": [],
- "source": [
- "swivel_url = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'\n",
- "hub_layer = hub.KerasLayer(swivel_url, input_shape=[], dtype=tf.string)\n",
- "\n",
- "\n",
- "def _bytes_feature(value):\n",
- " \"\"\"Returns a bytes_list from a string / byte.\"\"\"\n",
- " return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))\n",
- "\n",
- "\n",
- "def _float_feature(value):\n",
- " \"\"\"Returns a float_list from a float / double.\"\"\"\n",
- " return tf.train.Feature(float_list=tf.train.FloatList(value=value))\n",
- "\n",
- "\n",
- "def create_embedding_example(example):\n",
- " \"\"\"Create tf.Example containing the sample's embedding and its ID.\"\"\"\n",
- " sentence_embedding = hub_layer(tf.sparse.to_dense(example['text']))\n",
- "\n",
- " # Flatten the sentence embedding back to 1-D.\n",
- " sentence_embedding = tf.reshape(sentence_embedding, shape=[-1])\n",
- "\n",
- " feature_dict = {\n",
- " 'id': _bytes_feature(tf.sparse.to_dense(example['id']).numpy()),\n",
- " 'embedding': _float_feature(sentence_embedding.numpy().tolist())\n",
- " }\n",
- "\n",
- " return tf.train.Example(features=tf.train.Features(feature=feature_dict))\n",
- "\n",
- "\n",
- "def create_dataset(uri):\n",
- " tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
- " return tf.data.TFRecordDataset(tfrecord_filenames, compression_type='GZIP')\n",
- "\n",
- "\n",
- "def create_embeddings(train_path, output_path):\n",
- " dataset = create_dataset(train_path)\n",
- " embeddings_path = os.path.join(output_path, 'embeddings.tfr')\n",
- "\n",
- " feature_map = {\n",
- " 'label': tf.io.FixedLenFeature([], tf.int64),\n",
- " 'id': tf.io.VarLenFeature(tf.string),\n",
- " 'text': tf.io.VarLenFeature(tf.string)\n",
- " }\n",
- "\n",
- " with tf.io.TFRecordWriter(embeddings_path) as writer:\n",
- " for tfrecord in dataset:\n",
- " tensor_dict = tf.io.parse_single_example(tfrecord, feature_map)\n",
- " embedding_example = create_embedding_example(tensor_dict)\n",
- " writer.write(embedding_example.SerializeToString())\n",
- "\n",
- "\n",
- "def build_graph(output_path, similarity_threshold):\n",
- " embeddings_path = os.path.join(output_path, 'embeddings.tfr')\n",
- " graph_path = os.path.join(output_path, 'graph.tsv')\n",
- " graph_builder_config = nsl.configs.GraphBuilderConfig(\n",
- " similarity_threshold=similarity_threshold,\n",
- " lsh_splits=32,\n",
- " lsh_rounds=15,\n",
- " random_seed=12345)\n",
- " nsl.tools.build_graph_from_config([embeddings_path], graph_path,\n",
- " graph_builder_config)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ITkf2SLg1TG7"
- },
- "outputs": [],
- "source": [
- "\"\"\"Custom Artifact type\"\"\"\n",
- "\n",
- "\n",
- "class SynthesizedGraph(tfx.types.artifact.Artifact):\n",
- " \"\"\"Output artifact of the SynthesizeGraph component\"\"\"\n",
- " TYPE_NAME = 'SynthesizedGraphPath'\n",
- " PROPERTIES = {\n",
- " 'span': standard_artifacts.SPAN_PROPERTY,\n",
- " 'split_names': standard_artifacts.SPLIT_NAMES_PROPERTY,\n",
- " }\n",
- "\n",
- "\n",
- "@component\n",
- "def SynthesizeGraph(identified_examples: InputArtifact[Examples],\n",
- " synthesized_graph: OutputArtifact[SynthesizedGraph],\n",
- " similarity_threshold: Parameter[float],\n",
- " component_name: Parameter[str]) -\u003e None:\n",
- "\n",
- " # Get a list of the splits in input_data\n",
- " splits_list = artifact_utils.decode_split_names(\n",
- " split_names=identified_examples.split_names)\n",
- "\n",
- " # We build a graph only based on the 'Split-train' split which includes both\n",
- " # labeled and unlabeled examples.\n",
- " train_input_examples_uri = os.path.join(identified_examples.uri,\n",
- " 'Split-train')\n",
- " output_graph_uri = os.path.join(synthesized_graph.uri, 'Split-train')\n",
- " os.mkdir(output_graph_uri)\n",
- "\n",
- " print('Creating embeddings...')\n",
- " create_embeddings(train_input_examples_uri, output_graph_uri)\n",
- "\n",
- " print('Synthesizing graph...')\n",
- " build_graph(output_graph_uri, similarity_threshold)\n",
- "\n",
- " synthesized_graph.split_names = artifact_utils.encode_split_names(\n",
- " splits=['Split-train'])\n",
- "\n",
- " return"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "H0ZkHvJMA-0G"
- },
- "outputs": [],
- "source": [
- "synthesize_graph = SynthesizeGraph(\n",
- " identified_examples=identify_examples.outputs['identified_examples'],\n",
- " component_name=u'SynthesizeGraph',\n",
- " similarity_threshold=0.99)\n",
- "context.run(synthesize_graph, enable_cache=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "o54M-0Q11FcS"
- },
- "outputs": [],
- "source": [
- "train_uri = synthesize_graph.outputs[\"synthesized_graph\"].get()[0].uri\n",
- "os.listdir(train_uri)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "IRK_rS_q1UcZ"
- },
- "outputs": [],
- "source": [
- "graph_path = os.path.join(train_uri, \"Split-train\", \"graph.tsv\")\n",
- "print(\"node 1\\t\\t\\t\\t\\tnode 2\\t\\t\\t\\t\\tsimilarity\")\n",
- "!head {graph_path}\n",
- "print(\"...\")\n",
- "!tail {graph_path}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "uybqyWztvCGm"
- },
- "outputs": [],
- "source": [
- "!wc -l {graph_path}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "JPViEz5RlA36"
- },
- "source": [
- "### The Transform Component\n",
- "\n",
- "The `Transform` component performs data transformations and feature engineering. The results include an input TensorFlow graph which is used during both training and serving to preprocess the data before training or inference. This graph becomes part of the SavedModel that is the result of model training. Since the same input graph is used for both training and serving, the preprocessing will always be the same, and only needs to be written once.\n",
- "\n",
- "The Transform component requires more code than many other components because of the arbitrary complexity of the feature engineering that you may need for the data and/or model that you're working with. It requires code files to be available which define the processing needed."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_USkfut69gNW"
- },
- "source": [
- "Each sample will include the following three features:\n",
- "\n",
- "1. **id**: The node ID of the sample.\n",
- "2. **text_xf**: An int64 list containing word IDs.\n",
- "3. **label_xf**: A singleton int64 identifying the target class of the review: 0=negative, 1=positive."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "XUYeCayFG7kH"
- },
- "source": [
- "Let's define a module containing the `preprocessing_fn()` function that we will pass to the `Transform` component:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "7uuWiQbOG9ki"
- },
- "outputs": [],
- "source": [
- "_transform_module_file = 'imdb_transform.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "v3EIuVQnBfH7"
- },
- "outputs": [],
- "source": [
- "%%writefile {_transform_module_file}\n",
- "\n",
- "import tensorflow as tf\n",
- "\n",
- "import tensorflow_transform as tft\n",
- "\n",
- "SEQUENCE_LENGTH = 100\n",
- "VOCAB_SIZE = 10000\n",
- "OOV_SIZE = 100\n",
- "\n",
- "def tokenize_reviews(reviews, sequence_length=SEQUENCE_LENGTH):\n",
- " reviews = tf.strings.lower(reviews)\n",
- " reviews = tf.strings.regex_replace(reviews, r\" '| '|^'|'$\", \" \")\n",
- " reviews = tf.strings.regex_replace(reviews, \"[^a-z' ]\", \" \")\n",
- " tokens = tf.strings.split(reviews)[:, :sequence_length]\n",
- " start_tokens = tf.fill([tf.shape(reviews)[0], 1], \"\u003cSTART\u003e\")\n",
- " end_tokens = tf.fill([tf.shape(reviews)[0], 1], \"\u003cEND\u003e\")\n",
- " tokens = tf.concat([start_tokens, tokens, end_tokens], axis=1)\n",
- " tokens = tokens[:, :sequence_length]\n",
- " tokens = tokens.to_tensor(default_value=\"\u003cPAD\u003e\")\n",
- " pad = sequence_length - tf.shape(tokens)[1]\n",
- " tokens = tf.pad(tokens, [[0, 0], [0, pad]], constant_values=\"\u003cPAD\u003e\")\n",
- " return tf.reshape(tokens, [-1, sequence_length])\n",
- "\n",
- "def preprocessing_fn(inputs):\n",
- " \"\"\"tf.transform's callback function for preprocessing inputs.\n",
- "\n",
- " Args:\n",
- " inputs: map from feature keys to raw not-yet-transformed features.\n",
- "\n",
- " Returns:\n",
- " Map from string feature key to transformed feature operations.\n",
- " \"\"\"\n",
- " outputs = {}\n",
- " outputs[\"id\"] = inputs[\"id\"]\n",
- " tokens = tokenize_reviews(_fill_in_missing(inputs[\"text\"], ''))\n",
- " outputs[\"text_xf\"] = tft.compute_and_apply_vocabulary(\n",
- " tokens,\n",
- " top_k=VOCAB_SIZE,\n",
- " num_oov_buckets=OOV_SIZE)\n",
- " outputs[\"label_xf\"] = _fill_in_missing(inputs[\"label\"], -1)\n",
- " return outputs\n",
- "\n",
- "def _fill_in_missing(x, default_value):\n",
- " \"\"\"Replace missing values in a SparseTensor.\n",
- "\n",
- " Fills in missing values of `x` with the default_value.\n",
- "\n",
- " Args:\n",
- " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
- " in the second dimension.\n",
- " default_value: the value with which to replace the missing values.\n",
- "\n",
- " Returns:\n",
- " A rank 1 tensor where missing values of `x` have been filled in.\n",
- " \"\"\"\n",
- " if not isinstance(x, tf.sparse.SparseTensor):\n",
- " return x\n",
- " return tf.squeeze(\n",
- " tf.sparse.to_dense(\n",
- " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
- " default_value),\n",
- " axis=1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "eeMVMafpHHX1"
- },
- "source": [
- "Create and run the `Transform` component, referring to the files that were created above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "jHfhth_GiZI9"
- },
- "outputs": [],
- "source": [
- "# Performs transformations and feature engineering in training and serving.\n",
- "transform = Transform(\n",
- " examples=identify_examples.outputs['identified_examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " module_file=_transform_module_file)\n",
- "context.run(transform, enable_cache=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_jbZO1ykHOeG"
- },
- "source": [
- "The `Transform` component has 2 types of outputs:\n",
- "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
- "* `transformed_examples` represents the preprocessed training and evaluation data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "j4UjersvAC7p"
- },
- "outputs": [],
- "source": [
- "transform.outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wRFMlRcdHlQy"
- },
- "source": [
- "Take a peek at the `transform_graph` artifact: it points to a directory containing 3 subdirectories:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "E4I-cqfQQvaW"
- },
- "outputs": [],
- "source": [
- "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
- "os.listdir(train_uri)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "9374B4RpHzor"
- },
- "source": [
- "The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data. The `transformed_metadata` subdirectory contains the schema of the preprocessed data.\n",
- "\n",
- "Take a look at some of the transformed examples and check that they are indeed processed as intended."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "-QPONyzDTswf"
- },
- "outputs": [],
- "source": [
- "def pprint_examples(artifact, n_examples=3):\n",
- " print(\"artifact:\", artifact)\n",
- " uri = os.path.join(artifact.uri, \"Split-train\")\n",
- " print(\"uri:\", uri)\n",
- " tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
- " print(\"tfrecord_filenames:\", tfrecord_filenames)\n",
- " dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
- " for tfrecord in dataset.take(n_examples):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example.FromString(serialized_example)\n",
- " pp.pprint(example)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2zIepQhSQoPa"
- },
- "outputs": [],
- "source": [
- "pprint_examples(transform.outputs['transformed_examples'].get()[0])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "vpGvPKielIvI"
- },
- "source": [
- "### The GraphAugmentation Component\n",
- "\n",
- "Since we have the sample features and the synthesized graph, we can generate the\n",
- "augmented training data for Neural Structured Learning. The NSL framework\n",
- "provides a library to combine the graph and the sample features to produce\n",
- "the final training data for graph regularization. The resulting training data\n",
- "will include original sample features as well as features of their corresponding\n",
- "neighbors.\n",
- "\n",
- "In this tutorial, we consider undirected edges and use a maximum of 3 neighbors\n",
- "per sample to augment training data with graph neighbors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "gI6P_-AXGm04"
- },
- "outputs": [],
- "source": [
- "def split_train_and_unsup(input_uri):\n",
- " 'Separate the labeled and unlabeled instances.'\n",
- "\n",
- " tmp_dir = tempfile.mkdtemp(prefix='tfx-data')\n",
- " tfrecord_filenames = [\n",
- " os.path.join(input_uri, filename) for filename in os.listdir(input_uri)\n",
- " ]\n",
- " train_path = os.path.join(tmp_dir, 'train.tfrecord')\n",
- " unsup_path = os.path.join(tmp_dir, 'unsup.tfrecord')\n",
- " with tf.io.TFRecordWriter(train_path) as train_writer, \\\n",
- " tf.io.TFRecordWriter(unsup_path) as unsup_writer:\n",
- " for tfrecord in tf.data.TFRecordDataset(\n",
- " tfrecord_filenames, compression_type='GZIP'):\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(tfrecord.numpy())\n",
- " if ('label_xf' not in example.features.feature or\n",
- " example.features.feature['label_xf'].int64_list.value[0] == -1):\n",
- " writer = unsup_writer\n",
- " else:\n",
- " writer = train_writer\n",
- " writer.write(tfrecord.numpy())\n",
- " return train_path, unsup_path\n",
- "\n",
- "\n",
- "def gzip(filepath):\n",
- " with open(filepath, 'rb') as f_in:\n",
- " with gzip_lib.open(filepath + '.gz', 'wb') as f_out:\n",
- " shutil.copyfileobj(f_in, f_out)\n",
- " os.remove(filepath)\n",
- "\n",
- "\n",
- "def copy_tfrecords(input_uri, output_uri):\n",
- " for filename in os.listdir(input_uri):\n",
- " input_filename = os.path.join(input_uri, filename)\n",
- " output_filename = os.path.join(output_uri, filename)\n",
- " shutil.copyfile(input_filename, output_filename)\n",
- "\n",
- "\n",
- "@component\n",
- "def GraphAugmentation(identified_examples: InputArtifact[Examples],\n",
- " synthesized_graph: InputArtifact[SynthesizedGraph],\n",
- " augmented_examples: OutputArtifact[Examples],\n",
- " num_neighbors: Parameter[int],\n",
- " component_name: Parameter[str]) -\u003e None:\n",
- "\n",
- " # Get a list of the splits in input_data\n",
- " splits_list = artifact_utils.decode_split_names(\n",
- " split_names=identified_examples.split_names)\n",
- "\n",
- " train_input_uri = os.path.join(identified_examples.uri, 'Split-train')\n",
- " eval_input_uri = os.path.join(identified_examples.uri, 'Split-eval')\n",
- " train_graph_uri = os.path.join(synthesized_graph.uri, 'Split-train')\n",
- " train_output_uri = os.path.join(augmented_examples.uri, 'Split-train')\n",
- " eval_output_uri = os.path.join(augmented_examples.uri, 'Split-eval')\n",
- "\n",
- " os.mkdir(train_output_uri)\n",
- " os.mkdir(eval_output_uri)\n",
- "\n",
- " # Separate the labeled and unlabeled examples from the 'Split-train' split.\n",
- " train_path, unsup_path = split_train_and_unsup(train_input_uri)\n",
- "\n",
- " output_path = os.path.join(train_output_uri, 'nsl_train_data.tfr')\n",
- " pack_nbrs_args = dict(\n",
- " labeled_examples_path=train_path,\n",
- " unlabeled_examples_path=unsup_path,\n",
- " graph_path=os.path.join(train_graph_uri, 'graph.tsv'),\n",
- " output_training_data_path=output_path,\n",
- " add_undirected_edges=True,\n",
- " max_nbrs=num_neighbors)\n",
- " print('nsl.tools.pack_nbrs arguments:', pack_nbrs_args)\n",
- " nsl.tools.pack_nbrs(**pack_nbrs_args)\n",
- "\n",
- " # Downstream components expect gzip'ed TFRecords.\n",
- " gzip(output_path)\n",
- "\n",
- " # The test examples are left untouched and are simply copied over.\n",
- " copy_tfrecords(eval_input_uri, eval_output_uri)\n",
- "\n",
- " augmented_examples.split_names = identified_examples.split_names\n",
- "\n",
- " return"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "r9MIEVDiOANe"
- },
- "outputs": [],
- "source": [
- "# Augments training data with graph neighbors.\n",
- "graph_augmentation = GraphAugmentation(\n",
- " identified_examples=transform.outputs['transformed_examples'],\n",
- " synthesized_graph=synthesize_graph.outputs['synthesized_graph'],\n",
- " component_name=u'GraphAugmentation',\n",
- " num_neighbors=3)\n",
- "context.run(graph_augmentation, enable_cache=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "gpSLs3Hx8viI"
- },
- "outputs": [],
- "source": [
- "pprint_examples(graph_augmentation.outputs['augmented_examples'].get()[0], 6)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OBJFtnl6lCg9"
- },
- "source": [
- "### The Trainer Component\n",
- "\n",
- "The `Trainer` component trains models using TensorFlow.\n",
- "\n",
- "Create a Python module containing a `trainer_fn` function, which must return an estimator. If you prefer creating a Keras model, you can do so and then convert it to an estimator using `keras.model_to_estimator()`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "5ajvClE6b2pd"
- },
- "outputs": [],
- "source": [
- "# Setup paths.\n",
- "_trainer_module_file = 'imdb_trainer.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "_dh6AejVk2Oq"
- },
- "outputs": [],
- "source": [
- "%%writefile {_trainer_module_file}\n",
- "\n",
- "import neural_structured_learning as nsl\n",
- "\n",
- "import tensorflow as tf\n",
- "\n",
- "import tensorflow_model_analysis as tfma\n",
- "import tensorflow_transform as tft\n",
- "from tensorflow_transform.tf_metadata import schema_utils\n",
- "\n",
- "\n",
- "NBR_FEATURE_PREFIX = 'NL_nbr_'\n",
- "NBR_WEIGHT_SUFFIX = '_weight'\n",
- "LABEL_KEY = 'label'\n",
- "ID_FEATURE_KEY = 'id'\n",
- "\n",
- "def _transformed_name(key):\n",
- " return key + '_xf'\n",
- "\n",
- "\n",
- "def _transformed_names(keys):\n",
- " return [_transformed_name(key) for key in keys]\n",
- "\n",
- "\n",
- "# Hyperparameters:\n",
- "#\n",
- "# We will use an instance of `HParams` to inclue various hyperparameters and\n",
- "# constants used for training and evaluation. We briefly describe each of them\n",
- "# below:\n",
- "#\n",
- "# - max_seq_length: This is the maximum number of words considered from each\n",
- "# movie review in this example.\n",
- "# - vocab_size: This is the size of the vocabulary considered for this\n",
- "# example.\n",
- "# - oov_size: This is the out-of-vocabulary size considered for this example.\n",
- "# - distance_type: This is the distance metric used to regularize the sample\n",
- "# with its neighbors.\n",
- "# - graph_regularization_multiplier: This controls the relative weight of the\n",
- "# graph regularization term in the overall\n",
- "# loss function.\n",
- "# - num_neighbors: The number of neighbors used for graph regularization. This\n",
- "# value has to be less than or equal to the `num_neighbors`\n",
- "# argument used above in the GraphAugmentation component when\n",
- "# invoking `nsl.tools.pack_nbrs`.\n",
- "# - num_fc_units: The number of units in the fully connected layer of the\n",
- "# neural network.\n",
- "class HParams(object):\n",
- " \"\"\"Hyperparameters used for training.\"\"\"\n",
- " def __init__(self):\n",
- " ### dataset parameters\n",
- " # The following 3 values should match those defined in the Transform\n",
- " # Component.\n",
- " self.max_seq_length = 100\n",
- " self.vocab_size = 10000\n",
- " self.oov_size = 100\n",
- " ### Neural Graph Learning parameters\n",
- " self.distance_type = nsl.configs.DistanceType.L2\n",
- " self.graph_regularization_multiplier = 0.1\n",
- " # The following value has to be at most the value of 'num_neighbors' used\n",
- " # in the GraphAugmentation component.\n",
- " self.num_neighbors = 1\n",
- " ### Model Architecture\n",
- " self.num_embedding_dims = 16\n",
- " self.num_fc_units = 64\n",
- "\n",
- "HPARAMS = HParams()\n",
- "\n",
- "\n",
- "def optimizer_fn():\n",
- " \"\"\"Returns an instance of `tf.Optimizer`.\"\"\"\n",
- " return tf.compat.v1.train.RMSPropOptimizer(\n",
- " learning_rate=0.0001, decay=1e-6)\n",
- "\n",
- "\n",
- "def build_train_op(loss, global_step):\n",
- " \"\"\"Builds a train op to optimize the given loss using gradient descent.\"\"\"\n",
- " with tf.name_scope('train'):\n",
- " optimizer = optimizer_fn()\n",
- " train_op = optimizer.minimize(loss=loss, global_step=global_step)\n",
- " return train_op\n",
- "\n",
- "\n",
- "# Building the model:\n",
- "#\n",
- "# A neural network is created by stacking layers—this requires two main\n",
- "# architectural decisions:\n",
- "# * How many layers to use in the model?\n",
- "# * How many *hidden units* to use for each layer?\n",
- "#\n",
- "# In this example, the input data consists of an array of word-indices. The\n",
- "# labels to predict are either 0 or 1. We will use a feed-forward neural network\n",
- "# as our base model in this tutorial.\n",
- "def feed_forward_model(features, is_training, reuse=tf.compat.v1.AUTO_REUSE):\n",
- " \"\"\"Builds a simple 2 layer feed forward neural network.\n",
- "\n",
- " The layers are effectively stacked sequentially to build the classifier. The\n",
- " first layer is an Embedding layer, which takes the integer-encoded vocabulary\n",
- " and looks up the embedding vector for each word-index. These vectors are\n",
- " learned as the model trains. The vectors add a dimension to the output array.\n",
- " The resulting dimensions are: (batch, sequence, embedding). Next is a global\n",
- " average pooling 1D layer, which reduces the dimensionality of its inputs from\n",
- " 3D to 2D. This fixed-length output vector is piped through a fully-connected\n",
- " (Dense) layer with 16 hidden units. The last layer is densely connected with a\n",
- " single output node. Using the sigmoid activation function, this value is a\n",
- " float between 0 and 1, representing a probability, or confidence level.\n",
- "\n",
- " Args:\n",
- " features: A dictionary containing batch features returned from the\n",
- " `input_fn`, that include sample features, corresponding neighbor features,\n",
- " and neighbor weights.\n",
- " is_training: a Python Boolean value or a Boolean scalar Tensor, indicating\n",
- " whether to apply dropout.\n",
- " reuse: a Python Boolean value for reusing variable scope.\n",
- "\n",
- " Returns:\n",
- " logits: Tensor of shape [batch_size, 1].\n",
- " representations: Tensor of shape [batch_size, _] for graph regularization.\n",
- " This is the representation of each example at the graph regularization\n",
- " layer.\n",
- " \"\"\"\n",
- "\n",
- " with tf.compat.v1.variable_scope('ff', reuse=reuse):\n",
- " inputs = features[_transformed_name('text')]\n",
- " embeddings = tf.compat.v1.get_variable(\n",
- " 'embeddings',\n",
- " shape=[\n",
- " HPARAMS.vocab_size + HPARAMS.oov_size, HPARAMS.num_embedding_dims\n",
- " ])\n",
- " embedding_layer = tf.nn.embedding_lookup(embeddings, inputs)\n",
- "\n",
- " pooling_layer = tf.compat.v1.layers.AveragePooling1D(\n",
- " pool_size=HPARAMS.max_seq_length, strides=HPARAMS.max_seq_length)(\n",
- " embedding_layer)\n",
- " # Shape of pooling_layer is now [batch_size, 1, HPARAMS.num_embedding_dims]\n",
- " pooling_layer = tf.reshape(pooling_layer, [-1, HPARAMS.num_embedding_dims])\n",
- "\n",
- " dense_layer = tf.compat.v1.layers.Dense(\n",
- " 16, activation='relu')(\n",
- " pooling_layer)\n",
- "\n",
- " output_layer = tf.compat.v1.layers.Dense(\n",
- " 1, activation='sigmoid')(\n",
- " dense_layer)\n",
- "\n",
- " # Graph regularization will be done on the penultimate (dense) layer\n",
- " # because the output layer is a single floating point number.\n",
- " return output_layer, dense_layer\n",
- "\n",
- "\n",
- "# A note on hidden units:\n",
- "#\n",
- "# The above model has two intermediate or \"hidden\" layers, between the input and\n",
- "# output, and excluding the Embedding layer. The number of outputs (units,\n",
- "# nodes, or neurons) is the dimension of the representational space for the\n",
- "# layer. In other words, the amount of freedom the network is allowed when\n",
- "# learning an internal representation. If a model has more hidden units\n",
- "# (a higher-dimensional representation space), and/or more layers, then the\n",
- "# network can learn more complex representations. However, it makes the network\n",
- "# more computationally expensive and may lead to learning unwanted\n",
- "# patterns—patterns that improve performance on training data but not on the\n",
- "# test data. This is called overfitting.\n",
- "\n",
- "\n",
- "# This function will be used to generate the embeddings for samples and their\n",
- "# corresponding neighbors, which will then be used for graph regularization.\n",
- "def embedding_fn(features, mode, **params):\n",
- " \"\"\"Returns the embedding corresponding to the given features.\n",
- "\n",
- " Args:\n",
- " features: A dictionary containing batch features returned from the\n",
- " `input_fn`, that include sample features, corresponding neighbor features,\n",
- " and neighbor weights.\n",
- " mode: Specifies if this is training, evaluation, or prediction. See\n",
- " tf.estimator.ModeKeys.\n",
- "\n",
- " Returns:\n",
- " The embedding that will be used for graph regularization.\n",
- " \"\"\"\n",
- " is_training = (mode == tf.estimator.ModeKeys.TRAIN)\n",
- " _, embedding = feed_forward_model(features, is_training)\n",
- " return embedding\n",
- "\n",
- "\n",
- "def feed_forward_model_fn(features, labels, mode, params, config):\n",
- " \"\"\"Implementation of the model_fn for the base feed-forward model.\n",
- "\n",
- " Args:\n",
- " features: This is the first item returned from the `input_fn` passed to\n",
- " `train`, `evaluate`, and `predict`. This should be a single `Tensor` or\n",
- " `dict` of same.\n",
- " labels: This is the second item returned from the `input_fn` passed to\n",
- " `train`, `evaluate`, and `predict`. This should be a single `Tensor` or\n",
- " `dict` of same (for multi-head models). If mode is `ModeKeys.PREDICT`,\n",
- " `labels=None` will be passed. If the `model_fn`'s signature does not\n",
- " accept `mode`, the `model_fn` must still be able to handle `labels=None`.\n",
- "    mode: Optional. Specifies if this is training, evaluation, or prediction. See\n",
- " `ModeKeys`.\n",
- " params: An HParams instance as returned by get_hyper_parameters().\n",
- " config: Optional configuration object. Will receive what is passed to\n",
- " Estimator in `config` parameter, or the default `config`. Allows updating\n",
- " things in your model_fn based on configuration such as `num_ps_replicas`,\n",
- " or `model_dir`. Unused currently.\n",
- "\n",
- " Returns:\n",
- " A `tf.estimator.EstimatorSpec` for the base feed-forward model. This does\n",
- " not include graph-based regularization.\n",
- " \"\"\"\n",
- "\n",
- " is_training = mode == tf.estimator.ModeKeys.TRAIN\n",
- "\n",
- " # Build the computation graph.\n",
- " probabilities, _ = feed_forward_model(features, is_training)\n",
- " predictions = tf.round(probabilities)\n",
- "\n",
- " if mode == tf.estimator.ModeKeys.PREDICT:\n",
- " # labels will be None, and no loss to compute.\n",
- " cross_entropy_loss = None\n",
- " eval_metric_ops = None\n",
- " else:\n",
- " # Loss is required in train and eval modes.\n",
- " # Flatten 'probabilities' to 1-D.\n",
- " probabilities = tf.reshape(probabilities, shape=[-1])\n",
- " cross_entropy_loss = tf.compat.v1.keras.losses.binary_crossentropy(\n",
- " labels, probabilities)\n",
- " eval_metric_ops = {\n",
- " 'accuracy': tf.compat.v1.metrics.accuracy(labels, predictions)\n",
- " }\n",
- "\n",
- " if is_training:\n",
- " global_step = tf.compat.v1.train.get_or_create_global_step()\n",
- " train_op = build_train_op(cross_entropy_loss, global_step)\n",
- " else:\n",
- " train_op = None\n",
- "\n",
- " return tf.estimator.EstimatorSpec(\n",
- " mode=mode,\n",
- " predictions={\n",
- " 'probabilities': probabilities,\n",
- " 'predictions': predictions\n",
- " },\n",
- " loss=cross_entropy_loss,\n",
- " train_op=train_op,\n",
- " eval_metric_ops=eval_metric_ops)\n",
- "\n",
- "\n",
- "# tf.Transform considers these features \"raw\".\n",
- "def _get_raw_feature_spec(schema):\n",
- " return schema_utils.schema_as_feature_spec(schema).feature_spec\n",
- "\n",
- "\n",
- "def _gzip_reader_fn(filenames):\n",
- " \"\"\"Small utility returning a record reader that can read gzip'ed files.\"\"\"\n",
- " return tf.data.TFRecordDataset(\n",
- " filenames,\n",
- " compression_type='GZIP')\n",
- "\n",
- "\n",
- "def _example_serving_receiver_fn(tf_transform_output, schema):\n",
- "  \"\"\"Build the serving inputs.\n",
- "\n",
- " Args:\n",
- " tf_transform_output: A TFTransformOutput.\n",
- " schema: the schema of the input data.\n",
- "\n",
- " Returns:\n",
- " Tensorflow graph which parses examples, applying tf-transform to them.\n",
- " \"\"\"\n",
- " raw_feature_spec = _get_raw_feature_spec(schema)\n",
- " raw_feature_spec.pop(LABEL_KEY)\n",
- "\n",
- " # We don't need the ID feature for serving.\n",
- " raw_feature_spec.pop(ID_FEATURE_KEY)\n",
- "\n",
- " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
- " raw_feature_spec, default_batch_size=None)\n",
- " serving_input_receiver = raw_input_fn()\n",
- "\n",
- " transformed_features = tf_transform_output.transform_raw_features(\n",
- " serving_input_receiver.features)\n",
- "\n",
- "  # Even though LABEL_KEY was removed from 'raw_feature_spec', the transform\n",
- " # operation would have injected the transformed LABEL_KEY feature with a\n",
- " # default value.\n",
- " transformed_features.pop(_transformed_name(LABEL_KEY))\n",
- " return tf.estimator.export.ServingInputReceiver(\n",
- " transformed_features, serving_input_receiver.receiver_tensors)\n",
- "\n",
- "\n",
- "def _eval_input_receiver_fn(tf_transform_output, schema):\n",
- " \"\"\"Build everything needed for the tf-model-analysis to run the model.\n",
- "\n",
- " Args:\n",
- " tf_transform_output: A TFTransformOutput.\n",
- " schema: the schema of the input data.\n",
- "\n",
- " Returns:\n",
- " EvalInputReceiver function, which contains:\n",
- " - Tensorflow graph which parses raw untransformed features, applies the\n",
- " tf-transform preprocessing operators.\n",
- " - Set of raw, untransformed features.\n",
- " - Label against which predictions will be compared.\n",
- " \"\"\"\n",
- " # Notice that the inputs are raw features, not transformed features here.\n",
- " raw_feature_spec = _get_raw_feature_spec(schema)\n",
- "\n",
- " # We don't need the ID feature for TFMA.\n",
- " raw_feature_spec.pop(ID_FEATURE_KEY)\n",
- "\n",
- " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
- " raw_feature_spec, default_batch_size=None)\n",
- " serving_input_receiver = raw_input_fn()\n",
- "\n",
- " transformed_features = tf_transform_output.transform_raw_features(\n",
- " serving_input_receiver.features)\n",
- "\n",
- " labels = transformed_features.pop(_transformed_name(LABEL_KEY))\n",
- " return tfma.export.EvalInputReceiver(\n",
- " features=transformed_features,\n",
- " receiver_tensors=serving_input_receiver.receiver_tensors,\n",
- " labels=labels)\n",
- "\n",
- "\n",
- "def _augment_feature_spec(feature_spec, num_neighbors):\n",
- " \"\"\"Augments `feature_spec` to include neighbor features.\n",
- " Args:\n",
- " feature_spec: Dictionary of feature keys mapping to TF feature types.\n",
- " num_neighbors: Number of neighbors to use for feature key augmentation.\n",
- " Returns:\n",
- " An augmented `feature_spec` that includes neighbor feature keys.\n",
- " \"\"\"\n",
- " for i in range(num_neighbors):\n",
- " feature_spec['{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'id')] = \\\n",
- " tf.io.VarLenFeature(dtype=tf.string)\n",
- " # We don't care about the neighbor features corresponding to\n",
- " # _transformed_name(LABEL_KEY) because the LABEL_KEY feature will be\n",
- " # removed from the feature spec during training/evaluation.\n",
- " feature_spec['{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'text_xf')] = \\\n",
- " tf.io.FixedLenFeature(shape=[HPARAMS.max_seq_length], dtype=tf.int64,\n",
- " default_value=tf.constant(0, dtype=tf.int64,\n",
- " shape=[HPARAMS.max_seq_length]))\n",
- "    # The 'NL_num_nbrs' feature is currently not used.\n",
- "\n",
- " # Set the neighbor weight feature keys.\n",
- " for i in range(num_neighbors):\n",
- " feature_spec['{}{}{}'.format(NBR_FEATURE_PREFIX, i, NBR_WEIGHT_SUFFIX)] = \\\n",
- " tf.io.FixedLenFeature(shape=[1], dtype=tf.float32, default_value=[0.0])\n",
- "\n",
- " return feature_spec\n",
- "\n",
- "\n",
- "def _input_fn(filenames, tf_transform_output, is_training, batch_size=200):\n",
- " \"\"\"Generates features and labels for training or evaluation.\n",
- "\n",
- " Args:\n",
- " filenames: [str] list of CSV files to read data from.\n",
- " tf_transform_output: A TFTransformOutput.\n",
- " is_training: Boolean indicating if we are in training mode.\n",
- " batch_size: int First dimension size of the Tensors returned by input_fn\n",
- "\n",
- " Returns:\n",
- " A (features, indices) tuple where features is a dictionary of\n",
- " Tensors, and indices is a single Tensor of label indices.\n",
- " \"\"\"\n",
- " transformed_feature_spec = (\n",
- " tf_transform_output.transformed_feature_spec().copy())\n",
- "\n",
- " # During training, NSL uses augmented training data (which includes features\n",
- " # from graph neighbors). So, update the feature spec accordingly. This needs\n",
- " # to be done because we are using different schemas for NSL training and eval,\n",
- " # but the Trainer Component only accepts a single schema.\n",
- " if is_training:\n",
- "    transformed_feature_spec = _augment_feature_spec(transformed_feature_spec,\n",
- " HPARAMS.num_neighbors)\n",
- "\n",
- " dataset = tf.data.experimental.make_batched_features_dataset(\n",
- " filenames, batch_size, transformed_feature_spec, reader=_gzip_reader_fn)\n",
- "\n",
- " transformed_features = tf.compat.v1.data.make_one_shot_iterator(\n",
- " dataset).get_next()\n",
- " # We pop the label because we do not want to use it as a feature while we're\n",
- " # training.\n",
- " return transformed_features, transformed_features.pop(\n",
- " _transformed_name(LABEL_KEY))\n",
- "\n",
- "\n",
- "# TFX will call this function\n",
- "def trainer_fn(hparams, schema):\n",
- " \"\"\"Build the estimator using the high level API.\n",
- " Args:\n",
- " hparams: Holds hyperparameters used to train the model as name/value pairs.\n",
- " schema: Holds the schema of the training examples.\n",
- " Returns:\n",
- " A dict of the following:\n",
- " - estimator: The estimator that will be used for training and eval.\n",
- " - train_spec: Spec for training.\n",
- " - eval_spec: Spec for eval.\n",
- " - eval_input_receiver_fn: Input function for eval.\n",
- " \"\"\"\n",
- " train_batch_size = 40\n",
- " eval_batch_size = 40\n",
- "\n",
- " tf_transform_output = tft.TFTransformOutput(hparams.transform_output)\n",
- "\n",
- " train_input_fn = lambda: _input_fn(\n",
- " hparams.train_files,\n",
- " tf_transform_output,\n",
- " is_training=True,\n",
- " batch_size=train_batch_size)\n",
- "\n",
- " eval_input_fn = lambda: _input_fn(\n",
- " hparams.eval_files,\n",
- " tf_transform_output,\n",
- " is_training=False,\n",
- " batch_size=eval_batch_size)\n",
- "\n",
- " train_spec = tf.estimator.TrainSpec(\n",
- " train_input_fn,\n",
- " max_steps=hparams.train_steps)\n",
- "\n",
- " serving_receiver_fn = lambda: _example_serving_receiver_fn(\n",
- " tf_transform_output, schema)\n",
- "\n",
- " exporter = tf.estimator.FinalExporter('imdb', serving_receiver_fn)\n",
- " eval_spec = tf.estimator.EvalSpec(\n",
- " eval_input_fn,\n",
- " steps=hparams.eval_steps,\n",
- " exporters=[exporter],\n",
- " name='imdb-eval')\n",
- "\n",
- " run_config = tf.estimator.RunConfig(\n",
- " save_checkpoints_steps=999, keep_checkpoint_max=1)\n",
- "\n",
- " run_config = run_config.replace(model_dir=hparams.serving_model_dir)\n",
- "\n",
- " estimator = tf.estimator.Estimator(\n",
- " model_fn=feed_forward_model_fn, config=run_config, params=HPARAMS)\n",
- "\n",
- " # Create a graph regularization config.\n",
- " graph_reg_config = nsl.configs.make_graph_reg_config(\n",
- " max_neighbors=HPARAMS.num_neighbors,\n",
- " multiplier=HPARAMS.graph_regularization_multiplier,\n",
- " distance_type=HPARAMS.distance_type,\n",
- " sum_over_axis=-1)\n",
- "\n",
- " # Invoke the Graph Regularization Estimator wrapper to incorporate\n",
- " # graph-based regularization for training.\n",
- " graph_nsl_estimator = nsl.estimator.add_graph_regularization(\n",
- " estimator,\n",
- " embedding_fn,\n",
- " optimizer_fn=optimizer_fn,\n",
- " graph_reg_config=graph_reg_config)\n",
- "\n",
- " # Create an input receiver for TFMA processing\n",
- " receiver_fn = lambda: _eval_input_receiver_fn(\n",
- " tf_transform_output, schema)\n",
- "\n",
- " return {\n",
- " 'estimator': graph_nsl_estimator,\n",
- " 'train_spec': train_spec,\n",
- " 'eval_spec': eval_spec,\n",
- " 'eval_input_receiver_fn': receiver_fn\n",
- " }"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GnLjStUJIoos"
- },
- "source": [
- "Create and run the `Trainer` component, passing it the file that we created above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "MWLQI6t0b2pg"
- },
- "outputs": [],
- "source": [
- "# Uses user-provided Python function that implements a model using TensorFlow's\n",
- "# Estimators API.\n",
- "trainer = Trainer(\n",
- " module_file=_trainer_module_file,\n",
- " custom_executor_spec=executor_spec.ExecutorClassSpec(\n",
- " trainer_executor.Executor),\n",
- " transformed_examples=graph_augmentation.outputs['augmented_examples'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " transform_graph=transform.outputs['transform_graph'],\n",
- " train_args=trainer_pb2.TrainArgs(num_steps=10000),\n",
- " eval_args=trainer_pb2.EvalArgs(num_steps=5000))\n",
- "context.run(trainer)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "pDiZvYbFb2ph"
- },
- "source": [
- "Take a peek at the trained model which was exported from `Trainer`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "qDBZG9Oso-BD"
- },
- "outputs": [],
- "source": [
- "train_uri = trainer.outputs['model'].get()[0].uri\n",
- "serving_model_path = os.path.join(train_uri, 'Format-Serving')\n",
- "exported_model = tf.saved_model.load(serving_model_path)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "KyT3ZVGCZWsj"
- },
- "outputs": [],
- "source": [
- "exported_model.graph.get_operations()[:10] + [\"...\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "zIsspBf5GjKm"
- },
- "source": [
- "Let's visualize the model's metrics using TensorBoard."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "rnKeqLmcGqHH"
- },
- "outputs": [],
- "source": [
- "#docs_infra: no_execute\n",
- "\n",
- "# Get the URI of the output artifact representing the training logs,\n",
- "# which is a directory\n",
- "model_run_dir = trainer.outputs['model_run'].get()[0].uri\n",
- "\n",
- "%load_ext tensorboard\n",
- "%tensorboard --logdir {model_run_dir}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "LgZXZJBsGzHm"
- },
- "source": [
- "## Model Serving\n",
- "\n",
- "Graph regularization only affects the training workflow by adding a regularization term to the loss function. As a result, the model evaluation and serving workflows remain unchanged. It is for the same reason that we've also omitted downstream TFX components that typically come after the *Trainer* component like the *Evaluator*, *Pusher*, etc."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "qOh5FjbWiP-b"
- },
- "source": [
- "## Conclusion\n",
- "\n",
- "We have demonstrated the use of graph regularization using the Neural Structured\n",
- "Learning (NSL) framework in a TFX pipeline even when the input does not contain\n",
- "an explicit graph. We considered the task of sentiment classification of IMDB\n",
- "movie reviews for which we synthesized a similarity graph based on review\n",
- "embeddings. We encourage users to experiment further by using different\n",
- "embeddings for graph construction, varying hyperparameters, changing the amount\n",
- "of supervision, and by defining different model architectures."
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [
- "24gYiJcWNlpA"
- ],
- "name": "Neural_Structured_Learning.ipynb",
- "private_outputs": true,
- "provenance": [],
- "toc_visible": true
- },
- "file_extension": ".py",
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- },
- "mimetype": "text/x-python",
- "name": "python",
- "npconvert_exporter": "python",
- "orig_nbformat": 2,
- "pygments_lexer": "ipython3",
- "version": 3
- },
- "nbformat": 4,
- "nbformat_minor": 0
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-niht8EPmUUl"
+ },
+ "source": [
+ "> Warning: Estimators are not recommended for new code. Estimators run v1.Session-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our [compatibility guarantees](https://tensorflow.org/guide/versions), but will receive no fixes other than security vulnerabilities. See the [migration guide](https://tensorflow.org/guide/migrate) for details."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z3otbdCMmJiJ"
+ },
+ "source": [
+ "## Overview"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ApxPtg2DiTtd"
+ },
+ "source": [
+ "This notebook classifies movie reviews as *positive* or *negative* using the\n",
+ "text of the review. This is an example of *binary* classification, an important\n",
+ "and widely applicable kind of machine learning problem.\n",
+ "\n",
+ "We will demonstrate the use of graph regularization in this notebook by building\n",
+ "a graph from the given input. The general recipe for building a\n",
+ "graph-regularized model using the Neural Structured Learning (NSL) framework\n",
+ "when the input does not contain an explicit graph is as follows:\n",
+ "\n",
+ "1. Create embeddings for each text sample in the input. This can be done using\n",
+ " pre-trained models such as [word2vec](https://arxiv.org/pdf/1310.4546.pdf),\n",
+ " [Swivel](https://arxiv.org/abs/1602.02215),\n",
+ " [BERT](https://arxiv.org/abs/1810.04805) etc.\n",
+ "2. Build a graph based on these embeddings by using a similarity metric such as\n",
+ " the 'L2' distance, 'cosine' distance, etc. Nodes in the graph correspond to\n",
+ " samples and edges in the graph correspond to similarity between pairs of\n",
+ " samples.\n",
+ "3. Generate training data from the above synthesized graph and sample features.\n",
+ " The resulting training data will contain neighbor features in addition to\n",
+ " the original node features.\n",
+ "4. Create a neural network as a base model using Estimators.\n",
+ "5. Wrap the base model with the `add_graph_regularization` wrapper function,\n",
+ " which is provided by the NSL framework, to create a new graph Estimator\n",
+ " model. This new model will include a graph regularization loss as the\n",
+ " regularization term in its training objective.\n",
+ "6. Train and evaluate the graph Estimator model.\n",
+ "\n",
+ "In this tutorial, we integrate the above workflow in a TFX pipeline using\n",
+ "several custom TFX components as well as a custom graph-regularized trainer\n",
+ "component.\n",
+ "\n",
+ "Below is the schematic for our TFX pipeline. Orange boxes represent\n",
+ "off-the-shelf TFX components and pink boxes represent custom TFX components.\n",
+ "\n",
+ "![TFX Pipeline](images/nsl/nsl-tfx.svg)"
+ ]
+ },
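Step 2 of the recipe above (building a graph from sample embeddings) can be sketched independently of the pipeline. The helper below is a minimal, hypothetical illustration using cosine similarity: nodes are sample IDs, and an edge connects any pair whose similarity clears a threshold (0.8 here is an arbitrary choice, not a value taken from this tutorial). The custom graph-builder component used later in the pipeline plays this role at scale.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_similarity_graph(embeddings, threshold=0.8):
    """Returns undirected edges (id_i, id_j, weight) for similar pairs.

    `embeddings` maps a sample id to its embedding vector. Each edge's
    weight is the cosine similarity between the two endpoint embeddings;
    pairs below `threshold` are left unconnected.
    """
    ids = sorted(embeddings)
    edges = []
    for idx, i in enumerate(ids):
        for j in ids[idx + 1:]:
            weight = cosine_similarity(embeddings[i], embeddings[j])
            if weight >= threshold:
                edges.append((i, j, weight))
    return edges

# Toy embeddings: "a" and "b" point in nearly the same direction,
# while "c" is orthogonal to both, so only (a, b) gets an edge.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
edges = build_similarity_graph(toy, threshold=0.8)
```

NSL also ships a batch utility with the same thresholded-similarity idea (`nsl.tools.build_graph`), which operates on files of embedded `tf.train.Example`s rather than in-memory dicts.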
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EIx0r9-TeVQQ"
+ },
+ "source": [
+ "## Upgrade Pip\n",
+ "\n",
+ "To avoid upgrading Pip on a system when running locally, first check that we're running in Colab. Local systems can of course be upgraded separately."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "-UmVrHUfkUA2"
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "if 'google.colab' in sys.modules:\n",
+ " !pip install --upgrade pip"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nDOFbB34KY1R"
+ },
+ "source": [
+ "## Install Required Packages"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "yDUe7gk_ztZ-"
+ },
+ "outputs": [],
+ "source": [
+ "# Pin TFX below 1.16: later releases removed tf.estimator support, which this tutorial requires.\n",
+ "!pip install -q \\\n",
+ " \"tfx<1.16\" \\\n",
+ " neural-structured-learning \\\n",
+ " tensorflow-hub \\\n",
+ " tensorflow-datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1CeGS8G_eueJ"
+ },
+ "source": [
+ "## Did you restart the runtime?\n",
+ "\n",
+ "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime ...). This is because of the way that Colab loads packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x6FJ64qMNLez"
+ },
+ "source": [
+ "## Dependencies and imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2ew7HTbPpCJH"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "import gzip as gzip_lib\n",
+ "import numpy as np\n",
+ "import os\n",
+ "import pprint\n",
+ "import shutil\n",
+ "import tempfile\n",
+ "import urllib\n",
+ "import uuid\n",
+ "pp = pprint.PrettyPrinter()\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "import neural_structured_learning as nsl\n",
+ "\n",
+ "import tfx\n",
+ "from tfx.components.evaluator.component import Evaluator\n",
+ "from tfx.components.example_gen.import_example_gen.component import ImportExampleGen\n",
+ "from tfx.components.example_validator.component import ExampleValidator\n",
+ "from tfx.components.model_validator.component import ModelValidator\n",
+ "from tfx.components.pusher.component import Pusher\n",
+ "from tfx.components.schema_gen.component import SchemaGen\n",
+ "from tfx.components.statistics_gen.component import StatisticsGen\n",
+ "from tfx.components.trainer import executor as trainer_executor\n",
+ "from tfx.components.trainer.component import Trainer\n",
+ "from tfx.components.transform.component import Transform\n",
+ "from tfx.dsl.components.base import executor_spec\n",
+ "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
+ "from tfx.proto import evaluator_pb2\n",
+ "from tfx.proto import example_gen_pb2\n",
+ "from tfx.proto import pusher_pb2\n",
+ "from tfx.proto import trainer_pb2\n",
+ "\n",
+ "from tfx.types import artifact\n",
+ "from tfx.types import artifact_utils\n",
+ "from tfx.types import channel\n",
+ "from tfx.types import standard_artifacts\n",
+ "from tfx.types.standard_artifacts import Examples\n",
+ "\n",
+ "from tfx.dsl.component.experimental.annotations import InputArtifact\n",
+ "from tfx.dsl.component.experimental.annotations import OutputArtifact\n",
+ "from tfx.dsl.component.experimental.annotations import Parameter\n",
+ "from tfx.dsl.component.experimental.decorators import component\n",
+ "\n",
+ "from tensorflow_metadata.proto.v0 import anomalies_pb2\n",
+ "from tensorflow_metadata.proto.v0 import schema_pb2\n",
+ "from tensorflow_metadata.proto.v0 import statistics_pb2\n",
+ "\n",
+ "import tensorflow_data_validation as tfdv\n",
+ "import tensorflow_transform as tft\n",
+ "import tensorflow_model_analysis as tfma\n",
+ "import tensorflow_hub as hub\n",
+ "import tensorflow_datasets as tfds\n",
+ "\n",
+ "print(\"TF Version: \", tf.__version__)\n",
+ "print(\"Eager mode: \", tf.executing_eagerly())\n",
+ "print(\n",
+ " \"GPU is\",\n",
+ " \"available\" if tf.config.list_physical_devices(\"GPU\") else \"NOT AVAILABLE\")\n",
+ "print(\"NSL Version: \", nsl.__version__)\n",
+ "print(\"TFX Version: \", tfx.__version__)\n",
+ "print(\"TFDV version: \", tfdv.__version__)\n",
+ "print(\"TFT version: \", tft.__version__)\n",
+ "print(\"TFMA version: \", tfma.__version__)\n",
+ "print(\"Hub version: \", hub.__version__)\n",
+ "print(\"Beam version: \", beam.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nGwwFd99n42P"
+ },
+ "source": [
+ "## IMDB dataset\n",
+ "\n",
+ "The\n",
+ "[IMDB dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews)\n",
+ "contains the text of 50,000 movie reviews from the\n",
+ "[Internet Movie Database](https://www.imdb.com/). These are split into 25,000\n",
+ "reviews for training and 25,000 reviews for testing. The training and testing\n",
+ "sets are *balanced*, meaning they contain an equal number of positive and\n",
+ "negative reviews.\n",
+ "Moreover, there are 50,000 additional unlabeled movie reviews."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iAsKG535pHep"
+ },
+ "source": [
+ "### Download preprocessed IMDB dataset\n",
+ "\n",
+ "The following code downloads the IMDB dataset (or uses a cached copy if it has already been downloaded) using TFDS. To speed up this notebook we will use only 10,000 labeled reviews and 10,000 unlabeled reviews for training, and 10,000 test reviews for evaluation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "__cZi2Ic48KL"
+ },
+ "outputs": [],
+ "source": [
+ "train_set, eval_set = tfds.load(\n",
+ " \"imdb_reviews:1.0.0\",\n",
+ " split=[\"train[:10000]+unsupervised[:10000]\", \"test[:10000]\"],\n",
+ " shuffle_files=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nE9tNh-67Y3W"
+ },
+ "source": [
+ "Let's look at a few reviews from the training set:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "LsnHde8T67Jz"
+ },
+ "outputs": [],
+ "source": [
+ "for tfrecord in train_set.take(4):\n",
+ " print(\"Review: {}\".format(tfrecord[\"text\"].numpy().decode(\"utf-8\")[:300]))\n",
+ " print(\"Label: {}\\n\".format(tfrecord[\"label\"].numpy()))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0wG7v3rk-Cwo"
+ },
+ "outputs": [],
+ "source": [
+ "def _dict_to_example(instance):\n",
+ "  \"\"\"Converts a feature dict to a tf.train.Example proto.\"\"\"\n",
+ " feature = {}\n",
+ " for key, value in instance.items():\n",
+ " if value is None:\n",
+ " feature[key] = tf.train.Feature()\n",
+ "    elif np.issubdtype(value.dtype, np.integer):\n",
+ " feature[key] = tf.train.Feature(\n",
+ " int64_list=tf.train.Int64List(value=value.tolist()))\n",
+ " elif value.dtype == np.float32:\n",
+ " feature[key] = tf.train.Feature(\n",
+ " float_list=tf.train.FloatList(value=value.tolist()))\n",
+ " else:\n",
+ " feature[key] = tf.train.Feature(\n",
+ " bytes_list=tf.train.BytesList(value=value.tolist()))\n",
+ " return tf.train.Example(features=tf.train.Features(feature=feature))\n",
+ "\n",
+ "\n",
+ "examples_path = tempfile.mkdtemp(prefix=\"tfx-data\")\n",
+ "train_path = os.path.join(examples_path, \"train.tfrecord\")\n",
+ "eval_path = os.path.join(examples_path, \"eval.tfrecord\")\n",
+ "\n",
+ "for path, dataset in [(train_path, train_set), (eval_path, eval_set)]:\n",
+ " with tf.io.TFRecordWriter(path) as writer:\n",
+ " for example in dataset:\n",
+ " writer.write(\n",
+ " _dict_to_example({\n",
+ " \"label\": np.array([example[\"label\"].numpy()]),\n",
+ " \"text\": np.array([example[\"text\"].numpy()]),\n",
+ " }).SerializeToString())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HdQWxfsVkzdJ"
+ },
+ "source": [
+ "## Run TFX Components Interactively\n",
+ "\n",
+ "In the cells that follow you will construct TFX components and run each one interactively within the InteractiveContext to obtain `ExecutionResult` objects. This mirrors the process of an orchestrator running components in a TFX DAG based on when the dependencies for each component are met."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4aVuXUil7hil"
+ },
+ "outputs": [],
+ "source": [
+ "context = InteractiveContext()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L9fwt9gQk3BR"
+ },
+ "source": [
+ "### The ExampleGen Component\n",
+ "In any ML development process, the first step is to ingest the training and test datasets. The `ExampleGen` component brings data into the TFX pipeline.\n",
+ "\n",
+ "Create an ExampleGen component and run it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "WdH4ql3Y7pT4"
+ },
+ "outputs": [],
+ "source": [
+ "input_config = example_gen_pb2.Input(splits=[\n",
+ " example_gen_pb2.Input.Split(name='train', pattern='train.tfrecord'),\n",
+ " example_gen_pb2.Input.Split(name='eval', pattern='eval.tfrecord')\n",
+ "])\n",
+ "\n",
+ "example_gen = ImportExampleGen(input_base=examples_path, input_config=input_config)\n",
+ "\n",
+ "context.run(example_gen, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "IeUp6xCCrxsS"
+ },
+ "outputs": [],
+ "source": [
+ "for artifact in example_gen.outputs['examples'].get():\n",
+ " print(artifact)\n",
+ "\n",
+ "print('\\nexample_gen.outputs is a {}'.format(type(example_gen.outputs)))\n",
+ "print(example_gen.outputs)\n",
+ "\n",
+ "print(example_gen.outputs['examples'].get()[0].split_names)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0SXc2OGnDWz5"
+ },
+ "source": [
+ "The component's outputs include 2 artifacts:\n",
+ "* the training examples (10,000 labeled reviews + 10,000 unlabeled reviews)\n",
+ "* the eval examples (10,000 labeled reviews)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pcPppPASQzFa"
+ },
+ "source": [
+ "### The IdentifyExamples Custom Component\n",
+ "To use NSL, we will need each instance to have a unique ID. We create a custom\n",
+ "component that adds such a unique ID to all instances across all splits. We\n",
+ "leverage [Apache Beam](https://beam.apache.org) to be able to easily scale to\n",
+ "large datasets if needed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XHCUzXA5qeWe"
+ },
+ "outputs": [],
+ "source": [
+ "def make_example_with_unique_id(example, id_feature_name):\n",
+ " \"\"\"Adds a unique ID to the given `tf.train.Example` proto.\n",
+ "\n",
+ " This function uses Python's 'uuid' module to generate a universally unique\n",
+ " identifier for each example.\n",
+ "\n",
+ " Args:\n",
+ " example: An instance of a `tf.train.Example` proto.\n",
+ " id_feature_name: The name of the feature in the resulting `tf.train.Example`\n",
+ " that will contain the unique identifier.\n",
+ "\n",
+ " Returns:\n",
+ " A new `tf.train.Example` proto that includes a unique identifier as an\n",
+ " additional feature.\n",
+ " \"\"\"\n",
+ " result = tf.train.Example()\n",
+ " result.CopyFrom(example)\n",
+ " unique_id = uuid.uuid4()\n",
+ " result.features.feature.get_or_create(\n",
+ " id_feature_name).bytes_list.MergeFrom(\n",
+ " tf.train.BytesList(value=[str(unique_id).encode('utf-8')]))\n",
+ " return result\n",
+ "\n",
+ "\n",
+ "@component\n",
+ "def IdentifyExamples(orig_examples: InputArtifact[Examples],\n",
+ " identified_examples: OutputArtifact[Examples],\n",
+ " id_feature_name: Parameter[str],\n",
+ " component_name: Parameter[str]) -> None:\n",
+ "\n",
+ " # Get a list of the splits in input_data\n",
+ " splits_list = artifact_utils.decode_split_names(\n",
+ " split_names=orig_examples.split_names)\n",
+ " # For completeness, encode the splits names and payload_format.\n",
+ " # We could also just use input_data.split_names.\n",
+ " identified_examples.split_names = artifact_utils.encode_split_names(\n",
+ " splits=splits_list)\n",
+ " # TODO(b/168616829): Remove populating payload_format after tfx 0.25.0.\n",
+ " identified_examples.set_string_custom_property(\n",
+ " \"payload_format\",\n",
+ " orig_examples.get_string_custom_property(\"payload_format\"))\n",
+ "\n",
+ " for split in splits_list:\n",
+ " input_dir = artifact_utils.get_split_uri([orig_examples], split)\n",
+ " output_dir = artifact_utils.get_split_uri([identified_examples], split)\n",
+ " os.mkdir(output_dir)\n",
+ " with beam.Pipeline() as pipeline:\n",
+ " (pipeline\n",
+ " | 'ReadExamples' >> beam.io.ReadFromTFRecord(\n",
+ " os.path.join(input_dir, '*'),\n",
+ " coder=beam.coders.coders.ProtoCoder(tf.train.Example))\n",
+ " | 'AddUniqueId' >> beam.Map(make_example_with_unique_id, id_feature_name)\n",
+ " | 'WriteIdentifiedExamples' >> beam.io.WriteToTFRecord(\n",
+ " file_path_prefix=os.path.join(output_dir, 'data_tfrecord'),\n",
+ " coder=beam.coders.coders.ProtoCoder(tf.train.Example),\n",
+ " file_name_suffix='.gz'))\n",
+ "\n",
+ " return"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ZtLxNWHPO0je"
+ },
+ "outputs": [],
+ "source": [
+ "identify_examples = IdentifyExamples(\n",
+ " orig_examples=example_gen.outputs['examples'],\n",
+ " component_name=u'IdentifyExamples',\n",
+ " id_feature_name=u'id')\n",
+ "context.run(identify_examples, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "csM6BFhtk5Aa"
+ },
+ "source": [
+ "### The StatisticsGen Component\n",
+ "\n",
+ "The `StatisticsGen` component computes descriptive statistics for your dataset. The statistics that it generates can be visualized for review, and are used for example validation and to infer a schema.\n",
+ "\n",
+ "Create a StatisticsGen component and run it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MAscCCYWgA-9"
+ },
+ "outputs": [],
+ "source": [
+ "# Computes statistics over data for visualization and example validation.\n",
+ "statistics_gen = StatisticsGen(\n",
+ " examples=identify_examples.outputs[\"identified_examples\"])\n",
+ "context.run(statistics_gen, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HLKLTO9Nk60p"
+ },
+ "source": [
+ "### The SchemaGen Component\n",
+ "\n",
+ "The `SchemaGen` component generates a schema for your data based on the statistics from StatisticsGen. It tries to infer the data types of each of your features, and the ranges of legal values for categorical features.\n",
+ "\n",
+ "Create a SchemaGen component and run it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ygQvZ6hsiQ_J"
+ },
+ "outputs": [],
+ "source": [
+ "# Generates schema based on statistics files.\n",
+ "schema_gen = SchemaGen(\n",
+ " statistics=statistics_gen.outputs['statistics'], infer_feature_shape=False)\n",
+ "context.run(schema_gen, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kdtU3u01FR-2"
+ },
+ "source": [
+ "The generated artifact is just a `schema.pbtxt` containing a text representation of a `schema_pb2.Schema` protobuf:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "L6-tgKi6A_gK"
+ },
+ "outputs": [],
+ "source": [
+ "train_uri = schema_gen.outputs['schema'].get()[0].uri\n",
+ "schema_filename = os.path.join(train_uri, 'schema.pbtxt')\n",
+ "schema = tfx.utils.io_utils.parse_pbtxt_file(\n",
+ " file_name=schema_filename, message=schema_pb2.Schema())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FaSgx5qIFelw"
+ },
+ "source": [
+ "It can be visualized using `tfdv.display_schema()` (we will look at this in more detail in a subsequent lab):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "gycOsJIQFhi3"
+ },
+ "outputs": [],
+ "source": [
+ "tfdv.display_schema(schema)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "V1qcUuO9k9f8"
+ },
+ "source": [
+ "### The ExampleValidator Component\n",
+ "\n",
+ "The `ExampleValidator` performs anomaly detection, based on the statistics from StatisticsGen and the schema from SchemaGen. It looks for problems such as missing values, values of the wrong type, or categorical values outside of the domain of acceptable values.\n",
+ "\n",
+ "Create an ExampleValidator component and run it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XRlRUuGgiXks"
+ },
+ "outputs": [],
+ "source": [
+ "# Performs anomaly detection based on statistics and data schema.\n",
+ "validate_stats = ExampleValidator(\n",
+ " statistics=statistics_gen.outputs['statistics'],\n",
+ " schema=schema_gen.outputs['schema'])\n",
+ "context.run(validate_stats, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "g3f2vmrF_e9b"
+ },
+ "source": [
+ "### The SynthesizeGraph Component"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3oCuXo4BPfGr"
+ },
+ "source": [
+ "Graph construction involves creating embeddings for text samples and then using\n",
+ "a similarity function to compare the embeddings."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Gf8B3KxcinZ0"
+ },
+ "source": [
+ "We will use pretrained Swivel embeddings to create embeddings in the\n",
+ "`tf.train.Example` format for each sample in the input. We will store the\n",
+ "resulting embeddings in the `TFRecord` format along with the sample's ID.\n",
+ "This is important and will allow us match sample embeddings with corresponding\n",
+ "nodes in the graph later."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_hSzZNdbPa4X"
+ },
+ "source": [
+ "Once we have the sample embeddings, we will use them to build a similarity\n",
+ "graph, i.e, nodes in this graph will correspond to samples and edges in this\n",
+ "graph will correspond to similarity between pairs of nodes.\n",
+ "\n",
+ "Neural Structured Learning provides a graph building library to build a graph\n",
+ "based on sample embeddings. It uses **cosine similarity** as the similarity\n",
+ "measure to compare embeddings and build edges between them. It also allows us to specify a similarity threshold, which can be used to discard dissimilar edges from the final graph. In the following example, using 0.99 as the similarity threshold, we end up with a graph that has 111,066 bi-directional edges."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nERXNfSWPa4Z"
+ },
+ "source": [
+ "**Note:** Graph quality and by extension, embedding quality, are very important\n",
+ "for graph regularization. While we use Swivel embeddings in this notebook, using BERT embeddings for instance, will likely capture review semantics more\n",
+ "accurately. We encourage users to use embeddings of their choice and as appropriate to their needs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2bAttbhgPa4V"
+ },
+ "outputs": [],
+ "source": [
+ "swivel_url = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'\n",
+ "hub_layer = hub.KerasLayer(swivel_url, input_shape=[], dtype=tf.string)\n",
+ "\n",
+ "\n",
+ "def _bytes_feature(value):\n",
+ " \"\"\"Returns a bytes_list from a string / byte.\"\"\"\n",
+ " return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))\n",
+ "\n",
+ "\n",
+ "def _float_feature(value):\n",
+ " \"\"\"Returns a float_list from a float / double.\"\"\"\n",
+ " return tf.train.Feature(float_list=tf.train.FloatList(value=value))\n",
+ "\n",
+ "\n",
+ "def create_embedding_example(example):\n",
+ " \"\"\"Create tf.Example containing the sample's embedding and its ID.\"\"\"\n",
+ " sentence_embedding = hub_layer(tf.sparse.to_dense(example['text']))\n",
+ "\n",
+ " # Flatten the sentence embedding back to 1-D.\n",
+ " sentence_embedding = tf.reshape(sentence_embedding, shape=[-1])\n",
+ "\n",
+ " feature_dict = {\n",
+ " 'id': _bytes_feature(tf.sparse.to_dense(example['id']).numpy()),\n",
+ " 'embedding': _float_feature(sentence_embedding.numpy().tolist())\n",
+ " }\n",
+ "\n",
+ " return tf.train.Example(features=tf.train.Features(feature=feature_dict))\n",
+ "\n",
+ "\n",
+ "def create_dataset(uri):\n",
+ " tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
+ " return tf.data.TFRecordDataset(tfrecord_filenames, compression_type='GZIP')\n",
+ "\n",
+ "\n",
+ "def create_embeddings(train_path, output_path):\n",
+ " dataset = create_dataset(train_path)\n",
+ " embeddings_path = os.path.join(output_path, 'embeddings.tfr')\n",
+ "\n",
+ " feature_map = {\n",
+ " 'label': tf.io.FixedLenFeature([], tf.int64),\n",
+ " 'id': tf.io.VarLenFeature(tf.string),\n",
+ " 'text': tf.io.VarLenFeature(tf.string)\n",
+ " }\n",
+ "\n",
+ " with tf.io.TFRecordWriter(embeddings_path) as writer:\n",
+ " for tfrecord in dataset:\n",
+ " tensor_dict = tf.io.parse_single_example(tfrecord, feature_map)\n",
+ " embedding_example = create_embedding_example(tensor_dict)\n",
+ " writer.write(embedding_example.SerializeToString())\n",
+ "\n",
+ "\n",
+ "def build_graph(output_path, similarity_threshold):\n",
+ " embeddings_path = os.path.join(output_path, 'embeddings.tfr')\n",
+ " graph_path = os.path.join(output_path, 'graph.tsv')\n",
+ " graph_builder_config = nsl.configs.GraphBuilderConfig(\n",
+ " similarity_threshold=similarity_threshold,\n",
+ " lsh_splits=32,\n",
+ " lsh_rounds=15,\n",
+ " random_seed=12345)\n",
+ " nsl.tools.build_graph_from_config([embeddings_path], graph_path,\n",
+ " graph_builder_config)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ITkf2SLg1TG7"
+ },
+ "outputs": [],
+ "source": [
+ "\"\"\"Custom Artifact type\"\"\"\n",
+ "\n",
+ "\n",
+ "class SynthesizedGraph(tfx.types.artifact.Artifact):\n",
+ " \"\"\"Output artifact of the SynthesizeGraph component\"\"\"\n",
+ " TYPE_NAME = 'SynthesizedGraphPath'\n",
+ " PROPERTIES = {\n",
+ " 'span': standard_artifacts.SPAN_PROPERTY,\n",
+ " 'split_names': standard_artifacts.SPLIT_NAMES_PROPERTY,\n",
+ " }\n",
+ "\n",
+ "\n",
+ "@component\n",
+ "def SynthesizeGraph(identified_examples: InputArtifact[Examples],\n",
+ " synthesized_graph: OutputArtifact[SynthesizedGraph],\n",
+ " similarity_threshold: Parameter[float],\n",
+ " component_name: Parameter[str]) -> None:\n",
+ "\n",
+ " # Get a list of the splits in input_data\n",
+ " splits_list = artifact_utils.decode_split_names(\n",
+ " split_names=identified_examples.split_names)\n",
+ "\n",
+ " # We build a graph only based on the 'Split-train' split which includes both\n",
+ " # labeled and unlabeled examples.\n",
+ " train_input_examples_uri = os.path.join(identified_examples.uri,\n",
+ " 'Split-train')\n",
+ " output_graph_uri = os.path.join(synthesized_graph.uri, 'Split-train')\n",
+ " os.mkdir(output_graph_uri)\n",
+ "\n",
+ " print('Creating embeddings...')\n",
+ " create_embeddings(train_input_examples_uri, output_graph_uri)\n",
+ "\n",
+ " print('Synthesizing graph...')\n",
+ " build_graph(output_graph_uri, similarity_threshold)\n",
+ "\n",
+ " synthesized_graph.split_names = artifact_utils.encode_split_names(\n",
+ " splits=['Split-train'])\n",
+ "\n",
+ " return"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "H0ZkHvJMA-0G"
+ },
+ "outputs": [],
+ "source": [
+ "synthesize_graph = SynthesizeGraph(\n",
+ " identified_examples=identify_examples.outputs['identified_examples'],\n",
+ " component_name=u'SynthesizeGraph',\n",
+ " similarity_threshold=0.99)\n",
+ "context.run(synthesize_graph, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "o54M-0Q11FcS"
+ },
+ "outputs": [],
+ "source": [
+ "train_uri = synthesize_graph.outputs[\"synthesized_graph\"].get()[0].uri\n",
+ "os.listdir(train_uri)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "IRK_rS_q1UcZ"
+ },
+ "outputs": [],
+ "source": [
+ "graph_path = os.path.join(train_uri, \"Split-train\", \"graph.tsv\")\n",
+ "print(\"node 1\\t\\t\\t\\t\\tnode 2\\t\\t\\t\\t\\tsimilarity\")\n",
+ "!head {graph_path}\n",
+ "print(\"...\")\n",
+ "!tail {graph_path}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "uybqyWztvCGm"
+ },
+ "outputs": [],
+ "source": [
+ "!wc -l {graph_path}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JPViEz5RlA36"
+ },
+ "source": [
+ "### The Transform Component\n",
+ "\n",
+ "The `Transform` component performs data transformations and feature engineering. The results include an input TensorFlow graph which is used during both training and serving to preprocess the data before training or inference. This graph becomes part of the SavedModel that is the result of model training. Since the same input graph is used for both training and serving, the preprocessing will always be the same, and only needs to be written once.\n",
+ "\n",
+ "The Transform component requires more code than many other components because of the arbitrary complexity of the feature engineering that you may need for the data and/or model that you're working with. It requires code files to be available which define the processing needed."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_USkfut69gNW"
+ },
+ "source": [
+ "Each sample will include the following three features:\n",
+ "\n",
+ "1. **id**: The node ID of the sample.\n",
+ "2. **text_xf**: An int64 list containing word IDs.\n",
+ "3. **label_xf**: A singleton int64 identifying the target class of the review: 0=negative, 1=positive."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XUYeCayFG7kH"
+ },
+ "source": [
+ "Let's define a module containing the `preprocessing_fn()` function that we will pass to the `Transform` component:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "7uuWiQbOG9ki"
+ },
+ "outputs": [],
+ "source": [
+ "_transform_module_file = 'imdb_transform.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "v3EIuVQnBfH7"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_transform_module_file}\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "\n",
+ "import tensorflow_transform as tft\n",
+ "\n",
+ "SEQUENCE_LENGTH = 100\n",
+ "VOCAB_SIZE = 10000\n",
+ "OOV_SIZE = 100\n",
+ "\n",
+ "def tokenize_reviews(reviews, sequence_length=SEQUENCE_LENGTH):\n",
+ " reviews = tf.strings.lower(reviews)\n",
+ " reviews = tf.strings.regex_replace(reviews, r\" '| '|^'|'$\", \" \")\n",
+ " reviews = tf.strings.regex_replace(reviews, \"[^a-z' ]\", \" \")\n",
+ " tokens = tf.strings.split(reviews)[:, :sequence_length]\n",
+ " start_tokens = tf.fill([tf.shape(reviews)[0], 1], \"\")\n",
+ " end_tokens = tf.fill([tf.shape(reviews)[0], 1], \"\")\n",
+ " tokens = tf.concat([start_tokens, tokens, end_tokens], axis=1)\n",
+ " tokens = tokens[:, :sequence_length]\n",
+ " tokens = tokens.to_tensor(default_value=\"\")\n",
+ " pad = sequence_length - tf.shape(tokens)[1]\n",
+ " tokens = tf.pad(tokens, [[0, 0], [0, pad]], constant_values=\"\")\n",
+ " return tf.reshape(tokens, [-1, sequence_length])\n",
+ "\n",
+ "def preprocessing_fn(inputs):\n",
+ " \"\"\"tf.transform's callback function for preprocessing inputs.\n",
+ "\n",
+ " Args:\n",
+ " inputs: map from feature keys to raw not-yet-transformed features.\n",
+ "\n",
+ " Returns:\n",
+ " Map from string feature key to transformed feature operations.\n",
+ " \"\"\"\n",
+ " outputs = {}\n",
+ " outputs[\"id\"] = inputs[\"id\"]\n",
+ " tokens = tokenize_reviews(_fill_in_missing(inputs[\"text\"], ''))\n",
+ " outputs[\"text_xf\"] = tft.compute_and_apply_vocabulary(\n",
+ " tokens,\n",
+ " top_k=VOCAB_SIZE,\n",
+ " num_oov_buckets=OOV_SIZE)\n",
+ " outputs[\"label_xf\"] = _fill_in_missing(inputs[\"label\"], -1)\n",
+ " return outputs\n",
+ "\n",
+ "def _fill_in_missing(x, default_value):\n",
+ " \"\"\"Replace missing values in a SparseTensor.\n",
+ "\n",
+ " Fills in missing values of `x` with the default_value.\n",
+ "\n",
+ " Args:\n",
+ " x: A `SparseTensor` of rank 2. Its dense shape should have size at most 1\n",
+ " in the second dimension.\n",
+ " default_value: the value with which to replace the missing values.\n",
+ "\n",
+ " Returns:\n",
+ " A rank 1 tensor where missing values of `x` have been filled in.\n",
+ " \"\"\"\n",
+ " if not isinstance(x, tf.sparse.SparseTensor):\n",
+ " return x\n",
+ " return tf.squeeze(\n",
+ " tf.sparse.to_dense(\n",
+ " tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
+ " default_value),\n",
+ " axis=1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eeMVMafpHHX1"
+ },
+ "source": [
+ "Create and run the `Transform` component, referring to the files that were created above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "jHfhth_GiZI9"
+ },
+ "outputs": [],
+ "source": [
+ "# Performs transformations and feature engineering in training and serving.\n",
+ "transform = Transform(\n",
+ " examples=identify_examples.outputs['identified_examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " module_file=_transform_module_file)\n",
+ "context.run(transform, enable_cache=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_jbZO1ykHOeG"
+ },
+ "source": [
+ "The `Transform` component has 2 types of outputs:\n",
+ "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
+ "* `transformed_examples` represents the preprocessed training and evaluation data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "j4UjersvAC7p"
+ },
+ "outputs": [],
+ "source": [
+ "transform.outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wRFMlRcdHlQy"
+ },
+ "source": [
+ "Take a peek at the `transform_graph` artifact: it points to a directory containing 3 subdirectories:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "E4I-cqfQQvaW"
+ },
+ "outputs": [],
+ "source": [
+ "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
+ "os.listdir(train_uri)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9374B4RpHzor"
+ },
+ "source": [
+ "The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data. The `transformed_metadata` subdirectory contains the schema of the preprocessed data.\n",
+ "\n",
+ "Take a look at some of the transformed examples and check that they are indeed processed as intended."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "-QPONyzDTswf"
+ },
+ "outputs": [],
+ "source": [
+ "def pprint_examples(artifact, n_examples=3):\n",
+ " print(\"artifact:\", artifact)\n",
+ " uri = os.path.join(artifact.uri, \"Split-train\")\n",
+ " print(\"uri:\", uri)\n",
+ " tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
+ " print(\"tfrecord_filenames:\", tfrecord_filenames)\n",
+ " dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
+ " for tfrecord in dataset.take(n_examples):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example.FromString(serialized_example)\n",
+ " pp.pprint(example)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2zIepQhSQoPa"
+ },
+ "outputs": [],
+ "source": [
+ "pprint_examples(transform.outputs['transformed_examples'].get()[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vpGvPKielIvI"
+ },
+ "source": [
+ "### The GraphAugmentation Component\n",
+ "\n",
+ "Since we have the sample features and the synthesized graph, we can generate the\n",
+ "augmented training data for Neural Structured Learning. The NSL framework\n",
+ "provides a library to combine the graph and the sample features to produce\n",
+ "the final training data for graph regularization. The resulting training data\n",
+ "will include original sample features as well as features of their corresponding\n",
+ "neighbors.\n",
+ "\n",
+ "In this tutorial, we consider undirected edges and use a maximum of 3 neighbors\n",
+ "per sample to augment training data with graph neighbors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "gI6P_-AXGm04"
+ },
+ "outputs": [],
+ "source": [
+ "def split_train_and_unsup(input_uri):\n",
+ " 'Separate the labeled and unlabeled instances.'\n",
+ "\n",
+ " tmp_dir = tempfile.mkdtemp(prefix='tfx-data')\n",
+ " tfrecord_filenames = [\n",
+ " os.path.join(input_uri, filename) for filename in os.listdir(input_uri)\n",
+ " ]\n",
+ " train_path = os.path.join(tmp_dir, 'train.tfrecord')\n",
+ " unsup_path = os.path.join(tmp_dir, 'unsup.tfrecord')\n",
+ " with tf.io.TFRecordWriter(train_path) as train_writer, \\\n",
+ " tf.io.TFRecordWriter(unsup_path) as unsup_writer:\n",
+ " for tfrecord in tf.data.TFRecordDataset(\n",
+ " tfrecord_filenames, compression_type='GZIP'):\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(tfrecord.numpy())\n",
+ " if ('label_xf' not in example.features.feature or\n",
+ " example.features.feature['label_xf'].int64_list.value[0] == -1):\n",
+ " writer = unsup_writer\n",
+ " else:\n",
+ " writer = train_writer\n",
+ " writer.write(tfrecord.numpy())\n",
+ " return train_path, unsup_path\n",
+ "\n",
+ "\n",
+ "def gzip(filepath):\n",
+ " with open(filepath, 'rb') as f_in:\n",
+ " with gzip_lib.open(filepath + '.gz', 'wb') as f_out:\n",
+ " shutil.copyfileobj(f_in, f_out)\n",
+ " os.remove(filepath)\n",
+ "\n",
+ "\n",
+ "def copy_tfrecords(input_uri, output_uri):\n",
+ " for filename in os.listdir(input_uri):\n",
+ " input_filename = os.path.join(input_uri, filename)\n",
+ " output_filename = os.path.join(output_uri, filename)\n",
+ " shutil.copyfile(input_filename, output_filename)\n",
+ "\n",
+ "\n",
+ "@component\n",
+ "def GraphAugmentation(identified_examples: InputArtifact[Examples],\n",
+ " synthesized_graph: InputArtifact[SynthesizedGraph],\n",
+ " augmented_examples: OutputArtifact[Examples],\n",
+ " num_neighbors: Parameter[int],\n",
+ " component_name: Parameter[str]) -> None:\n",
+ "\n",
+ " # Get a list of the splits in input_data\n",
+ " splits_list = artifact_utils.decode_split_names(\n",
+ " split_names=identified_examples.split_names)\n",
+ "\n",
+ " train_input_uri = os.path.join(identified_examples.uri, 'Split-train')\n",
+ " eval_input_uri = os.path.join(identified_examples.uri, 'Split-eval')\n",
+ " train_graph_uri = os.path.join(synthesized_graph.uri, 'Split-train')\n",
+ " train_output_uri = os.path.join(augmented_examples.uri, 'Split-train')\n",
+ " eval_output_uri = os.path.join(augmented_examples.uri, 'Split-eval')\n",
+ "\n",
+ " os.mkdir(train_output_uri)\n",
+ " os.mkdir(eval_output_uri)\n",
+ "\n",
+ " # Separate the labeled and unlabeled examples from the 'Split-train' split.\n",
+ " train_path, unsup_path = split_train_and_unsup(train_input_uri)\n",
+ "\n",
+ " output_path = os.path.join(train_output_uri, 'nsl_train_data.tfr')\n",
+ " pack_nbrs_args = dict(\n",
+ " labeled_examples_path=train_path,\n",
+ " unlabeled_examples_path=unsup_path,\n",
+ " graph_path=os.path.join(train_graph_uri, 'graph.tsv'),\n",
+ " output_training_data_path=output_path,\n",
+ " add_undirected_edges=True,\n",
+ " max_nbrs=num_neighbors)\n",
+ " print('nsl.tools.pack_nbrs arguments:', pack_nbrs_args)\n",
+ " nsl.tools.pack_nbrs(**pack_nbrs_args)\n",
+ "\n",
+ " # Downstream components expect gzip'ed TFRecords.\n",
+ " gzip(output_path)\n",
+ "\n",
+ " # The test examples are left untouched and are simply copied over.\n",
+ " copy_tfrecords(eval_input_uri, eval_output_uri)\n",
+ "\n",
+ " augmented_examples.split_names = identified_examples.split_names\n",
+ "\n",
+ " return"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "r9MIEVDiOANe"
+ },
+ "outputs": [],
+ "source": [
+ "# Augments training data with graph neighbors.\n",
+ "graph_augmentation = GraphAugmentation(\n",
+ " identified_examples=transform.outputs['transformed_examples'],\n",
+ " synthesized_graph=synthesize_graph.outputs['synthesized_graph'],\n",
+ " component_name=u'GraphAugmentation',\n",
+ " num_neighbors=3)\n",
+ "context.run(graph_augmentation, enable_cache=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "gpSLs3Hx8viI"
+ },
+ "outputs": [],
+ "source": [
+ "pprint_examples(graph_augmentation.outputs['augmented_examples'].get()[0], 6)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OBJFtnl6lCg9"
+ },
+ "source": [
+ "### The Trainer Component\n",
+ "\n",
+ "The `Trainer` component trains models using TensorFlow.\n",
+ "\n",
+ "Create a Python module containing a `trainer_fn` function, which must return an estimator. If you prefer creating a Keras model, you can do so and then convert it to an estimator using `keras.model_to_estimator()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "5ajvClE6b2pd"
+ },
+ "outputs": [],
+ "source": [
+ "# Setup paths.\n",
+ "_trainer_module_file = 'imdb_trainer.py'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_dh6AejVk2Oq"
+ },
+ "outputs": [],
+ "source": [
+ "%%writefile {_trainer_module_file}\n",
+ "\n",
+ "import neural_structured_learning as nsl\n",
+ "\n",
+ "import tensorflow as tf\n",
+ "\n",
+ "import tensorflow_model_analysis as tfma\n",
+ "import tensorflow_transform as tft\n",
+ "from tensorflow_transform.tf_metadata import schema_utils\n",
+ "\n",
+ "\n",
+ "NBR_FEATURE_PREFIX = 'NL_nbr_'\n",
+ "NBR_WEIGHT_SUFFIX = '_weight'\n",
+ "LABEL_KEY = 'label'\n",
+ "ID_FEATURE_KEY = 'id'\n",
+ "\n",
+ "def _transformed_name(key):\n",
+ " return key + '_xf'\n",
+ "\n",
+ "\n",
+ "def _transformed_names(keys):\n",
+ " return [_transformed_name(key) for key in keys]\n",
+ "\n",
+ "\n",
+ "# Hyperparameters:\n",
+ "#\n",
+ "# We will use an instance of `HParams` to include various hyperparameters and\n",
+ "# constants used for training and evaluation. We briefly describe each of them\n",
+ "# below:\n",
+ "#\n",
+ "# - max_seq_length: This is the maximum number of words considered from each\n",
+ "# movie review in this example.\n",
+ "# - vocab_size: This is the size of the vocabulary considered for this\n",
+ "# example.\n",
+ "# - oov_size: This is the out-of-vocabulary size considered for this example.\n",
+ "# - distance_type: This is the distance metric used to regularize the sample\n",
+ "# with its neighbors.\n",
+ "# - graph_regularization_multiplier: This controls the relative weight of the\n",
+ "# graph regularization term in the overall\n",
+ "# loss function.\n",
+ "# - num_neighbors: The number of neighbors used for graph regularization. This\n",
+ "# value has to be less than or equal to the `num_neighbors`\n",
+ "# argument used above in the GraphAugmentation component when\n",
+ "# invoking `nsl.tools.pack_nbrs`.\n",
+ "# - num_fc_units: The number of units in the fully connected layer of the\n",
+ "# neural network.\n",
+ "class HParams(object):\n",
+ " \"\"\"Hyperparameters used for training.\"\"\"\n",
+ " def __init__(self):\n",
+ " ### dataset parameters\n",
+ " # The following 3 values should match those defined in the Transform\n",
+ " # Component.\n",
+ " self.max_seq_length = 100\n",
+ " self.vocab_size = 10000\n",
+ " self.oov_size = 100\n",
+ " ### Neural Graph Learning parameters\n",
+ " self.distance_type = nsl.configs.DistanceType.L2\n",
+ " self.graph_regularization_multiplier = 0.1\n",
+ " # The following value has to be at most the value of 'num_neighbors' used\n",
+ " # in the GraphAugmentation component.\n",
+ " self.num_neighbors = 1\n",
+ " ### Model Architecture\n",
+ " self.num_embedding_dims = 16\n",
+ " self.num_fc_units = 64\n",
+ "\n",
+ "HPARAMS = HParams()\n",
+ "\n",
+ "\n",
+ "def optimizer_fn():\n",
+ " \"\"\"Returns an instance of `tf.Optimizer`.\"\"\"\n",
+ " return tf.compat.v1.train.RMSPropOptimizer(\n",
+ " learning_rate=0.0001, decay=1e-6)\n",
+ "\n",
+ "\n",
+ "def build_train_op(loss, global_step):\n",
+ " \"\"\"Builds a train op to optimize the given loss using gradient descent.\"\"\"\n",
+ " with tf.name_scope('train'):\n",
+ " optimizer = optimizer_fn()\n",
+ " train_op = optimizer.minimize(loss=loss, global_step=global_step)\n",
+ " return train_op\n",
+ "\n",
+ "\n",
+ "# Building the model:\n",
+ "#\n",
+ "# A neural network is created by stacking layers—this requires two main\n",
+ "# architectural decisions:\n",
+ "# * How many layers to use in the model?\n",
+ "# * How many *hidden units* to use for each layer?\n",
+ "#\n",
+ "# In this example, the input data consists of an array of word-indices. The\n",
+ "# labels to predict are either 0 or 1. We will use a feed-forward neural network\n",
+ "# as our base model in this tutorial.\n",
+ "def feed_forward_model(features, is_training, reuse=tf.compat.v1.AUTO_REUSE):\n",
+ " \"\"\"Builds a simple 2 layer feed forward neural network.\n",
+ "\n",
+ " The layers are effectively stacked sequentially to build the classifier. The\n",
+ " first layer is an Embedding layer, which takes the integer-encoded vocabulary\n",
+ " and looks up the embedding vector for each word-index. These vectors are\n",
+ " learned as the model trains. The vectors add a dimension to the output array.\n",
+ " The resulting dimensions are: (batch, sequence, embedding). Next is a global\n",
+ " average pooling 1D layer, which reduces the dimensionality of its inputs from\n",
+ " 3D to 2D. This fixed-length output vector is piped through a fully-connected\n",
+ " (Dense) layer with 16 hidden units. The last layer is densely connected with a\n",
+ " single output node. Using the sigmoid activation function, this value is a\n",
+ " float between 0 and 1, representing a probability, or confidence level.\n",
+ "\n",
+ " Args:\n",
+ " features: A dictionary containing batch features returned from the\n",
+ " `input_fn`, that include sample features, corresponding neighbor features,\n",
+ " and neighbor weights.\n",
+ " is_training: a Python Boolean value or a Boolean scalar Tensor, indicating\n",
+ " whether to apply dropout.\n",
+ " reuse: a Python Boolean value for reusing variable scope.\n",
+ "\n",
+ " Returns:\n",
+ " logits: Tensor of shape [batch_size, 1].\n",
+ " representations: Tensor of shape [batch_size, _] for graph regularization.\n",
+ " This is the representation of each example at the graph regularization\n",
+ " layer.\n",
+ " \"\"\"\n",
+ "\n",
+ " with tf.compat.v1.variable_scope('ff', reuse=reuse):\n",
+ " inputs = features[_transformed_name('text')]\n",
+ " embeddings = tf.compat.v1.get_variable(\n",
+ " 'embeddings',\n",
+ " shape=[\n",
+ " HPARAMS.vocab_size + HPARAMS.oov_size, HPARAMS.num_embedding_dims\n",
+ " ])\n",
+ " embedding_layer = tf.nn.embedding_lookup(embeddings, inputs)\n",
+ "\n",
+ " pooling_layer = tf.compat.v1.layers.AveragePooling1D(\n",
+ " pool_size=HPARAMS.max_seq_length, strides=HPARAMS.max_seq_length)(\n",
+ " embedding_layer)\n",
+ " # Shape of pooling_layer is now [batch_size, 1, HPARAMS.num_embedding_dims]\n",
+ " pooling_layer = tf.reshape(pooling_layer, [-1, HPARAMS.num_embedding_dims])\n",
+ "\n",
+ " dense_layer = tf.compat.v1.layers.Dense(\n",
+ " 16, activation='relu')(\n",
+ " pooling_layer)\n",
+ "\n",
+ " output_layer = tf.compat.v1.layers.Dense(\n",
+ " 1, activation='sigmoid')(\n",
+ " dense_layer)\n",
+ "\n",
+ " # Graph regularization will be done on the penultimate (dense) layer\n",
+ " # because the output layer is a single floating point number.\n",
+ " return output_layer, dense_layer\n",
+ "\n",
+ "\n",
+ "# A note on hidden units:\n",
+ "#\n",
+ "# The above model has two intermediate or \"hidden\" layers, between the input and\n",
+ "# output, and excluding the Embedding layer. The number of outputs (units,\n",
+ "# nodes, or neurons) is the dimension of the representational space for the\n",
+ "# layer. In other words, the amount of freedom the network is allowed when\n",
+ "# learning an internal representation. If a model has more hidden units\n",
+ "# (a higher-dimensional representation space), and/or more layers, then the\n",
+ "# network can learn more complex representations. However, it makes the network\n",
+ "# more computationally expensive and may lead to learning unwanted\n",
+ "# patterns—patterns that improve performance on training data but not on the\n",
+ "# test data. This is called overfitting.\n",
+ "\n",
+ "\n",
+ "# This function will be used to generate the embeddings for samples and their\n",
+ "# corresponding neighbors, which will then be used for graph regularization.\n",
+ "def embedding_fn(features, mode, **params):\n",
+ " \"\"\"Returns the embedding corresponding to the given features.\n",
+ "\n",
+ " Args:\n",
+ " features: A dictionary containing batch features returned from the\n",
+ " `input_fn`, that include sample features, corresponding neighbor features,\n",
+ " and neighbor weights.\n",
+ " mode: Specifies if this is training, evaluation, or prediction. See\n",
+ " tf.estimator.ModeKeys.\n",
+ "\n",
+ " Returns:\n",
+ " The embedding that will be used for graph regularization.\n",
+ " \"\"\"\n",
+ " is_training = (mode == tf.estimator.ModeKeys.TRAIN)\n",
+ " _, embedding = feed_forward_model(features, is_training)\n",
+ " return embedding\n",
+ "\n",
+ "\n",
+ "def feed_forward_model_fn(features, labels, mode, params, config):\n",
+ " \"\"\"Implementation of the model_fn for the base feed-forward model.\n",
+ "\n",
+ " Args:\n",
+ " features: This is the first item returned from the `input_fn` passed to\n",
+ " `train`, `evaluate`, and `predict`. This should be a single `Tensor` or\n",
+ " `dict` of same.\n",
+ " labels: This is the second item returned from the `input_fn` passed to\n",
+ " `train`, `evaluate`, and `predict`. This should be a single `Tensor` or\n",
+ " `dict` of same (for multi-head models). If mode is `ModeKeys.PREDICT`,\n",
+ " `labels=None` will be passed. If the `model_fn`'s signature does not\n",
+ " accept `mode`, the `model_fn` must still be able to handle `labels=None`.\n",
+ " mode: Optional. Specifies if this training, evaluation or prediction. See\n",
+ " `ModeKeys`.\n",
+ " params: An HParams instance as returned by get_hyper_parameters().\n",
+ " config: Optional configuration object. Will receive what is passed to\n",
+ " Estimator in `config` parameter, or the default `config`. Allows updating\n",
+ " things in your model_fn based on configuration such as `num_ps_replicas`,\n",
+ " or `model_dir`. Unused currently.\n",
+ "\n",
+ " Returns:\n",
+ " A `tf.estimator.EstimatorSpec` for the base feed-forward model. This does\n",
+ " not include graph-based regularization.\n",
+ " \"\"\"\n",
+ "\n",
+ " is_training = mode == tf.estimator.ModeKeys.TRAIN\n",
+ "\n",
+ " # Build the computation graph.\n",
+ " probabilities, _ = feed_forward_model(features, is_training)\n",
+ " predictions = tf.round(probabilities)\n",
+ "\n",
+ " if mode == tf.estimator.ModeKeys.PREDICT:\n",
+ " # labels will be None, and no loss to compute.\n",
+ " cross_entropy_loss = None\n",
+ " eval_metric_ops = None\n",
+ " else:\n",
+ " # Loss is required in train and eval modes.\n",
+ " # Flatten 'probabilities' to 1-D.\n",
+ " probabilities = tf.reshape(probabilities, shape=[-1])\n",
+ " cross_entropy_loss = tf.compat.v1.keras.losses.binary_crossentropy(\n",
+ " labels, probabilities)\n",
+ " eval_metric_ops = {\n",
+ " 'accuracy': tf.compat.v1.metrics.accuracy(labels, predictions)\n",
+ " }\n",
+ "\n",
+ " if is_training:\n",
+ " global_step = tf.compat.v1.train.get_or_create_global_step()\n",
+ " train_op = build_train_op(cross_entropy_loss, global_step)\n",
+ " else:\n",
+ " train_op = None\n",
+ "\n",
+ " return tf.estimator.EstimatorSpec(\n",
+ " mode=mode,\n",
+ " predictions={\n",
+ " 'probabilities': probabilities,\n",
+ " 'predictions': predictions\n",
+ " },\n",
+ " loss=cross_entropy_loss,\n",
+ " train_op=train_op,\n",
+ " eval_metric_ops=eval_metric_ops)\n",
+ "\n",
+ "\n",
+ "# Tf.Transform considers these features as \"raw\"\n",
+ "def _get_raw_feature_spec(schema):\n",
+ " return schema_utils.schema_as_feature_spec(schema).feature_spec\n",
+ "\n",
+ "\n",
+ "def _gzip_reader_fn(filenames):\n",
+ " \"\"\"Small utility returning a record reader that can read gzip'ed files.\"\"\"\n",
+ " return tf.data.TFRecordDataset(\n",
+ " filenames,\n",
+ " compression_type='GZIP')\n",
+ "\n",
+ "\n",
+ "def _example_serving_receiver_fn(tf_transform_output, schema):\n",
+ " \"\"\"Build the serving in inputs.\n",
+ "\n",
+ " Args:\n",
+ " tf_transform_output: A TFTransformOutput.\n",
+ " schema: the schema of the input data.\n",
+ "\n",
+ " Returns:\n",
+ " Tensorflow graph which parses examples, applying tf-transform to them.\n",
+ " \"\"\"\n",
+ " raw_feature_spec = _get_raw_feature_spec(schema)\n",
+ " raw_feature_spec.pop(LABEL_KEY)\n",
+ "\n",
+ " # We don't need the ID feature for serving.\n",
+ " raw_feature_spec.pop(ID_FEATURE_KEY)\n",
+ "\n",
+ " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
+ " raw_feature_spec, default_batch_size=None)\n",
+ " serving_input_receiver = raw_input_fn()\n",
+ "\n",
+ " transformed_features = tf_transform_output.transform_raw_features(\n",
+ " serving_input_receiver.features)\n",
+ "\n",
+ " # Even though, LABEL_KEY was removed from 'raw_feature_spec', the transform\n",
+ " # operation would have injected the transformed LABEL_KEY feature with a\n",
+ " # default value.\n",
+ " transformed_features.pop(_transformed_name(LABEL_KEY))\n",
+ " return tf.estimator.export.ServingInputReceiver(\n",
+ " transformed_features, serving_input_receiver.receiver_tensors)\n",
+ "\n",
+ "\n",
+ "def _eval_input_receiver_fn(tf_transform_output, schema):\n",
+ " \"\"\"Build everything needed for the tf-model-analysis to run the model.\n",
+ "\n",
+ " Args:\n",
+ " tf_transform_output: A TFTransformOutput.\n",
+ " schema: the schema of the input data.\n",
+ "\n",
+ " Returns:\n",
+ " EvalInputReceiver function, which contains:\n",
+ " - Tensorflow graph which parses raw untransformed features, applies the\n",
+ " tf-transform preprocessing operators.\n",
+ " - Set of raw, untransformed features.\n",
+ " - Label against which predictions will be compared.\n",
+ " \"\"\"\n",
+ " # Notice that the inputs are raw features, not transformed features here.\n",
+ " raw_feature_spec = _get_raw_feature_spec(schema)\n",
+ "\n",
+ " # We don't need the ID feature for TFMA.\n",
+ " raw_feature_spec.pop(ID_FEATURE_KEY)\n",
+ "\n",
+ " raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
+ " raw_feature_spec, default_batch_size=None)\n",
+ " serving_input_receiver = raw_input_fn()\n",
+ "\n",
+ " transformed_features = tf_transform_output.transform_raw_features(\n",
+ " serving_input_receiver.features)\n",
+ "\n",
+ " labels = transformed_features.pop(_transformed_name(LABEL_KEY))\n",
+ " return tfma.export.EvalInputReceiver(\n",
+ " features=transformed_features,\n",
+ " receiver_tensors=serving_input_receiver.receiver_tensors,\n",
+ " labels=labels)\n",
+ "\n",
+ "\n",
+ "def _augment_feature_spec(feature_spec, num_neighbors):\n",
+ " \"\"\"Augments `feature_spec` to include neighbor features.\n",
+ " Args:\n",
+ " feature_spec: Dictionary of feature keys mapping to TF feature types.\n",
+ " num_neighbors: Number of neighbors to use for feature key augmentation.\n",
+ " Returns:\n",
+ " An augmented `feature_spec` that includes neighbor feature keys.\n",
+ " \"\"\"\n",
+ " for i in range(num_neighbors):\n",
+ " feature_spec['{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'id')] = \\\n",
+ " tf.io.VarLenFeature(dtype=tf.string)\n",
+ " # We don't care about the neighbor features corresponding to\n",
+ " # _transformed_name(LABEL_KEY) because the LABEL_KEY feature will be\n",
+ " # removed from the feature spec during training/evaluation.\n",
+ " feature_spec['{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'text_xf')] = \\\n",
+ " tf.io.FixedLenFeature(shape=[HPARAMS.max_seq_length], dtype=tf.int64,\n",
+ " default_value=tf.constant(0, dtype=tf.int64,\n",
+ " shape=[HPARAMS.max_seq_length]))\n",
+ " # The 'NL_num_nbrs' features is currently not used.\n",
+ "\n",
+ " # Set the neighbor weight feature keys.\n",
+ " for i in range(num_neighbors):\n",
+ " feature_spec['{}{}{}'.format(NBR_FEATURE_PREFIX, i, NBR_WEIGHT_SUFFIX)] = \\\n",
+ " tf.io.FixedLenFeature(shape=[1], dtype=tf.float32, default_value=[0.0])\n",
+ "\n",
+ " return feature_spec\n",
+ "\n",
+ "\n",
+ "def _input_fn(filenames, tf_transform_output, is_training, batch_size=200):\n",
+ " \"\"\"Generates features and labels for training or evaluation.\n",
+ "\n",
+ " Args:\n",
+ " filenames: [str] list of CSV files to read data from.\n",
+ " tf_transform_output: A TFTransformOutput.\n",
+ " is_training: Boolean indicating if we are in training mode.\n",
+ " batch_size: int First dimension size of the Tensors returned by input_fn\n",
+ "\n",
+ " Returns:\n",
+ " A (features, indices) tuple where features is a dictionary of\n",
+ " Tensors, and indices is a single Tensor of label indices.\n",
+ " \"\"\"\n",
+ " transformed_feature_spec = (\n",
+ " tf_transform_output.transformed_feature_spec().copy())\n",
+ "\n",
+ " # During training, NSL uses augmented training data (which includes features\n",
+ " # from graph neighbors). So, update the feature spec accordingly. This needs\n",
+ " # to be done because we are using different schemas for NSL training and eval,\n",
+ " # but the Trainer Component only accepts a single schema.\n",
+ " if is_training:\n",
+ " transformed_feature_spec =_augment_feature_spec(transformed_feature_spec,\n",
+ " HPARAMS.num_neighbors)\n",
+ "\n",
+ " dataset = tf.data.experimental.make_batched_features_dataset(\n",
+ " filenames, batch_size, transformed_feature_spec, reader=_gzip_reader_fn)\n",
+ "\n",
+ " transformed_features = tf.compat.v1.data.make_one_shot_iterator(\n",
+ " dataset).get_next()\n",
+ " # We pop the label because we do not want to use it as a feature while we're\n",
+ " # training.\n",
+ " return transformed_features, transformed_features.pop(\n",
+ " _transformed_name(LABEL_KEY))\n",
+ "\n",
+ "\n",
+ "# TFX will call this function\n",
+ "def trainer_fn(hparams, schema):\n",
+ " \"\"\"Build the estimator using the high level API.\n",
+ " Args:\n",
+ " hparams: Holds hyperparameters used to train the model as name/value pairs.\n",
+ " schema: Holds the schema of the training examples.\n",
+ " Returns:\n",
+ " A dict of the following:\n",
+ " - estimator: The estimator that will be used for training and eval.\n",
+ " - train_spec: Spec for training.\n",
+ " - eval_spec: Spec for eval.\n",
+ " - eval_input_receiver_fn: Input function for eval.\n",
+ " \"\"\"\n",
+ " train_batch_size = 40\n",
+ " eval_batch_size = 40\n",
+ "\n",
+ " tf_transform_output = tft.TFTransformOutput(hparams.transform_output)\n",
+ "\n",
+ " train_input_fn = lambda: _input_fn(\n",
+ " hparams.train_files,\n",
+ " tf_transform_output,\n",
+ " is_training=True,\n",
+ " batch_size=train_batch_size)\n",
+ "\n",
+ " eval_input_fn = lambda: _input_fn(\n",
+ " hparams.eval_files,\n",
+ " tf_transform_output,\n",
+ " is_training=False,\n",
+ " batch_size=eval_batch_size)\n",
+ "\n",
+ " train_spec = tf.estimator.TrainSpec(\n",
+ " train_input_fn,\n",
+ " max_steps=hparams.train_steps)\n",
+ "\n",
+ " serving_receiver_fn = lambda: _example_serving_receiver_fn(\n",
+ " tf_transform_output, schema)\n",
+ "\n",
+ " exporter = tf.estimator.FinalExporter('imdb', serving_receiver_fn)\n",
+ " eval_spec = tf.estimator.EvalSpec(\n",
+ " eval_input_fn,\n",
+ " steps=hparams.eval_steps,\n",
+ " exporters=[exporter],\n",
+ " name='imdb-eval')\n",
+ "\n",
+ " run_config = tf.estimator.RunConfig(\n",
+ " save_checkpoints_steps=999, keep_checkpoint_max=1)\n",
+ "\n",
+ " run_config = run_config.replace(model_dir=hparams.serving_model_dir)\n",
+ "\n",
+ " estimator = tf.estimator.Estimator(\n",
+ " model_fn=feed_forward_model_fn, config=run_config, params=HPARAMS)\n",
+ "\n",
+ " # Create a graph regularization config.\n",
+ " graph_reg_config = nsl.configs.make_graph_reg_config(\n",
+ " max_neighbors=HPARAMS.num_neighbors,\n",
+ " multiplier=HPARAMS.graph_regularization_multiplier,\n",
+ " distance_type=HPARAMS.distance_type,\n",
+ " sum_over_axis=-1)\n",
+ "\n",
+ " # Invoke the Graph Regularization Estimator wrapper to incorporate\n",
+ " # graph-based regularization for training.\n",
+ " graph_nsl_estimator = nsl.estimator.add_graph_regularization(\n",
+ " estimator,\n",
+ " embedding_fn,\n",
+ " optimizer_fn=optimizer_fn,\n",
+ " graph_reg_config=graph_reg_config)\n",
+ "\n",
+ " # Create an input receiver for TFMA processing\n",
+ " receiver_fn = lambda: _eval_input_receiver_fn(\n",
+ " tf_transform_output, schema)\n",
+ "\n",
+ " return {\n",
+ " 'estimator': graph_nsl_estimator,\n",
+ " 'train_spec': train_spec,\n",
+ " 'eval_spec': eval_spec,\n",
+ " 'eval_input_receiver_fn': receiver_fn\n",
+ " }"
+ ]
+ },
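The neighbor feature keys constructed by `_augment_feature_spec` above follow a simple positional naming scheme. As a minimal sketch (assuming the `NBR_FEATURE_PREFIX` and `NBR_WEIGHT_SUFFIX` constants defined earlier in this notebook hold NSL's defaults, `'NL_nbr_'` and `'_weight'`; the helper name `neighbor_feature_keys` is hypothetical):

```python
# Sketch of the key-naming scheme used by _augment_feature_spec.
# NBR_FEATURE_PREFIX / NBR_WEIGHT_SUFFIX are assumed to be NSL's defaults;
# the actual values come from the constants defined earlier in the notebook.
NBR_FEATURE_PREFIX = 'NL_nbr_'
NBR_WEIGHT_SUFFIX = '_weight'

def neighbor_feature_keys(feature_name, num_neighbors):
    """Returns the augmented feature and weight keys for each graph neighbor."""
    keys = []
    for i in range(num_neighbors):
        # Per-neighbor copy of the feature, e.g. 'NL_nbr_0_text_xf'.
        keys.append('{}{}_{}'.format(NBR_FEATURE_PREFIX, i, feature_name))
        # Per-neighbor edge weight, e.g. 'NL_nbr_0_weight'.
        keys.append('{}{}{}'.format(NBR_FEATURE_PREFIX, i, NBR_WEIGHT_SUFFIX))
    return keys

print(neighbor_feature_keys('text_xf', 2))
# ['NL_nbr_0_text_xf', 'NL_nbr_0_weight', 'NL_nbr_1_text_xf', 'NL_nbr_1_weight']
```

Each training example therefore carries one feature tensor and one scalar weight per neighbor, which is why the augmented feature spec grows linearly with `num_neighbors`.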
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GnLjStUJIoos"
+ },
+ "source": [
+ "Create and run the `Trainer` component, passing it the file that we created above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MWLQI6t0b2pg"
+ },
+ "outputs": [],
+ "source": [
+ "# Uses user-provided Python function that implements a model using TensorFlow's\n",
+ "# Estimators API.\n",
+ "trainer = Trainer(\n",
+ " module_file=_trainer_module_file,\n",
+ " custom_executor_spec=executor_spec.ExecutorClassSpec(\n",
+ " trainer_executor.Executor),\n",
+ " transformed_examples=graph_augmentation.outputs['augmented_examples'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " transform_graph=transform.outputs['transform_graph'],\n",
+ " train_args=trainer_pb2.TrainArgs(num_steps=10000),\n",
+ " eval_args=trainer_pb2.EvalArgs(num_steps=5000))\n",
+ "context.run(trainer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pDiZvYbFb2ph"
+ },
+ "source": [
+ "Take a peek at the trained model which was exported from `Trainer`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "qDBZG9Oso-BD"
+ },
+ "outputs": [],
+ "source": [
+ "train_uri = trainer.outputs['model'].get()[0].uri\n",
+ "serving_model_path = os.path.join(train_uri, 'Format-Serving')\n",
+ "exported_model = tf.saved_model.load(serving_model_path)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "KyT3ZVGCZWsj"
+ },
+ "outputs": [],
+ "source": [
+ "exported_model.graph.get_operations()[:10] + [\"...\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zIsspBf5GjKm"
+ },
+ "source": [
+ "Let's visualize the model's metrics using Tensorboard."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rnKeqLmcGqHH"
+ },
+ "outputs": [],
+ "source": [
+ "#docs_infra: no_execute\n",
+ "\n",
+ "# Get the URI of the output artifact representing the training logs,\n",
+ "# which is a directory\n",
+ "model_run_dir = trainer.outputs['model_run'].get()[0].uri\n",
+ "\n",
+ "%load_ext tensorboard\n",
+ "%tensorboard --logdir {model_run_dir}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LgZXZJBsGzHm"
+ },
+ "source": [
+ "## Model Serving\n",
+ "\n",
+ "Graph regularization only affects the training workflow by adding a regularization term to the loss function. As a result, the model evaluation and serving workflows remain unchanged. It is for the same reason that we've also omitted downstream TFX components that typically come after the *Trainer* component like the *Evaluator*, *Pusher*, etc."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qOh5FjbWiP-b"
+ },
+ "source": [
+ "## Conclusion\n",
+ "\n",
+ "We have demonstrated the use of graph regularization using the Neural Structured\n",
+ "Learning (NSL) framework in a TFX pipeline even when the input does not contain\n",
+ "an explicit graph. We considered the task of sentiment classification of IMDB\n",
+ "movie reviews for which we synthesized a similarity graph based on review\n",
+ "embeddings. We encourage users to experiment further by using different\n",
+ "embeddings for graph construction, varying hyperparameters, changing the amount\n",
+ "of supervision, and by defining different model architectures."
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [
+ "24gYiJcWNlpA"
+ ],
+ "name": "Neural_Structured_Learning.ipynb",
+ "private_outputs": true,
+ "provenance": [],
+ "toc_visible": true
+ },
+ "file_extension": ".py",
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "mimetype": "text/x-python",
+ "name": "python",
+ "npconvert_exporter": "python",
+ "orig_nbformat": 2,
+ "pygments_lexer": "ipython3",
+ "version": 3
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}
diff --git a/docs/tutorials/tfx/penguin_template.ipynb b/docs/tutorials/tfx/penguin_template.ipynb
index 4d343e35cc..d6d41c731e 100644
--- a/docs/tutorials/tfx/penguin_template.ipynb
+++ b/docs/tutorials/tfx/penguin_template.ipynb
@@ -1,53 +1,53 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "DjUA6S30k52h"
- },
- "source": [
- "##### Copyright 2020 The TensorFlow Authors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "SpNWyqewk8fE"
- },
- "outputs": [],
- "source": [
- "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6TyrY7lV0oke"
- },
- "source": [
- "# Create a TFX pipeline for your data with Penguin template\n",
- "\n",
- "---\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ZQmvgl9nsqPW"
- },
- "source": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DjUA6S30k52h"
+ },
+ "source": [
+ "##### Copyright 2020 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SpNWyqewk8fE"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6TyrY7lV0oke"
+ },
+ "source": [
+ "# Create a TFX pipeline for your data with Penguin template\n",
+ "\n",
+ "---\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZQmvgl9nsqPW"
+ },
+ "source": [
"Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n",
"\n",
"\n",
@@ -84,1494 +84,1494 @@
" \n",
"
"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "iLYriYe10okf"
- },
- "source": [
- "## Introduction\n",
- "\n",
- "This document will provide instructions to create a TensorFlow Extended (TFX)\n",
- "pipeline for your own dataset using *penguin template* which is provided with\n",
- "TFX Python package. Created pipeline will be using\n",
- "[Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)\n",
- "dataset initially,\n",
- "but we will transform the pipeline for your dataset.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "M0HDv9FAbUy9"
- },
- "source": [
- "### Prerequisites\n",
- "- Linux / MacOS\n",
- "- Python 3.6-3.8\n",
- "- Jupyter notebook\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "XaXSIXh9czAX"
- },
- "source": [
- "## Step 1. Copy the predefined template to your project directory.\n",
- "In this step, we will create a working pipeline project directory and files\n",
- "by copying files from *penguin template* in TFX. You can think of this as a\n",
- "scaffold for your TFX pipeline project."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "WoUaCyNF_YlF"
- },
- "source": [
- "### Update Pip\n",
- "\n",
- "If we're running in Colab then we should make sure that we have the latest version of Pip. Local systems can of course be updated separately.\n",
- "\n",
- "Note: Updating is probably also a good idea if you are running in Vertex AI Workbench."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "iQIUEpFp_ZA2"
- },
- "outputs": [],
- "source": [
- "import sys\n",
- "if 'google.colab' in sys.modules:\n",
- " !pip install --upgrade pip"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "VmVR97AgacoP"
- },
- "source": [
- "### Install required package\n",
- "First, install TFX and TensorFlow Model Analysis (TFMA).\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XNiqq_kN0okj"
- },
- "outputs": [],
- "source": [
- "!pip install -U tfx tensorflow-model-analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "hX1rqpbQ0okp"
- },
- "source": [
- "Let's check the versions of TFX."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "XAIoKMNG0okq"
- },
- "outputs": [],
- "source": [
- "import tensorflow as tf\n",
- "import tensorflow_model_analysis as tfma\n",
- "import tfx\n",
- "\n",
- "print('TF version: {}'.format(tf.__version__))\n",
- "print('TFMA version: {}'.format(tfma.__version__))\n",
- "print('TFX version: {}'.format(tfx.__version__))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TOsQbkky0ok7"
- },
- "source": [
- "We are ready to create a pipeline.\n",
- "\n",
- "Set `PROJECT_DIR` to appropriate destination for your environment.\n",
- "Default value is `~/imported/${PIPELINE_NAME}` which is appropriate for\n",
- "[Google Cloud AI Platform Notebook](https://console.cloud.google.com/ai-platform/notebooks/)\n",
- "environment.\n",
- "\n",
- "You may give your pipeline a different name by changing the `PIPELINE_NAME`\n",
- "below. This will also become the name of the project directory where your\n",
- "files will be put.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "cIPlt-700ok-"
- },
- "outputs": [],
- "source": [
- "PIPELINE_NAME=\"my_pipeline\"\n",
- "import os\n",
- "# Set this project directory to your new tfx pipeline project.\n",
- "PROJECT_DIR=os.path.join(os.path.expanduser(\"~\"), \"imported\", PIPELINE_NAME)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ozHIomcd0olB"
- },
- "source": [
- "### Copy template files.\n",
- "\n",
- "TFX includes the `penguin` template with the TFX python package.\n",
- "`penguin` template\n",
- "contains many instructions to bring your dataset into the pipeline which is\n",
- "the purpose of this tutorial.\n",
- "\n",
- "The `tfx template copy` CLI command copies predefined template files into your\n",
- "project directory."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "VLXpTTjU0olD"
- },
- "outputs": [],
- "source": [
- "# Set `PATH` to include user python binary directory and a directory containing `skaffold`.\n",
- "PATH=%env PATH\n",
- "%env PATH={PATH}:/home/jupyter/.local/bin\n",
- "\n",
- "!tfx template copy \\\n",
- " --pipeline-name={PIPELINE_NAME} \\\n",
- " --destination-path={PROJECT_DIR} \\\n",
- " --model=penguin"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "yxOT19QS0olH"
- },
- "source": [
- "Change the working directory context in this notebook to the project directory."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "6P-HljcU0olI"
- },
- "outputs": [],
- "source": [
- "%cd {PROJECT_DIR}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "1tEYUQxH0olO"
- },
- "source": [
- "\u003eNOTE: If you are using JupyterLab or Google Cloud AI Platform Notebook,\n",
- "don't forget to change directory in `File Browser` on the left by clicking\n",
- "into the project directory once it is created."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "IzT2PFrN0olQ"
- },
- "source": [
- "### Browse your copied source files\n",
- "\n",
- "The TFX template provides basic scaffold files to build a pipeline,\n",
- "including Python source code and sample data. The `penguin` template uses the\n",
- "same *Palmer Penguins* dataset and ML model as the\n",
- "[Penguin example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/penguin).\n",
- "\n",
- "Here is brief introduction to each of the Python files.\n",
- "- `pipeline` - This directory contains the definition of the pipeline\n",
- " - `configs.py` — defines common constants for pipeline runners\n",
- " - `pipeline.py` — defines TFX components and a pipeline\n",
- "- `models` - This directory contains ML model definitions\n",
- " - `features.py`, `features_test.py` — defines features for the model\n",
- " - `preprocessing.py`, `preprocessing_test.py` — defines preprocessing\n",
- " routines for data\n",
- " - `constants.py` — defines constants of the model\n",
- " - `model.py`, `model_test.py` — defines ML model using ML frameworks\n",
- " like TensorFlow\n",
- "- `local_runner.py` — define a runner for local environment which uses\n",
- "local orchestration engine\n",
- "- `kubeflow_runner.py` — define a runner for Kubeflow Pipelines\n",
- "orchestration engine\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "saF1CpVefaf1"
- },
- "source": [
- "By default, the template only includes standard TFX components. If you need\n",
- "some customized actions, you can create custom components for your pipeline.\n",
- "Please see\n",
- "[TFX custom component guide](../../../guide/understanding_custom_components)\n",
- "for the detail."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ROwHAsDK0olT"
- },
- "source": [
- "#### Unit-test files.\n",
- "\n",
- "You might notice that there are some files with `_test.py` in their name.\n",
- "These are unit tests of the pipeline and it is recommended to add more unit\n",
- "tests as you implement your own pipelines.\n",
- "You can run unit tests by supplying the module name of test files with `-m`\n",
- "flag. You can usually get a module name by deleting `.py` extension and\n",
- "replacing `/` with `.`. For example:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "M0cMdE2Z0olU"
- },
- "outputs": [],
- "source": [
- "import sys\n",
- "!{sys.executable} -m models.features_test"
- ]
- },
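The path-to-module conversion described above is a simple string transformation. A minimal sketch (the helper name `path_to_module` is hypothetical, not part of the template):

```python
def path_to_module(path):
    """Converts a test-file path such as 'models/features_test.py' into the
    module name expected by `python -m`: drop '.py', replace '/' with '.'."""
    if path.endswith('.py'):
        path = path[:-len('.py')]
    return path.replace('/', '.')

print(path_to_module('models/features_test.py'))
# models.features_test
```

The resulting name is what you pass after `-m`, as in `python -m models.features_test`.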
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "tO9Jhplo0olX"
- },
- "source": [
- "### Create a TFX pipeline in local environment.\n",
- "\n",
- "TFX supports several orchestration engines to run pipelines. We will use\n",
- "local orchestration engine. Local orchestration engine runs without any further dependencies, and it is suitable for development and debugging because it runs\n",
- "on local environment rather than depends on remote computing clusters."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ZO6yR8lOo5GZ"
- },
- "source": [
- "We will use `local_runner.py` to run your pipeline using local\n",
- "orchestrator. You have to create a pipeline before running it. You can create\n",
- "a pipeline with `pipeline create` command.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "_9unbcHlo7Yi"
- },
- "outputs": [],
- "source": [
- "!tfx pipeline create --engine=local --pipeline_path=local_runner.py"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NrRL6R06o99S"
- },
- "source": [
- "`pipeline create` command registers your pipeline defined in `local_runner.py`\n",
- "without actually running it.\n",
- "\n",
- "You will run the created pipeline with `run create` command in following steps.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "7p73589GbTFi"
- },
- "source": [
- "## Step 2. Ingest YOUR data to the pipeline.\n",
- "\n",
- "The initial pipeline ingests the penguin dataset which is included in the\n",
- "template. You need to put your data into the pipeline, and most TFX\n",
- "pipelines start with ExampleGen component."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "AEVwa28qtjwi"
- },
- "source": [
- "### Choose an ExampleGen\n",
- "\n",
- "Your data can be stored anywhere your pipeline can access, on either a local or distributed filesystem, or a query-able system. TFX provides various\n",
- "[`ExampleGen` components](../../../guide/examplegen)\n",
- "to bring your data into a TFX pipeline. You can choose one from following\n",
- "example generating components.\n",
- "\n",
- "- CsvExampleGen: Reads CSV files in a directory. Used in\n",
- "[penguin example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/penguin)\n",
- "and\n",
- "[Chicago taxi example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline).\n",
- "- ImportExampleGen: Takes TFRecord files with TF Example data format. Used in\n",
- "[MNIST examples](https://github.com/tensorflow/tfx/tree/master/tfx/examples/mnist).\n",
- "- FileBasedExampleGen for\n",
- "[Avro](https://github.com/tensorflow/tfx/blob/master/tfx/components/example_gen/custom_executors/avro_executor.py)\n",
- "or\n",
- "[Parquet](https://github.com/tensorflow/tfx/blob/master/tfx/components/example_gen/custom_executors/parquet_executor.py)\n",
- "format.\n",
- "- [BigQueryExampleGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/extensions/google_cloud_big_query/example_gen/component/BigQueryExampleGen):\n",
- "Reads data in Google Cloud BigQuery directly. Used in\n",
- "[Chicago taxi examples](https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline).\n",
- "\n",
- "You can also create your own ExampleGen, for example, tfx includes\n",
- "[a custom ExecampleGen which uses Presto](https://github.com/tensorflow/tfx/tree/master/tfx/examples/custom_components/presto_example_gen)\n",
- "as a data source. See\n",
- "[the guide](../../../guide/examplegen#custom_examplegen)\n",
- "for more information on how to use and develop custom executors.\n",
- "\n",
- "Once you decide which ExampleGen to use, you will need to modify the pipeline\n",
- "definition to use your data.\n",
- "\n",
- "1. Modify the `DATA_PATH` in `local_runner.py` and set it to the location of\n",
- "your files.\n",
- " - If you have files in local environment, specify the path. This is the\n",
- " best option for developing or debugging a pipeline.\n",
- " - If the files are stored in GCS, you can use a path starting with\n",
- " `gs://{bucket_name}/...`. Please make sure that you can access GCS from\n",
- " your terminal, for example, using\n",
- " [`gsutil`](https://cloud.google.com/storage/docs/gsutil).\n",
- " Please follow\n",
- " [authorization guide in Google Cloud](https://cloud.google.com/sdk/docs/authorizing)\n",
- " if needed.\n",
- " - If you want to use a Query-based ExampleGen like BigQueryExampleGen, you\n",
- " need a Query statement to select data from the data source. There are a few\n",
- " more things you need to set to use Google Cloud BigQuery as a data source.\n",
- " - In `pipeline/configs.py`:\n",
- " - Change `GOOGLE_CLOUD_PROJECT` and `GCS_BUCKET_NAME` to your GCP\n",
- " project and bucket name. The bucket should exist before we run\n",
- " the pipeline.\n",
- " - Uncomment `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` variable.\n",
- " - Uncomment and set `BIG_QUERY_QUERY` variable to\n",
- " **your query statement**.\n",
- " - In `local_runner.py`:\n",
- " - Comment out `data_path` argument and uncomment `query` argument\n",
- " instead in `pipeline.create_pipeline()`.\n",
- " - In `pipeline/pipeline.py`:\n",
- " - Comment out `data_path` argument and uncomment `query` argument\n",
- " in `create_pipeline()`.\n",
- " - Use\n",
- " [BigQueryExampleGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/extensions/google_cloud_big_query/example_gen/component/BigQueryExampleGen)\n",
- " instead of CsvExampleGen.\n",
- "\n",
- "1. Replace existing CsvExampleGen to your ExampleGen class in\n",
- "`pipeline/pipeline.py`. Each ExampleGen class has different signature.\n",
- "Please see [ExampleGen component guide](../../../guide/examplegen) for more detail. Don't forget to import required modules with\n",
- "`import` statements in `pipeline/pipeline.py`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "xiAG6acjbl5U"
- },
- "source": [
- "The initial pipeline consists of four components: `ExampleGen`,\n",
- "`StatisticsGen`, `SchemaGen` and `ExampleValidator`. We don't need to change\n",
- "anything for `StatisticsGen`, `SchemaGen` and `ExampleValidator`. Let's run the\n",
- "pipeline for the first time."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tKOI48WumF7h"
- },
- "outputs": [],
- "source": [
- "# Update and run the pipeline.\n",
- "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
- " \u0026\u0026 tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "eFbYFNVbbzna"
- },
- "source": [
- "You should see \"Component ExampleValidator is finished.\" if the pipeline ran successfully."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "uuD5FRPAcOn8"
- },
- "source": [
- "### Examine output of the pipeline."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "tL1wWoDh5wkj"
- },
- "source": [
- "A TFX pipeline produces two kinds of output: artifacts and a\n",
- "[metadata DB (MLMD)](../../../guide/mlmd) which contains\n",
- "metadata of artifacts and pipeline executions. The location of the output is\n",
- "defined in `local_runner.py`. By default, artifacts are stored under the\n",
- "`tfx_pipeline_output` directory and metadata is stored as an SQLite database\n",
- "under the `tfx_metadata` directory.\n",
- "\n",
- "You can use MLMD APIs to examine these outputs. First, we will define some\n",
- "utility functions to search output artifacts that were just produced."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "K0i_jTvOI8mv"
- },
- "outputs": [],
- "source": [
- "import tensorflow as tf\n",
- "import tfx\n",
- "from ml_metadata import errors\n",
- "from ml_metadata.proto import metadata_store_pb2\n",
- "from tfx.types import artifact_utils\n",
- "\n",
- "# TODO(b/171447278): Move these functions into TFX library.\n",
- "\n",
- "def get_latest_executions(store, pipeline_name, component_id=None):\n",
- " \"\"\"Fetch all pipeline runs.\"\"\"\n",
- " if component_id is None: # Find entire pipeline runs.\n",
- " run_contexts = [\n",
- " c for c in store.get_contexts_by_type('run')\n",
- " if c.properties['pipeline_name'].string_value == pipeline_name\n",
- " ]\n",
- " else: # Find specific component runs.\n",
- " run_contexts = [\n",
- " c for c in store.get_contexts_by_type('component_run')\n",
- " if c.properties['pipeline_name'].string_value == pipeline_name and\n",
- " c.properties['component_id'].string_value == component_id\n",
- " ]\n",
- " if not run_contexts:\n",
- " return []\n",
- " # Pick the latest run context.\n",
- " latest_context = max(run_contexts,\n",
- " key=lambda c: c.last_update_time_since_epoch)\n",
- " return store.get_executions_by_context(latest_context.id)\n",
- "\n",
- "def get_latest_artifacts(store, pipeline_name, component_id=None):\n",
- " \"\"\"Fetch all artifacts from latest pipeline execution.\"\"\"\n",
- " executions = get_latest_executions(store, pipeline_name, component_id)\n",
- "\n",
- " # Fetch all artifacts produced from the given executions.\n",
- " execution_ids = [e.id for e in executions]\n",
- " events = store.get_events_by_execution_ids(execution_ids)\n",
- " artifact_ids = [\n",
- " event.artifact_id for event in events\n",
- " if event.type == metadata_store_pb2.Event.OUTPUT\n",
- " ]\n",
- " return store.get_artifacts_by_id(artifact_ids)\n",
- "\n",
- "def find_latest_artifacts_by_type(store, artifacts, artifact_type):\n",
- "  \"\"\"Get the latest artifacts of a specified type.\"\"\"\n",
- "  # Get type information from MLMD\n",
- "  try:\n",
- "    artifact_type = store.get_artifact_type(artifact_type)\n",
- "  except errors.NotFoundError:\n",
- "    return []\n",
- "  # Filter artifacts with type.\n",
- "  filtered_artifacts = [artifact for artifact in artifacts\n",
- "                        if artifact.type_id == artifact_type.id]\n",
- " # Convert MLMD artifact data into TFX Artifact instances.\n",
- " return [artifact_utils.deserialize_artifact(artifact_type, artifact)\n",
- " for artifact in filtered_artifacts]\n",
- "\n",
- "\n",
- "from tfx.orchestration.experimental.interactive import visualizations\n",
- "\n",
- "def visualize_artifacts(artifacts):\n",
- " \"\"\"Visualizes artifacts using standard visualization modules.\"\"\"\n",
- " for artifact in artifacts:\n",
- " visualization = visualizations.get_registry().get_visualization(\n",
- " artifact.type_name)\n",
- " if visualization:\n",
- " visualization.display(artifact)\n",
- "\n",
- "from tfx.orchestration.experimental.interactive import standard_visualizations\n",
- "standard_visualizations.register_standard_visualizations()\n",
- "\n",
- "import pprint\n",
- "\n",
- "from tfx.orchestration import metadata\n",
- "from tfx.types import artifact_utils\n",
- "from tfx.types import standard_artifacts\n",
- "\n",
- "def preview_examples(artifacts):\n",
- " \"\"\"Preview a few records from Examples artifacts.\"\"\"\n",
- " pp = pprint.PrettyPrinter()\n",
- " for artifact in artifacts:\n",
- " print(\"==== Examples artifact:{}({})\".format(artifact.name, artifact.uri))\n",
- " for split in artifact_utils.decode_split_names(artifact.split_names):\n",
- " print(\"==== Reading from split:{}\".format(split))\n",
- " split_uri = artifact_utils.get_split_uri([artifact], split)\n",
- "\n",
- " # Get the list of files in this directory (all compressed TFRecord files)\n",
- " tfrecord_filenames = [os.path.join(split_uri, name)\n",
- " for name in os.listdir(split_uri)]\n",
- " # Create a `TFRecordDataset` to read these files\n",
- " dataset = tf.data.TFRecordDataset(tfrecord_filenames,\n",
- " compression_type=\"GZIP\")\n",
- " # Iterate over the first 2 records and decode them.\n",
- " for tfrecord in dataset.take(2):\n",
- " serialized_example = tfrecord.numpy()\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(serialized_example)\n",
- " pp.pprint(example)\n",
- "\n",
- "import local_runner\n",
- "\n",
- "metadata_connection_config = metadata.sqlite_metadata_connection_config(\n",
- " local_runner.METADATA_PATH)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "cmwor9nVcmxy"
- },
- "source": [
- "Now we can read metadata of output artifacts from MLMD."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "TtsrZEUB1-J4"
- },
- "outputs": [],
- "source": [
- "with metadata.Metadata(metadata_connection_config) as metadata_handler:\n",
- "  # Search all artifacts from the previous pipeline run.\n",
- " artifacts = get_latest_artifacts(metadata_handler.store, PIPELINE_NAME)\n",
- " # Find artifacts of Examples type.\n",
- " examples_artifacts = find_latest_artifacts_by_type(\n",
- " metadata_handler.store, artifacts,\n",
- " standard_artifacts.Examples.TYPE_NAME)\n",
- " # Find artifacts generated from StatisticsGen.\n",
- " stats_artifacts = find_latest_artifacts_by_type(\n",
- " metadata_handler.store, artifacts,\n",
- " standard_artifacts.ExampleStatistics.TYPE_NAME)\n",
- " # Find artifacts generated from SchemaGen.\n",
- " schema_artifacts = find_latest_artifacts_by_type(\n",
- " metadata_handler.store, artifacts,\n",
- " standard_artifacts.Schema.TYPE_NAME)\n",
- " # Find artifacts generated from ExampleValidator.\n",
- " anomalies_artifacts = find_latest_artifacts_by_type(\n",
- " metadata_handler.store, artifacts,\n",
- " standard_artifacts.ExampleAnomalies.TYPE_NAME)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3U5MNAUIdBtN"
- },
- "source": [
- "Now we can examine outputs from each component.\n",
- "[TensorFlow Data Validation (TFDV)](https://www.tensorflow.org/tfx/data_validation/get_started)\n",
- "is used in `StatisticsGen`, `SchemaGen` and `ExampleValidator`, and TFDV can\n",
- "be used to visualize outputs from these components.\n",
- "\n",
- "In this tutorial, we will use visualization helper methods in TFX which use TFDV\n",
- "internally to show the visualization. Please see\n",
- "[TFX components tutorial](/tutorials/tfx/components_keras)\n",
- "to learn more about each component."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "AxS6FsgU2IoZ"
- },
- "source": [
- "#### Examine output from ExampleGen\n",
- "\n",
- "Let's examine output from ExampleGen. Take a look at the first two examples for\n",
- "each split:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "3NWzXEcE13tW"
- },
- "outputs": [],
- "source": [
- "preview_examples(examples_artifacts)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Q0uiuPhkGEBz"
- },
- "source": [
- "By default, TFX ExampleGen divides examples into two splits, *train* and\n",
- "*eval*, but you can\n",
- "[adjust your split configuration](../../../guide/examplegen#span_version_and_split)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "yVh13wJu-IRv"
- },
- "source": [
- "#### Examine output from StatisticsGen\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "9LipxUp7-IRw",
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "visualize_artifacts(stats_artifacts)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8aebEY4c0Ju7"
- },
- "source": [
- "These statistics are supplied to SchemaGen to construct a schema of the data\n",
- "automatically."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ExEKbdw8-IRx"
- },
- "source": [
- "#### Examine output from SchemaGen\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "M2IURBSp-IRy"
- },
- "outputs": [],
- "source": [
- "visualize_artifacts(schema_artifacts)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oTvj8yeBHDdU"
- },
- "source": [
- "This schema is automatically inferred from the output of StatisticsGen.\n",
- "We will use this generated schema in this tutorial, but you can also\n",
- "[modify and customize the schema](../../../guide/statsgen#creating_a_curated_schema)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "rl1PEUgo-IRz"
- },
- "source": [
- "#### Examine output from ExampleValidator\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "F-4oAjGR-IR0",
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "visualize_artifacts(anomalies_artifacts)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "t026ZzbU0961"
- },
- "source": [
- "If any anomalies were found, you should review your data to check that all\n",
- "examples follow your assumptions. Outputs from other components like\n",
- "StatisticsGen might be useful. Anomalies that are found do not block the pipeline execution."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "lFMmqy1W-IR1"
- },
- "source": [
- "You can see the available features in the outputs of `SchemaGen`. If your\n",
- "features can be used to construct an ML model in `Trainer` directly, you can\n",
- "skip the next step and go to Step 4. Otherwise, you can do some feature\n",
- "engineering work in the next step. A `Transform` component is needed when\n",
- "full-pass operations like calculating averages are required, especially when\n",
- "you need to scale features."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "bYH8Y2KB0olm"
- },
- "source": [
- "## Step 3. (Optional) Feature engineering with Transform component.\n",
- "\n",
- "In this step, you will define feature engineering jobs which will be\n",
- "used by the `Transform` component in the pipeline. See the\n",
- "[Transform component guide](../../../guide/transform)\n",
- "for more information.\n",
- "\n",
- "This is only necessary if your training code requires additional features\n",
- "which are not available in the output of ExampleGen. Otherwise, feel free to\n",
- "fast-forward to the next step of using Trainer."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "qm_JjQUydbbb"
- },
- "source": [
- "### Define features of the model\n",
- "\n",
- "`models/features.py` contains constants that define features for the model,\n",
- "including feature names, vocabulary sizes and so on. By default, the `penguin`\n",
- "template has two constants, `FEATURE_KEYS` and `LABEL_KEY`, because our `penguin`\n",
- "model solves a classification problem using supervised learning and all\n",
- "features are continuous numeric features. See\n",
- "[feature definitions from the chicago taxi example](https://github.com/tensorflow/tfx/blob/master/tfx/experimental/templates/taxi/models/features.py)\n",
- "for another example.\n"
- ]
- },
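As a rough sketch of what such a constants module might contain (the feature and label names below are assumptions based on the Palmer Penguins dataset, not copied from the template):

```python
# Hypothetical sketch of a models/features.py for a penguin-like model.
# FEATURE_KEYS / LABEL_KEY names are illustrative assumptions.
FEATURE_KEYS = [
    'culmen_length_mm',
    'culmen_depth_mm',
    'flipper_length_mm',
    'body_mass_g',
]
LABEL_KEY = 'species'

def transformed_name(key: str) -> str:
    # A common TFX convention: suffix feature names produced by Transform.
    return key + '_xf'
```

Keeping these names in one module lets the preprocessing and model code import them instead of repeating string literals.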
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ATUeHXvJdcBn"
- },
- "source": [
- "### Implement preprocessing for training / serving in preprocessing_fn().\n",
- "\n",
- "Actual feature engineering happens in the `preprocessing_fn()` function in\n",
- "`models/preprocessing.py`.\n",
- "\n",
- "In `preprocessing_fn` you can define a series of functions that manipulate the\n",
- "input dict of tensors to produce the output dict of tensors. There are helper\n",
- "functions like `scale_to_0_1` and `compute_and_apply_vocabulary` in the\n",
- "TensorFlow Transform API or you can simply use regular TensorFlow functions.\n",
- "By default, the `penguin` template includes example usage of the\n",
- "[tft.scale_to_z_score](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/scale_to_z_score)\n",
- "function to normalize feature values.\n",
- "\n",
- "See the [TensorFlow Transform guide](https://www.tensorflow.org/tfx/transform/get_started)\n",
- "for more information about authoring `preprocessing_fn`.\n"
- ]
- },
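To make the z-score scaling concrete, here is a minimal plain-Python illustration of the arithmetic that `tft.scale_to_z_score` performs over a full pass of the data (TFT itself operates on tensors via a Beam analysis pass; this sketch only shows the dict-in/dict-out shape of a `preprocessing_fn` and the scaling formula):

```python
from statistics import fmean, pstdev

def scale_to_z_score(values):
    """Normalize values to mean 0 and (population) standard deviation 1,
    mirroring the arithmetic of tft.scale_to_z_score over a full data pass."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

# preprocessing_fn-style contract: dict of raw features in, dict of
# transformed features out (feature name here is illustrative).
inputs = {'body_mass_g': [3500.0, 4000.0, 4500.0]}
outputs = {key + '_xf': scale_to_z_score(vals) for key, vals in inputs.items()}
print(outputs['body_mass_g_xf'])
```

The real `preprocessing_fn` does the same kind of mapping, but on TensorFlow tensors and with statistics computed by TFT's analyzers.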
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "xUg_Lc43dbTp"
- },
- "source": [
- "### Add Transform component to the pipeline.\n",
- "\n",
- "If your `preprocessing_fn` is ready, add the `Transform` component to the pipeline.\n",
- "\n",
- "1. In `pipeline/pipeline.py` file, uncomment `# components.append(transform)`\n",
- "to add the component to the pipeline.\n",
- "\n",
- "You can update the pipeline and run again."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "VE-Pqvto0olm"
- },
- "outputs": [],
- "source": [
- "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
- " \u0026\u0026 tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8q1ZYEHX0olo"
- },
- "source": [
- "If the pipeline ran successfully, you should see \"Component Transform is\n",
- "finished.\" *somewhere* in the log. Because the `Transform` component and the\n",
- "`ExampleValidator` component are not dependent on each other, the order of\n",
- "execution is not fixed. Either `Transform` or\n",
- "`ExampleValidator` can be the last component in the pipeline execution."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "XrPEnZt0E_0m"
- },
- "source": [
- "### Examine output from Transform\n",
- "\n",
- "The Transform component creates two kinds of outputs: a TensorFlow graph and\n",
- "transformed examples. The transformed examples are of the Examples artifact type,\n",
- "which is also produced by ExampleGen, but these contain transformed feature\n",
- "values instead.\n",
- "\n",
- "You can examine them as we did in the previous step."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "FvC5S66ZU5g6"
- },
- "outputs": [],
- "source": [
- "with metadata.Metadata(metadata_connection_config) as metadata_handler:\n",
- "  # Search all artifacts from the previous run of Transform component.\n",
- " artifacts = get_latest_artifacts(metadata_handler.store,\n",
- " PIPELINE_NAME, \"Transform\")\n",
- " # Find artifacts of Examples type.\n",
- " transformed_examples_artifacts = find_latest_artifacts_by_type(\n",
- " metadata_handler.store, artifacts,\n",
- " standard_artifacts.Examples.TYPE_NAME)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "CAFiEPKuC6Ib"
- },
- "outputs": [],
- "source": [
- "preview_examples(transformed_examples_artifacts)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "dWMBXU510olp"
- },
- "source": [
- "## Step 4. Train your model with Trainer component.\n",
- "\n",
- "We will build an ML model using the `Trainer` component. See the\n",
- "[Trainer component guide](../../../guide/trainer)\n",
- "for more information. You need to provide your model code to the Trainer\n",
- "component.\n",
- "\n",
- "### Define your model.\n",
- "\n",
- "In the penguin template, `models.model.run_fn` is used as the `run_fn` argument for\n",
- "the `Trainer` component. This means that the `run_fn()` function in `models/model.py`\n",
- "will be called when the `Trainer` component runs. The given code constructs\n",
- "a simple DNN model using the Keras API. See the\n",
- "[TensorFlow 2.x in TFX](../../../guide/keras)\n",
- "guide for more information about using the Keras API in TFX.\n",
- "\n",
- "In this `run_fn`, you should build a model and save it to the directory pointed\n",
- "to by `fn_args.serving_model_dir`, which is specified by the component. You can use\n",
- "other arguments in `fn_args`, which is passed into the `run_fn`. See the\n",
- "[related code](https://github.com/tensorflow/tfx/blob/b01482442891a49a1487c67047e85ab971717b75/tfx/components/trainer/executor.py#L141)\n",
- "for the full list of arguments in `fn_args`.\n",
- "\n",
- "Define your features in `models/features.py` and use them as needed. If you\n",
- "have transformed your features in Step 3, you should use transformed features\n",
- "as inputs to your model."
- ]
- },
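The `run_fn` contract can be sketched without TensorFlow: the Trainer executor calls your function with a populated `fn_args`, and your function is expected to write a saved model under `fn_args.serving_model_dir`. Below is a hypothetical stand-in, not the template's real training code:

```python
import os
import tempfile
from types import SimpleNamespace

def run_fn(fn_args):
    """Stand-in for run_fn in models/model.py: a real implementation builds
    and trains a Keras model, then calls model.save(fn_args.serving_model_dir)."""
    os.makedirs(fn_args.serving_model_dir, exist_ok=True)
    marker = os.path.join(fn_args.serving_model_dir, 'saved_model.pb')
    with open(marker, 'wb') as f:
        f.write(b'')  # placeholder for the serialized SavedModel

# Simulate how the Trainer executor invokes run_fn with a populated fn_args.
fn_args = SimpleNamespace(
    serving_model_dir=os.path.join(tempfile.mkdtemp(), 'serving_model'))
run_fn(fn_args)
print(os.listdir(fn_args.serving_model_dir))
```

The key point is that Trainer treats `run_fn` as a black box; everything it needs (data locations, schema path, output directory) arrives through `fn_args`.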
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oFiLIaCm-IR4"
- },
- "source": [
- "### Add Trainer component to the pipeline.\n",
- "\n",
- "If your `run_fn` is ready, add the `Trainer` component to the pipeline.\n",
- "\n",
- "1. In `pipeline/pipeline.py` file, uncomment `# components.append(trainer)`\n",
- "to add the component to the pipeline.\n",
- "\n",
- "Arguments for the Trainer component might depend on whether you use the Transform\n",
- "component or not.\n",
- "- If you do **NOT** use the `Transform` component, you don't need to change the\n",
- "arguments.\n",
- "- If you use the `Transform` component, you need to change the arguments\n",
- "when creating a `Trainer` component instance.\n",
- "  - Change `examples` argument to\n",
- "  `examples=transform.outputs['transformed_examples'],`. We need to use\n",
- "  transformed examples for training.\n",
- "  - Add a `transform_graph` argument like\n",
- "  `transform_graph=transform.outputs['transform_graph'],`. This\n",
- "  contains the TensorFlow graph for the transform operations.\n",
- "  - After the above changes, the code for Trainer component creation will\n",
- "  look like the following.\n",
- "\n",
- " ```python\n",
- " # If you use a Transform component.\n",
- " trainer = Trainer(\n",
- " run_fn=run_fn,\n",
- " examples=transform.outputs['transformed_examples'],\n",
- " transform_graph=transform.outputs['transform_graph'],\n",
- " schema=schema_gen.outputs['schema'],\n",
- " ...\n",
- " ```\n",
- "\n",
- "You can update the pipeline and run again."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "VQDNitkH0olq"
- },
- "outputs": [],
- "source": [
- "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
- " \u0026\u0026 tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ksWfVQUnMYCX"
- },
- "source": [
- "When this execution runs successfully, you have now created and run your first\n",
- "TFX pipeline for your model. Congratulations!\n",
- "\n",
- "Your new model will be located somewhere under the output directory, but it\n",
- "would be better to keep the model in a fixed location or service outside of the TFX\n",
- "pipeline, which holds many interim results. Better still is continuous\n",
- "evaluation of the built model, which is critical in ML production systems. We\n",
- "will see how continuous evaluation and deployment work in TFX in the next step."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4DRTFdTy0ol3"
- },
- "source": [
- "## Step 5. (Optional) Evaluate the model with Evaluator and publish with pusher.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "5DID2nzH-IR7"
- },
- "source": [
- "The [`Evaluator`](../../../guide/evaluator) component\n",
- "continuously evaluates every model built by `Trainer`, and\n",
- "[`Pusher`](../../../guide/pusher) copies the model to\n",
- "a predefined location in the file system or even to\n",
- "[Google Cloud AI Platform Models](https://console.cloud.google.com/ai-platform/models).\n",
- "\n",
- "### Add the Evaluator component to the pipeline.\n",
- "\n",
- "In `pipeline/pipeline.py` file:\n",
- "1. Uncomment `# components.append(model_resolver)` to add the latest model resolver\n",
- "to the pipeline. Evaluator can be used to compare a model with the old baseline\n",
- "model which passed Evaluator in the last pipeline run. `LatestBlessedModelResolver`\n",
- "finds the latest model which passed Evaluator.\n",
- "1. Set proper `tfma.MetricsSpec` for your model. Evaluation might be different\n",
- "for every ML model. In the penguin template, `SparseCategoricalAccuracy` was used\n",
- "because we are solving a multi-category classification problem. You also need\n",
- "to specify `tfma.SliceSpec` to analyze your model for specific slices. For more\n",
- "detail, see\n",
- "[Evaluator component guide](../../../guide/evaluator).\n",
- "1. Uncomment `# components.append(evaluator)` to add the component to the\n",
- "pipeline.\n",
- "\n",
- "You can update the pipeline and run again."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "i5_ojoZZmaDQ"
- },
- "outputs": [],
- "source": [
- "# Update and run the pipeline.\n",
- "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
- " \u0026\u0026 tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "apZX74qJ-IR7"
- },
- "source": [
- "### Examine output of Evaluator\n",
- "This step requires the TensorFlow Model Analysis (TFMA) Jupyter notebook extension.\n",
- "Note that the version of the TFMA notebook extension should be identical to the\n",
- "version of TFMA python package.\n",
- "\n",
- "The following command will install the TFMA notebook extension from the NPM registry. It\n",
- "might take several minutes to complete."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "VoL46D5Pw5FX"
- },
- "outputs": [],
- "source": [
- "# Install TFMA notebook extension.\n",
- "!jupyter labextension install tensorflow_model_analysis@{tfma.__version__}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GKMo4j8ww5PB"
- },
- "source": [
- "When the installation completes, please **reload your browser** to make the\n",
- "extension take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2ztotdqS-IR8"
- },
- "outputs": [],
- "source": [
- "with metadata.Metadata(metadata_connection_config) as metadata_handler:\n",
- "  # Search all artifacts from the previous pipeline run.\n",
- " artifacts = get_latest_artifacts(metadata_handler.store, PIPELINE_NAME)\n",
- " model_evaluation_artifacts = find_latest_artifacts_by_type(\n",
- " metadata_handler.store, artifacts,\n",
- " standard_artifacts.ModelEvaluation.TYPE_NAME)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "tVojwMCuDJuk"
- },
- "outputs": [],
- "source": [
- "if model_evaluation_artifacts:\n",
- " tfma_result = tfma.load_eval_result(model_evaluation_artifacts[0].uri)\n",
- " tfma.view.render_slicing_metrics(tfma_result)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "18tqjyHN-IR9"
- },
- "source": [
- "### Add the Pusher component to the pipeline.\n",
- "\n",
- "If the model looks promising, we need to publish the model.\n",
- "[Pusher component](../../../guide/pusher)\n",
- "can publish the model to a location in the filesystem or to GCP AI Platform\n",
- "Models using\n",
- "[a custom executor](https://github.com/tensorflow/tfx/blob/master/tfx/extensions/google_cloud_ai_platform/pusher/executor.py).\n",
- "\n",
- "1. In `local_runner.py`, set `SERVING_MODEL_DIR` to a directory to publish.\n",
- "1. In `pipeline/pipeline.py` file, uncomment `# components.append(pusher)`\n",
- "to add Pusher to the pipeline.\n",
- "\n",
- "You can update the pipeline and run again."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "QH81d9FsrSXS"
- },
- "outputs": [],
- "source": [
- "# Update and run the pipeline.\n",
- "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
- " \u0026\u0026 tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_K6Z18tC-IR-"
- },
- "source": [
- "You should be able to find your new model at `SERVING_MODEL_DIR`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "20KRGsPX0ol3"
- },
- "source": [
- "## Step 6. (Optional) Deploy your pipeline to Kubeflow Pipelines on GCP.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "0X6vfy7s-IR-"
- },
- "source": [
- "As mentioned earlier, `local_runner.py` is good for debugging and development\n",
- "purposes but not the best solution for production workloads. In this step, we will\n",
- "deploy the pipeline to Kubeflow Pipelines on Google Cloud.\n",
- "\n",
- "### Preparation\n",
- "We need the `kfp` Python package and the `skaffold` program to deploy a pipeline to a\n",
- "Kubeflow Pipelines cluster."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Ge1bMUtU-IR_"
- },
- "outputs": [],
- "source": [
- "!pip install --upgrade -q kfp\n",
- "\n",
- "# Download skaffold and set it executable.\n",
- "!curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 \u0026\u0026 chmod +x skaffold"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ZsbnGq52-ISB"
- },
- "source": [
- "You need to move the `skaffold` binary to a place where your shell can find it.\n",
- "Alternatively, you can specify the path to skaffold when you run the `tfx` binary with\n",
- "the `--skaffold-cmd` flag."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4amQ0Elz-ISC"
- },
- "outputs": [],
- "source": [
- "# Move skaffold binary into your path\n",
- "!mv skaffold /home/jupyter/.local/bin/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "1rmyns-o-ISD"
- },
- "source": [
- "You also need a Kubeflow Pipelines cluster to run the pipeline. Please\n",
- "follow Steps 1 and 2 in\n",
- "[TFX on Cloud AI Platform Pipelines tutorial](/tutorials/tfx/cloud-ai-platform-pipelines).\n",
- "\n",
- "When your cluster is ready, open the pipeline dashboard by clicking\n",
- "*Open Pipelines Dashboard* in the\n",
- "[`Pipelines` page of the Google cloud console](http://console.cloud.google.com/ai-platform/pipelines).\n",
- "The URL of this page is the `ENDPOINT` used to request a pipeline run. The endpoint\n",
- "value is everything in the URL after the https://, up to, and including,\n",
- "googleusercontent.com. Put your endpoint into the following code block.\n",
- "\n"
- ]
- },
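The extraction rule described above can be sketched as a small helper (hypothetical, with a made-up dashboard URL; it is not part of the template):

```python
from urllib.parse import urlparse

def endpoint_from_dashboard_url(url: str) -> str:
    """Return everything after https:// up to and including
    googleusercontent.com, i.e. the host part of the dashboard URL."""
    host = urlparse(url).netloc
    if not host.endswith('googleusercontent.com'):
        raise ValueError('Not a Kubeflow Pipelines dashboard URL: ' + url)
    return host

# Example with a made-up dashboard URL:
ENDPOINT = endpoint_from_dashboard_url(
    'https://12345abcde-dot-us-central2.pipelines.googleusercontent.com/#/pipelines')
print(ENDPOINT)
```

In practice you can simply copy the host out of your browser's address bar by hand; the helper just encodes the same rule.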
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "hGyj-Qa3-ISD"
- },
- "outputs": [],
- "source": [
- "ENDPOINT='' # Enter your ENDPOINT here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "igTo05YI-ISF"
- },
- "source": [
- "To run our code in a Kubeflow Pipelines cluster, we need to pack our code into\n",
- "a container image. The image will be built automatically while deploying our\n",
- "pipeline, and you only need to set a name and a container registry for your\n",
- "image. In our example, we will use\n",
- "[Google Container Registry](https://cloud.google.com/container-registry),\n",
- "and name it `tfx-pipeline`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "5J3LrI0K-ISF"
- },
- "outputs": [],
- "source": [
- "# Read GCP project id from env.\n",
- "shell_output=!gcloud config list --format 'value(core.project)' 2\u003e/dev/null\n",
- "GOOGLE_CLOUD_PROJECT=shell_output[0]\n",
- "\n",
- "# Docker image name for the pipeline image.\n",
- "CUSTOM_TFX_IMAGE='gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Gg11pLmU-ISH"
- },
- "source": [
- "### Set data location.\n",
- "\n",
- "Your data should be accessible from the Kubeflow Pipelines cluster. If you have\n",
- "used data in your local environment, you might need to upload it to remote\n",
- "storage like Google Cloud Storage. For example, we can upload the penguin data to the\n",
- "default bucket which is created automatically when a Kubeflow Pipelines cluster\n",
- "is deployed, as follows."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "y8MmRIHi-ISH"
- },
- "outputs": [],
- "source": [
- "!gsutil cp data/data.csv gs://{GOOGLE_CLOUD_PROJECT}-kubeflowpipelines-default/tfx-template/data/penguin/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ASc1tDMm-ISJ"
- },
- "source": [
- "Update the data location stored at `DATA_PATH` in `kubeflow_runner.py`.\n",
- "\n",
- "If you are using BigQueryExampleGen, there is no need to upload the data file,\n",
- "but please make sure that `kubeflow_runner.py` uses the same `query` and\n",
- "`beam_pipeline_args` arguments for the `pipeline.create_pipeline()` function."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "q42gn3XS-ISK"
- },
- "source": [
- "### Deploy the pipeline.\n",
- "\n",
- "If everything is ready, you can create a pipeline using `tfx pipeline create`\n",
- "command.\n",
- "\u003e Note: When creating a pipeline for Kubeflow Pipelines, we need a container\n",
- "image which will be used to run our pipeline, and `skaffold` will build the\n",
- "image for us. Because `skaffold` pulls base images from Docker Hub, building the\n",
- "image for the first time will take 5~10 minutes, but subsequent builds will take\n",
- "much less time.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ytZ0liBn-ISK"
- },
- "outputs": [],
- "source": [
- "!tfx pipeline create \\\n",
- "--engine=kubeflow \\\n",
- "--pipeline-path=kubeflow_runner.py \\\n",
- "--endpoint={ENDPOINT} \\\n",
- "--build-target-image={CUSTOM_TFX_IMAGE}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fFqUQxQG-ISM"
- },
- "source": [
- "Now start an execution run with the newly created pipeline using the\n",
- "`tfx run create` command."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "4ps-4RHz-ISM"
- },
- "outputs": [],
- "source": [
- "!tfx run create --engine=kubeflow --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Fx3LtAL0-ISN"
- },
- "source": [
- "Alternatively, you can run the pipeline from the Kubeflow Pipelines dashboard. The new\n",
- "run will be listed under `Experiments` in the Kubeflow Pipelines dashboard.\n",
- "Clicking into the experiment will allow you to monitor progress and visualize\n",
- "the artifacts created during the execution run."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "mP8W6zjD-ISO"
- },
- "source": [
- "If you are interested in running your pipeline on Kubeflow Pipelines,\n",
- "find more instructions in\n",
- "[TFX on Cloud AI Platform Pipelines tutorial](/tutorials/tfx/cloud-ai-platform-pipelines)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "PTsgD_Kz-ISO"
- },
- "source": [
- "### Cleaning up\n",
- "\n",
- "To clean up all Google Cloud resources used in this step, you can\n",
- "[delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects)\n",
- "you used for the tutorial.\n",
- "\n",
- "Alternatively, you can clean up individual resources by visiting each\n",
- "consoles:\n",
- "- [Google Cloud Storage](https://console.cloud.google.com/storage)\n",
- "- [Google Container Registry](https://console.cloud.google.com/gcr)\n",
- "- [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes)"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [
- "DjUA6S30k52h"
- ],
- "name": "penguin_template.ipynb",
- "provenance": [],
- "toc_visible": true
- },
- "environment": {
- "name": "tf2-gpu.2-1.m46",
- "type": "gcloud",
- "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m46"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.6"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iLYriYe10okf"
+ },
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "This document provides instructions to create a TensorFlow Extended (TFX)\n",
+ "pipeline for your own dataset using the *penguin template*, which is provided\n",
+ "with the TFX Python package. The created pipeline will initially use the\n",
+ "[Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)\n",
+ "dataset, but we will transform the pipeline for your dataset.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "M0HDv9FAbUy9"
+ },
+ "source": [
+ "### Prerequisites\n",
+ "- Linux / MacOS\n",
+ "- Python 3.6-3.8\n",
+ "- Jupyter notebook\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XaXSIXh9czAX"
+ },
+ "source": [
+ "## Step 1. Copy the predefined template to your project directory.\n",
+ "In this step, we will create a working pipeline project directory and files\n",
+ "by copying files from *penguin template* in TFX. You can think of this as a\n",
+ "scaffold for your TFX pipeline project."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WoUaCyNF_YlF"
+ },
+ "source": [
+ "### Update Pip\n",
+ "\n",
+ "If we're running in Colab, then we should make sure that we have the latest version of Pip. Local systems can of course be updated separately.\n",
+ "\n",
+ "Note: Updating is probably also a good idea if you are running in Vertex AI Workbench."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "iQIUEpFp_ZA2"
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "if 'google.colab' in sys.modules:\n",
+ " !pip install --upgrade pip"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VmVR97AgacoP"
+ },
+ "source": [
+ "### Install required package\n",
+ "First, install TFX and TensorFlow Model Analysis (TFMA).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XNiqq_kN0okj"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -U tfx tensorflow-model-analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hX1rqpbQ0okp"
+ },
+ "source": [
+ "Let's check the versions of TFX and related libraries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XAIoKMNG0okq"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "import tensorflow_model_analysis as tfma\n",
+ "import tfx\n",
+ "\n",
+ "print('TF version: {}'.format(tf.__version__))\n",
+ "print('TFMA version: {}'.format(tfma.__version__))\n",
+ "print('TFX version: {}'.format(tfx.__version__))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TOsQbkky0ok7"
+ },
+ "source": [
+ "We are ready to create a pipeline.\n",
+ "\n",
+ "Set `PROJECT_DIR` to an appropriate destination for your environment.\n",
+ "The default value is `~/imported/${PIPELINE_NAME}`, which is appropriate for\n",
+ "the [Google Cloud AI Platform Notebook](https://console.cloud.google.com/ai-platform/notebooks/)\n",
+ "environment.\n",
+ "\n",
+ "You may give your pipeline a different name by changing the `PIPELINE_NAME`\n",
+ "below. This will also become the name of the project directory where your\n",
+ "files will be put.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "cIPlt-700ok-"
+ },
+ "outputs": [],
+ "source": [
+ "PIPELINE_NAME=\"my_pipeline\"\n",
+ "import os\n",
+ "# Set this project directory to your new tfx pipeline project.\n",
+ "PROJECT_DIR=os.path.join(os.path.expanduser(\"~\"), \"imported\", PIPELINE_NAME)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ozHIomcd0olB"
+ },
+ "source": [
+ "### Copy template files.\n",
+ "\n",
+ "TFX includes the `penguin` template with the TFX Python package.\n",
+ "The `penguin` template\n",
+ "contains many instructions to bring your dataset into the pipeline, which is\n",
+ "the purpose of this tutorial.\n",
+ "\n",
+ "The `tfx template copy` CLI command copies predefined template files into your\n",
+ "project directory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VLXpTTjU0olD"
+ },
+ "outputs": [],
+ "source": [
+ "# Set `PATH` to include user python binary directory and a directory containing `skaffold`.\n",
+ "PATH=%env PATH\n",
+ "%env PATH={PATH}:/home/jupyter/.local/bin\n",
+ "\n",
+ "!tfx template copy \\\n",
+ " --pipeline-name={PIPELINE_NAME} \\\n",
+ " --destination-path={PROJECT_DIR} \\\n",
+ " --model=penguin"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yxOT19QS0olH"
+ },
+ "source": [
+ "Change the working directory context in this notebook to the project directory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "6P-HljcU0olI"
+ },
+ "outputs": [],
+ "source": [
+ "%cd {PROJECT_DIR}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1tEYUQxH0olO"
+ },
+ "source": [
+ ">NOTE: If you are using JupyterLab or Google Cloud AI Platform Notebook,\n",
+ "don't forget to change directory in `File Browser` on the left by clicking\n",
+ "into the project directory once it is created."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IzT2PFrN0olQ"
+ },
+ "source": [
+ "### Browse your copied source files\n",
+ "\n",
+ "The TFX template provides basic scaffold files to build a pipeline,\n",
+ "including Python source code and sample data. The `penguin` template uses the\n",
+ "same *Palmer Penguins* dataset and ML model as the\n",
+ "[Penguin example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/penguin).\n",
+ "\n",
+ "Here is a brief introduction to each of the Python files.\n",
+ "- `pipeline` - This directory contains the definition of the pipeline\n",
+ " - `configs.py` — defines common constants for pipeline runners\n",
+ " - `pipeline.py` — defines TFX components and a pipeline\n",
+ "- `models` - This directory contains ML model definitions\n",
+ " - `features.py`, `features_test.py` — defines features for the model\n",
+ " - `preprocessing.py`, `preprocessing_test.py` — defines preprocessing\n",
+ " routines for data\n",
+ " - `constants.py` — defines constants of the model\n",
+ " - `model.py`, `model_test.py` — defines ML model using ML frameworks\n",
+ " like TensorFlow\n",
+ "- `local_runner.py` — defines a runner for the local environment which uses\n",
+ "the local orchestration engine\n",
+ "- `kubeflow_runner.py` — defines a runner for the Kubeflow Pipelines\n",
+ "orchestration engine\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "saF1CpVefaf1"
+ },
+ "source": [
+ "By default, the template only includes standard TFX components. If you need\n",
+ "some customized actions, you can create custom components for your pipeline.\n",
+ "Please see\n",
+ "[TFX custom component guide](../../../guide/understanding_custom_components)\n",
+ "for details."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ROwHAsDK0olT"
+ },
+ "source": [
+ "#### Unit-test files.\n",
+ "\n",
+ "You might notice that there are some files with `_test.py` in their name.\n",
+ "These are unit tests of the pipeline, and it is recommended to add more unit\n",
+ "tests as you implement your own pipelines.\n",
+ "You can run unit tests by supplying the module name of test files with the\n",
+ "`-m` flag. You can usually get a module name by deleting the `.py` extension\n",
+ "and replacing `/` with `.`. For example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "M0cMdE2Z0olU"
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "!{sys.executable} -m models.features_test"
+ ]
+ },
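+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mNmodName0lV"
+ },
+ "source": [
+ "The path-to-module-name rule above can be sketched as a tiny helper function\n",
+ "(a hypothetical illustration, not part of the template):\n",
+ "\n",
+ "```python\n",
+ "def module_name(path):\n",
+ "  # Drop the `.py` extension and replace `/` with `.`.\n",
+ "  return path[:-len('.py')].replace('/', '.')\n",
+ "\n",
+ "module_name('models/features_test.py')  # 'models.features_test'\n",
+ "```"
+ ]
+ },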
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tO9Jhplo0olX"
+ },
+ "source": [
+ "### Create a TFX pipeline in local environment.\n",
+ "\n",
+ "TFX supports several orchestration engines to run pipelines. We will use the\n",
+ "local orchestration engine, which runs without any further dependencies. It is\n",
+ "suitable for development and debugging because it runs in the local\n",
+ "environment rather than depending on remote computing clusters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZO6yR8lOo5GZ"
+ },
+ "source": [
+ "We will use `local_runner.py` to run your pipeline using the local\n",
+ "orchestrator. You have to create a pipeline before running it. You can create\n",
+ "a pipeline with the `pipeline create` command.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_9unbcHlo7Yi"
+ },
+ "outputs": [],
+ "source": [
+ "!tfx pipeline create --engine=local --pipeline_path=local_runner.py"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NrRL6R06o99S"
+ },
+ "source": [
+ "The `pipeline create` command registers your pipeline defined in\n",
+ "`local_runner.py` without actually running it.\n",
+ "\n",
+ "You will run the created pipeline with the `run create` command in the\n",
+ "following steps.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7p73589GbTFi"
+ },
+ "source": [
+ "## Step 2. Ingest YOUR data to the pipeline.\n",
+ "\n",
+ "The initial pipeline ingests the penguin dataset which is included in the\n",
+ "template. You need to put your data into the pipeline, and most TFX\n",
+ "pipelines start with an ExampleGen component."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AEVwa28qtjwi"
+ },
+ "source": [
+ "### Choose an ExampleGen\n",
+ "\n",
+ "Your data can be stored anywhere your pipeline can access, on either a local\n",
+ "or distributed filesystem, or a queryable system. TFX provides various\n",
+ "[`ExampleGen` components](../../../guide/examplegen)\n",
+ "to bring your data into a TFX pipeline. You can choose one of the following\n",
+ "example-generating components.\n",
+ "\n",
+ "- CsvExampleGen: Reads CSV files in a directory. Used in\n",
+ "[penguin example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/penguin)\n",
+ "and\n",
+ "[Chicago taxi example](https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline).\n",
+ "- ImportExampleGen: Takes TFRecord files with TF Example data format. Used in\n",
+ "[MNIST examples](https://github.com/tensorflow/tfx/tree/master/tfx/examples/mnist).\n",
+ "- FileBasedExampleGen for\n",
+ "[Avro](https://github.com/tensorflow/tfx/blob/master/tfx/components/example_gen/custom_executors/avro_executor.py)\n",
+ "or\n",
+ "[Parquet](https://github.com/tensorflow/tfx/blob/master/tfx/components/example_gen/custom_executors/parquet_executor.py)\n",
+ "format.\n",
+ "- [BigQueryExampleGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/extensions/google_cloud_big_query/example_gen/component/BigQueryExampleGen):\n",
+ "Reads data in Google Cloud BigQuery directly. Used in\n",
+ "[Chicago taxi examples](https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline).\n",
+ "\n",
+ "You can also create your own ExampleGen. For example, TFX includes\n",
+ "[a custom ExampleGen which uses Presto](https://github.com/tensorflow/tfx/tree/master/tfx/examples/custom_components/presto_example_gen)\n",
+ "as a data source. See\n",
+ "[the guide](../../../guide/examplegen#custom_examplegen)\n",
+ "for more information on how to use and develop custom executors.\n",
+ "\n",
+ "Once you decide which ExampleGen to use, you will need to modify the pipeline\n",
+ "definition to use your data.\n",
+ "\n",
+ "1. Modify the `DATA_PATH` in `local_runner.py` and set it to the location of\n",
+ "your files.\n",
+ " - If you have files in local environment, specify the path. This is the\n",
+ " best option for developing or debugging a pipeline.\n",
+ " - If the files are stored in GCS, you can use a path starting with\n",
+ " `gs://{bucket_name}/...`. Please make sure that you can access GCS from\n",
+ " your terminal, for example, using\n",
+ " [`gsutil`](https://cloud.google.com/storage/docs/gsutil).\n",
+ " Please follow\n",
+ " [authorization guide in Google Cloud](https://cloud.google.com/sdk/docs/authorizing)\n",
+ " if needed.\n",
+ " - If you want to use a Query-based ExampleGen like BigQueryExampleGen, you\n",
+ " need a Query statement to select data from the data source. There are a few\n",
+ " more things you need to set to use Google Cloud BigQuery as a data source.\n",
+ " - In `pipeline/configs.py`:\n",
+ " - Change `GOOGLE_CLOUD_PROJECT` and `GCS_BUCKET_NAME` to your GCP\n",
+ " project and bucket name. The bucket should exist before we run\n",
+ " the pipeline.\n",
+ " - Uncomment `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` variable.\n",
+ " - Uncomment and set `BIG_QUERY_QUERY` variable to\n",
+ " **your query statement**.\n",
+ " - In `local_runner.py`:\n",
+ " - Comment out `data_path` argument and uncomment `query` argument\n",
+ " instead in `pipeline.create_pipeline()`.\n",
+ " - In `pipeline/pipeline.py`:\n",
+ " - Comment out `data_path` argument and uncomment `query` argument\n",
+ " in `create_pipeline()`.\n",
+ " - Use\n",
+ " [BigQueryExampleGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/extensions/google_cloud_big_query/example_gen/component/BigQueryExampleGen)\n",
+ " instead of CsvExampleGen.\n",
+ "\n",
+ "1. Replace the existing CsvExampleGen with your ExampleGen class in\n",
+ "`pipeline/pipeline.py`. Each ExampleGen class has a different signature.\n",
+ "Please see [ExampleGen component guide](../../../guide/examplegen) for more detail. Don't forget to import required modules with\n",
+ "`import` statements in `pipeline/pipeline.py`."
+ ]
+ },
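+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "k2swapEgC3ll"
+ },
+ "source": [
+ "For instance, switching from CSV files to TFRecord files would look roughly\n",
+ "like this in `pipeline/pipeline.py` (a sketch; arguments besides `input_base`\n",
+ "depend on your pipeline code):\n",
+ "\n",
+ "```python\n",
+ "from tfx.components import ImportExampleGen\n",
+ "\n",
+ "# Before: example_gen = CsvExampleGen(input_base=data_path)\n",
+ "example_gen = ImportExampleGen(input_base=data_path)\n",
+ "```"
+ ]
+ },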
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xiAG6acjbl5U"
+ },
+ "source": [
+ "The initial pipeline consists of four components: `ExampleGen`,\n",
+ "`StatisticsGen`, `SchemaGen` and `ExampleValidator`. We don't need to change\n",
+ "anything for `StatisticsGen`, `SchemaGen` and `ExampleValidator`. Let's run the\n",
+ "pipeline for the first time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tKOI48WumF7h"
+ },
+ "outputs": [],
+ "source": [
+ "# Update and run the pipeline.\n",
+ "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
+ " && tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eFbYFNVbbzna"
+ },
+ "source": [
+ "You should see \"Component ExampleValidator is finished.\" if the pipeline ran successfully."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uuD5FRPAcOn8"
+ },
+ "source": [
+ "### Examine output of the pipeline."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tL1wWoDh5wkj"
+ },
+ "source": [
+ "A TFX pipeline produces two kinds of output: artifacts and a\n",
+ "[metadata DB (MLMD)](../../../guide/mlmd) which contains\n",
+ "metadata of artifacts and pipeline executions. The location of the output is\n",
+ "defined in `local_runner.py`. By default, artifacts are stored under the\n",
+ "`tfx_pipeline_output` directory and metadata is stored as an SQLite database\n",
+ "under the `tfx_metadata` directory.\n",
+ "\n",
+ "You can use MLMD APIs to examine these outputs. First, we will define some\n",
+ "utility functions to search output artifacts that were just produced."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "K0i_jTvOI8mv"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "import tfx\n",
+ "from ml_metadata import errors\n",
+ "from ml_metadata.proto import metadata_store_pb2\n",
+ "from tfx.types import artifact_utils\n",
+ "\n",
+ "# TODO(b/171447278): Move these functions into TFX library.\n",
+ "\n",
+ "def get_latest_executions(store, pipeline_name, component_id = None):\n",
+ " \"\"\"Fetch all pipeline runs.\"\"\"\n",
+ " if component_id is None: # Find entire pipeline runs.\n",
+ " run_contexts = [\n",
+ " c for c in store.get_contexts_by_type('run')\n",
+ " if c.properties['pipeline_name'].string_value == pipeline_name\n",
+ " ]\n",
+ " else: # Find specific component runs.\n",
+ " run_contexts = [\n",
+ " c for c in store.get_contexts_by_type('component_run')\n",
+ " if c.properties['pipeline_name'].string_value == pipeline_name and\n",
+ " c.properties['component_id'].string_value == component_id\n",
+ " ]\n",
+ " if not run_contexts:\n",
+ " return []\n",
+ " # Pick the latest run context.\n",
+ " latest_context = max(run_contexts,\n",
+ " key=lambda c: c.last_update_time_since_epoch)\n",
+ " return store.get_executions_by_context(latest_context.id)\n",
+ "\n",
+ "def get_latest_artifacts(store, pipeline_name, component_id = None):\n",
+ " \"\"\"Fetch all artifacts from latest pipeline execution.\"\"\"\n",
+ " executions = get_latest_executions(store, pipeline_name, component_id)\n",
+ "\n",
+ " # Fetch all artifacts produced from the given executions.\n",
+ " execution_ids = [e.id for e in executions]\n",
+ " events = store.get_events_by_execution_ids(execution_ids)\n",
+ " artifact_ids = [\n",
+ " event.artifact_id for event in events\n",
+ " if event.type == metadata_store_pb2.Event.OUTPUT\n",
+ " ]\n",
+ " return store.get_artifacts_by_id(artifact_ids)\n",
+ "\n",
+ "def find_latest_artifacts_by_type(store, artifacts, artifact_type):\n",
+ " \"\"\"Get the latest artifacts of a specified type.\"\"\"\n",
+ " # Get type information from MLMD\n",
+ " try:\n",
+ " artifact_type = store.get_artifact_type(artifact_type)\n",
+ " except errors.NotFoundError:\n",
+ " return []\n",
+ " # Filter artifacts with type.\n",
+ " filtered_artifacts = [artifact for artifact in artifacts\n",
+ " if artifact.type_id == artifact_type.id]\n",
+ " # Convert MLMD artifact data into TFX Artifact instances.\n",
+ " return [artifact_utils.deserialize_artifact(artifact_type, artifact)\n",
+ " for artifact in filtered_artifacts]\n",
+ "\n",
+ "\n",
+ "from tfx.orchestration.experimental.interactive import visualizations\n",
+ "\n",
+ "def visualize_artifacts(artifacts):\n",
+ " \"\"\"Visualizes artifacts using standard visualization modules.\"\"\"\n",
+ " for artifact in artifacts:\n",
+ " visualization = visualizations.get_registry().get_visualization(\n",
+ " artifact.type_name)\n",
+ " if visualization:\n",
+ " visualization.display(artifact)\n",
+ "\n",
+ "from tfx.orchestration.experimental.interactive import standard_visualizations\n",
+ "standard_visualizations.register_standard_visualizations()\n",
+ "\n",
+ "import pprint\n",
+ "\n",
+ "from tfx.orchestration import metadata\n",
+ "from tfx.types import artifact_utils\n",
+ "from tfx.types import standard_artifacts\n",
+ "\n",
+ "def preview_examples(artifacts):\n",
+ " \"\"\"Preview a few records from Examples artifacts.\"\"\"\n",
+ " pp = pprint.PrettyPrinter()\n",
+ " for artifact in artifacts:\n",
+ " print(\"==== Examples artifact:{}({})\".format(artifact.name, artifact.uri))\n",
+ " for split in artifact_utils.decode_split_names(artifact.split_names):\n",
+ " print(\"==== Reading from split:{}\".format(split))\n",
+ " split_uri = artifact_utils.get_split_uri([artifact], split)\n",
+ "\n",
+ " # Get the list of files in this directory (all compressed TFRecord files)\n",
+ " tfrecord_filenames = [os.path.join(split_uri, name)\n",
+ " for name in os.listdir(split_uri)]\n",
+ " # Create a `TFRecordDataset` to read these files\n",
+ " dataset = tf.data.TFRecordDataset(tfrecord_filenames,\n",
+ " compression_type=\"GZIP\")\n",
+ " # Iterate over the first 2 records and decode them.\n",
+ " for tfrecord in dataset.take(2):\n",
+ " serialized_example = tfrecord.numpy()\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(serialized_example)\n",
+ " pp.pprint(example)\n",
+ "\n",
+ "import local_runner\n",
+ "\n",
+ "metadata_connection_config = metadata.sqlite_metadata_connection_config(\n",
+ " local_runner.METADATA_PATH)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cmwor9nVcmxy"
+ },
+ "source": [
+ "Now we can read metadata of output artifacts from MLMD."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "TtsrZEUB1-J4"
+ },
+ "outputs": [],
+ "source": [
+ "with metadata.Metadata(metadata_connection_config) as metadata_handler:\n",
+ " # Search all artifact from the previous pipeline run.\n",
+ " artifacts = get_latest_artifacts(metadata_handler.store, PIPELINE_NAME)\n",
+ " # Find artifacts of Examples type.\n",
+ " examples_artifacts = find_latest_artifacts_by_type(\n",
+ " metadata_handler.store, artifacts,\n",
+ " standard_artifacts.Examples.TYPE_NAME)\n",
+ " # Find artifacts generated from StatisticsGen.\n",
+ " stats_artifacts = find_latest_artifacts_by_type(\n",
+ " metadata_handler.store, artifacts,\n",
+ " standard_artifacts.ExampleStatistics.TYPE_NAME)\n",
+ " # Find artifacts generated from SchemaGen.\n",
+ " schema_artifacts = find_latest_artifacts_by_type(\n",
+ " metadata_handler.store, artifacts,\n",
+ " standard_artifacts.Schema.TYPE_NAME)\n",
+ " # Find artifacts generated from ExampleValidator.\n",
+ " anomalies_artifacts = find_latest_artifacts_by_type(\n",
+ " metadata_handler.store, artifacts,\n",
+ " standard_artifacts.ExampleAnomalies.TYPE_NAME)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3U5MNAUIdBtN"
+ },
+ "source": [
+ "Now we can examine outputs from each component.\n",
+ "[TensorFlow Data Validation (TFDV)](https://www.tensorflow.org/tfx/data_validation/get_started)\n",
+ "is used in `StatisticsGen`, `SchemaGen` and `ExampleValidator`, and TFDV can\n",
+ "be used to visualize outputs from these components.\n",
+ "\n",
+ "In this tutorial, we will use visualization helper methods in TFX which use TFDV\n",
+ "internally to show the visualization. Please see\n",
+ "[TFX components tutorial](/tutorials/tfx/components_keras)\n",
+ "to learn more about each component."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AxS6FsgU2IoZ"
+ },
+ "source": [
+ "#### Examine output from ExampleGen\n",
+ "\n",
+ "Let's examine output from ExampleGen. Take a look at the first two examples for\n",
+ "each split:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "3NWzXEcE13tW"
+ },
+ "outputs": [],
+ "source": [
+ "preview_examples(examples_artifacts)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Q0uiuPhkGEBz"
+ },
+ "source": [
+ "By default, TFX ExampleGen divides examples into two splits, *train* and\n",
+ "*eval*, but you can\n",
+ "[adjust your split configuration](../../../guide/examplegen#span_version_and_split)."
+ ]
+ },
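+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "splitCfgC3ll"
+ },
+ "source": [
+ "For example, a 3:1 train/eval split could be configured roughly like this\n",
+ "using TFX's `example_gen_pb2` protos (a sketch, passed to your ExampleGen as\n",
+ "`output_config`):\n",
+ "\n",
+ "```python\n",
+ "from tfx.proto import example_gen_pb2\n",
+ "\n",
+ "output_config = example_gen_pb2.Output(\n",
+ "    split_config=example_gen_pb2.SplitConfig(splits=[\n",
+ "        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=3),\n",
+ "        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=1),\n",
+ "    ]))\n",
+ "# example_gen = CsvExampleGen(input_base=data_path, output_config=output_config)\n",
+ "```"
+ ]
+ },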
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yVh13wJu-IRv"
+ },
+ "source": [
+ "#### Examine output from StatisticsGen\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "9LipxUp7-IRw",
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "visualize_artifacts(stats_artifacts)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8aebEY4c0Ju7"
+ },
+ "source": [
+ "These statistics are supplied to SchemaGen to construct a schema of data\n",
+ "automatically."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ExEKbdw8-IRx"
+ },
+ "source": [
+ "#### Examine output from SchemaGen\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "M2IURBSp-IRy"
+ },
+ "outputs": [],
+ "source": [
+ "visualize_artifacts(schema_artifacts)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oTvj8yeBHDdU"
+ },
+ "source": [
+ "This schema is automatically inferred from the output of StatisticsGen.\n",
+ "We will use this generated schema in this tutorial, but you also can\n",
+ "[modify and customize the schema](../../../guide/statsgen#creating_a_curated_schema)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rl1PEUgo-IRz"
+ },
+ "source": [
+ "#### Examine output from ExampleValidator\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "F-4oAjGR-IR0",
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "visualize_artifacts(anomalies_artifacts)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t026ZzbU0961"
+ },
+ "source": [
+ "If any anomalies were found, you should review your data to check that all\n",
+ "examples follow your assumptions. Outputs from other components like\n",
+ "StatisticsGen might be useful. Anomalies that are found don't block the\n",
+ "pipeline execution."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lFMmqy1W-IR1"
+ },
+ "source": [
+ "You can see the available features from the outputs of `SchemaGen`. If\n",
+ "your features can be used to construct an ML model in `Trainer` directly, you\n",
+ "can skip the next step and go to Step 4. Otherwise you can do some feature\n",
+ "engineering work in the next step. The `Transform` component is needed when\n",
+ "full-pass operations like calculating averages are required, especially when\n",
+ "you need to scale features."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bYH8Y2KB0olm"
+ },
+ "source": [
+ "## Step 3. (Optional) Feature engineering with Transform component.\n",
+ "\n",
+ "In this step, you will define various feature engineering jobs which will be\n",
+ "used by the `Transform` component in the pipeline. See the\n",
+ "[Transform component guide](../../../guide/transform)\n",
+ "for more information.\n",
+ "\n",
+ "This is only necessary if your training code requires additional feature(s)\n",
+ "which are not available in the output of ExampleGen. Otherwise, feel free to\n",
+ "fast forward to the next step of using Trainer."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qm_JjQUydbbb"
+ },
+ "source": [
+ "### Define features of the model\n",
+ "\n",
+ "`models/features.py` contains constants to define features for the model,\n",
+ "including feature names, size of vocabulary and so on. By default the `penguin`\n",
+ "template has two constants, `FEATURE_KEYS` and `LABEL_KEY`, because our `penguin`\n",
+ "model solves a classification problem using supervised learning and all\n",
+ "features are continuous numeric features. See\n",
+ "[feature definitions from the chicago taxi example](https://github.com/tensorflow/tfx/blob/master/tfx/experimental/templates/taxi/models/features.py)\n",
+ "for another example.\n"
+ ]
+ },
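+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "featConstC3ll"
+ },
+ "source": [
+ "For reference, the feature constants in the `penguin` template look roughly\n",
+ "like this (check `models/features.py` in your copy for the exact values):\n",
+ "\n",
+ "```python\n",
+ "FEATURE_KEYS = [\n",
+ "    'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'\n",
+ "]\n",
+ "LABEL_KEY = 'species'\n",
+ "```"
+ ]
+ },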
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ATUeHXvJdcBn"
+ },
+ "source": [
+ "### Implement preprocessing for training / serving in preprocessing_fn().\n",
+ "\n",
+ "Actual feature engineering happens in the `preprocessing_fn()` function in\n",
+ "`models/preprocessing.py`.\n",
+ "\n",
+ "In `preprocessing_fn` you can define a series of functions that manipulate the\n",
+ "input dict of tensors to produce the output dict of tensors. There are helper\n",
+ "functions like `scale_to_0_1` and `compute_and_apply_vocabulary` in the\n",
+ "TensorFlow Transform API or you can simply use regular TensorFlow functions.\n",
+ "By default the `penguin` template includes an example usage of the\n",
+ "[tft.scale_to_z_score](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/scale_to_z_score)\n",
+ "function to normalize feature values.\n",
+ "\n",
+ "See the [TensorFlow Transform guide](https://www.tensorflow.org/tfx/transform/get_started)\n",
+ "for more information about authoring `preprocessing_fn`.\n"
+ ]
+ },
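+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zScoreC3ll"
+ },
+ "source": [
+ "Conceptually, `tft.scale_to_z_score` shifts and scales a feature to zero mean\n",
+ "and unit variance computed over the whole dataset, which is why it needs a\n",
+ "full pass over the data. A plain-Python illustration of the math (not the TFT\n",
+ "implementation):\n",
+ "\n",
+ "```python\n",
+ "def z_score(values):\n",
+ "  # Normalize to zero mean and unit variance over the full list of values.\n",
+ "  mean = sum(values) / len(values)\n",
+ "  variance = sum((v - mean) ** 2 for v in values) / len(values)\n",
+ "  return [(v - mean) / variance ** 0.5 for v in values]\n",
+ "\n",
+ "z_score([1.0, 2.0, 3.0])  # the mean, 2.0, maps to 0.0\n",
+ "```"
+ ]
+ },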
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xUg_Lc43dbTp"
+ },
+ "source": [
+ "### Add Transform component to the pipeline.\n",
+ "\n",
+ "If your `preprocessing_fn` is ready, add the `Transform` component to the pipeline.\n",
+ "\n",
+ "1. In `pipeline/pipeline.py` file, uncomment `# components.append(transform)`\n",
+ "to add the component to the pipeline.\n",
+ "\n",
+ "You can update the pipeline and run again."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VE-Pqvto0olm"
+ },
+ "outputs": [],
+ "source": [
+ "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
+ " && tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8q1ZYEHX0olo"
+ },
+ "source": [
+ "If the pipeline ran successfully, you should see \"Component Transform is\n",
+ "finished.\" *somewhere* in the log. Because the `Transform` component and the\n",
+ "`ExampleValidator` component are not dependent on each other, the order of\n",
+ "execution is not fixed. That said, either `Transform` or\n",
+ "`ExampleValidator` can be the last component in the pipeline execution."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XrPEnZt0E_0m"
+ },
+ "source": [
+ "### Examine output from Transform\n",
+ "\n",
+ "The Transform component creates two kinds of outputs: a TensorFlow graph and\n",
+ "transformed examples. The transformed examples are of the Examples artifact\n",
+ "type, which is also produced by ExampleGen, but this one contains transformed\n",
+ "feature values instead.\n",
+ "\n",
+ "You can examine them as we did in the previous step."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FvC5S66ZU5g6"
+ },
+ "outputs": [],
+ "source": [
+ "with metadata.Metadata(metadata_connection_config) as metadata_handler:\n",
+ " # Search all artifact from the previous run of Transform component.\n",
+ " artifacts = get_latest_artifacts(metadata_handler.store,\n",
+ " PIPELINE_NAME, \"Transform\")\n",
+ " # Find artifacts of Examples type.\n",
+ " transformed_examples_artifacts = find_latest_artifacts_by_type(\n",
+ " metadata_handler.store, artifacts,\n",
+ " standard_artifacts.Examples.TYPE_NAME)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "CAFiEPKuC6Ib"
+ },
+ "outputs": [],
+ "source": [
+ "preview_examples(transformed_examples_artifacts)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dWMBXU510olp"
+ },
+ "source": [
+ "## Step 4. Train your model with Trainer component.\n",
+ "\n",
+ "We will build an ML model using the `Trainer` component. See the\n",
+ "[Trainer component guide](../../../guide/trainer)\n",
+ "for more information. You need to provide your model code to the Trainer\n",
+ "component.\n",
+ "\n",
+ "### Define your model.\n",
+ "\n",
+ "In the penguin template, `models.model.run_fn` is used as the `run_fn` argument\n",
+ "for the `Trainer` component. This means that the `run_fn()` function in\n",
+ "`models/model.py` will be called when the `Trainer` component runs. The given\n",
+ "code constructs a simple DNN model using the `keras` API. See the\n",
+ "[TensorFlow 2.x in TFX](../../../guide/keras)\n",
+ "guide for more information about using keras API in TFX.\n",
+ "\n",
+ "In this `run_fn`, you should build a model and save it to a directory pointed\n",
+ "by `fn_args.serving_model_dir` which is specified by the component. You can use\n",
+ "other arguments in `fn_args` which is passed into the `run_fn`. See\n",
+ "[related codes](https://github.com/tensorflow/tfx/blob/b01482442891a49a1487c67047e85ab971717b75/tfx/components/trainer/executor.py#L141)\n",
+ "for the full list of arguments in `fn_args`.\n",
+ "\n",
+ "Define your features in `models/features.py` and use them as needed. If you\n",
+ "have transformed your features in Step 3, you should use transformed features\n",
+ "as inputs to your model."
+ ]
+ },
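+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "As an illustrative sketch (not the exact template code), a minimal `run_fn`\n",
+      "that builds a Keras model and saves it to `fn_args.serving_model_dir` could\n",
+      "look like the following. The feature names and layer sizes here are\n",
+      "hypothetical placeholders; use the features defined in `models/features.py`.\n",
+      "\n",
+      "```python\n",
+      "import tensorflow as tf\n",
+      "\n",
+      "def run_fn(fn_args):\n",
+      "  # Build a simple DNN; replace the feature names with your own.\n",
+      "  inputs = [tf.keras.layers.Input(shape=(1,), name=f)\n",
+      "            for f in ['feature_a', 'feature_b']]  # hypothetical features\n",
+      "  x = tf.keras.layers.concatenate(inputs)\n",
+      "  x = tf.keras.layers.Dense(8, activation='relu')(x)\n",
+      "  outputs = tf.keras.layers.Dense(3, activation='softmax')(x)\n",
+      "  model = tf.keras.Model(inputs=inputs, outputs=outputs)\n",
+      "  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')\n",
+      "  # ... train the model using datasets built from fn_args.train_files ...\n",
+      "  # Save the model to the location specified by the Trainer component.\n",
+      "  model.save(fn_args.serving_model_dir, save_format='tf')\n",
+      "```"
+     ]
+    },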
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oFiLIaCm-IR4"
+ },
+ "source": [
+ "### Add Trainer component to the pipeline.\n",
+ "\n",
+ "If your run_fn is ready, add `Trainer` component to the pipeline.\n",
+ "\n",
+ "1. In `pipeline/pipeline.py` file, uncomment `# components.append(trainer)`\n",
+ "to add the component to the pipeline.\n",
+ "\n",
+ "Arguments for the trainer component might depends on whether you use Transform\n",
+ "component or not.\n",
+ "- If you do **NOT** use `Transform` component, you don't need to change the\n",
+ "arguments.\n",
+ "- If you use `Transform` component, you need to change arguments\n",
+ "when creating a `Trainer` component instance.\n",
+ " - Change `examples` argument to\n",
+ " `examples=transform.outputs['transformed_examples'],`. We need to use\n",
+ " transformed examples for training.\n",
+ " - Add `transform_graph` argument like\n",
+ " `transform_graph=transform.outputs['transform_graph'],`. This graph\n",
+ " contains TensorFlow graph for the transform operations.\n",
+ " - After above changes, the code for Trainer component creation will\n",
+ " look like following.\n",
+ "\n",
+ " ```python\n",
+ " # If you use a Transform component.\n",
+ " trainer = Trainer(\n",
+ " run_fn=run_fn,\n",
+ " examples=transform.outputs['transformed_examples'],\n",
+ " transform_graph=transform.outputs['transform_graph'],\n",
+ " schema=schema_gen.outputs['schema'],\n",
+ " ...\n",
+ " ```\n",
+ "\n",
+ "You can update the pipeline and run again."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VQDNitkH0olq"
+ },
+ "outputs": [],
+ "source": [
+ "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
+ " && tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ksWfVQUnMYCX"
+ },
+ "source": [
+ "When this execution runs successfully, you have now created and run your first\n",
+ "TFX pipeline for your model. Congratulations!\n",
+ "\n",
+ "Your new model will be located in some place under the output directory, but it\n",
+ "would be better to have a model in fixed location or service outside of the TFX\n",
+ "pipeline which holds many interim results. Even better with continuous\n",
+ "evaluation of the built model which is critical in ML production systems. We\n",
+ "will see how continuous evaluation and deployments work in TFX in the next step."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4DRTFdTy0ol3"
+ },
+ "source": [
+ "## Step 5. (Optional) Evaluate the model with Evaluator and publish with pusher.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5DID2nzH-IR7"
+ },
+ "source": [
+ "[`Evaluator`](../../../guide/evaluator) component\n",
+ "continuously evaluate every built model from `Trainer`, and\n",
+ "[`Pusher`](../../../guide/pusher) copies the model to\n",
+ "a predefined location in the file system or even to\n",
+ "[Google Cloud AI Platform Models](https://console.cloud.google.com/ai-platform/models).\n",
+ "\n",
+ "### Adds Evaluator component to the pipeline.\n",
+ "\n",
+ "In `pipeline/pipeline.py` file:\n",
+ "1. Uncomment `# components.append(model_resolver)` to add latest model resolver\n",
+ "to the pipeline. Evaluator can be used to compare a model with old baseline\n",
+ "model which passed Evaluator in last pipeline run. `LatestBlessedModelResolver`\n",
+ "finds the latest model which passed Evaluator.\n",
+ "1. Set proper `tfma.MetricsSpec` for your model. Evaluation might be different\n",
+ "for every ML model. In the penguin template, `SparseCategoricalAccuracy` was used\n",
+ "because we are solving a multi category classification problem. You also need\n",
+ "to specify `tfma.SliceSpec` to analyze your model for specific slices. For more\n",
+ "detail, see\n",
+ "[Evaluator component guide](../../../guide/evaluator).\n",
+ "1. Uncomment `# components.append(evaluator)` to add the component to the\n",
+ "pipeline.\n",
+ "\n",
+ "You can update the pipeline and run again."
+ ]
+ },
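+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "For reference, a `tfma.EvalConfig` for a multi-class problem might look like\n",
+      "the following sketch. The label key and the threshold value here are\n",
+      "hypothetical examples; choose ones appropriate for your model.\n",
+      "\n",
+      "```python\n",
+      "import tensorflow_model_analysis as tfma\n",
+      "\n",
+      "eval_config = tfma.EvalConfig(\n",
+      "    model_specs=[tfma.ModelSpec(label_key='species')],  # hypothetical label key\n",
+      "    slicing_specs=[tfma.SlicingSpec()],  # an empty spec means the overall slice\n",
+      "    metrics_specs=[\n",
+      "        tfma.MetricsSpec(metrics=[\n",
+      "            tfma.MetricConfig(\n",
+      "                class_name='SparseCategoricalAccuracy',\n",
+      "                threshold=tfma.MetricThreshold(\n",
+      "                    value_threshold=tfma.GenericValueThreshold(\n",
+      "                        lower_bound={'value': 0.6})))  # hypothetical threshold\n",
+      "        ])\n",
+      "    ])\n",
+      "```"
+     ]
+    },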
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "i5_ojoZZmaDQ"
+ },
+ "outputs": [],
+ "source": [
+ "# Update and run the pipeline.\n",
+ "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
+ " && tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "apZX74qJ-IR7"
+ },
+ "source": [
+ "### Examine output of Evaluator\n",
+ "This step requires TensorFlow Model Analysis(TFMA) Jupyter notebook extension.\n",
+ "Note that the version of the TFMA notebook extension should be identical to the\n",
+ "version of TFMA python package.\n",
+ "\n",
+ "Following command will install TFMA notebook extension from NPM registry. It\n",
+ "might take several minutes to complete."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VoL46D5Pw5FX"
+ },
+ "outputs": [],
+ "source": [
+ "# Install TFMA notebook extension.\n",
+ "!jupyter labextension install tensorflow_model_analysis@{tfma.__version__}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GKMo4j8ww5PB"
+ },
+ "source": [
+ "If installation is completed, please **reload your browser** to make the\n",
+ "extension take effect."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2ztotdqS-IR8"
+ },
+ "outputs": [],
+ "source": [
+ "with metadata.Metadata(metadata_connection_config) as metadata_handler:\n",
+ " # Search all aritfacts from the previous pipeline run.\n",
+ " artifacts = get_latest_artifacts(metadata_handler.store, PIPELINE_NAME)\n",
+ " model_evaluation_artifacts = find_latest_artifacts_by_type(\n",
+ " metadata_handler.store, artifacts,\n",
+ " standard_artifacts.ModelEvaluation.TYPE_NAME)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tVojwMCuDJuk"
+ },
+ "outputs": [],
+ "source": [
+ "if model_evaluation_artifacts:\n",
+ " tfma_result = tfma.load_eval_result(model_evaluation_artifacts[0].uri)\n",
+ " tfma.view.render_slicing_metrics(tfma_result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "18tqjyHN-IR9"
+ },
+ "source": [
+ "### Adds Pusher component to the pipeline.\n",
+ "\n",
+ "If the model looks promising, we need to publish the model.\n",
+ "[Pusher component](../../../guide/pusher)\n",
+ "can publish the model to a location in the filesystem or to GCP AI Platform\n",
+ "Models using\n",
+ "[a custom executor](https://github.com/tensorflow/tfx/blob/master/tfx/extensions/google_cloud_ai_platform/pusher/executor.py).\n",
+ "\n",
+ "`Evaluator` component continuously evaluate every built model from `Trainer`,\n",
+ "and [`Pusher`](../../../guide/pusher) copies the model to\n",
+ "a predefined location in the file system or even to\n",
+ "[Google Cloud AI Platform Models](https://console.cloud.google.com/ai-platform/models).\n",
+ "\n",
+ "1. In `local_runner.py`, set `SERVING_MODEL_DIR` to a directory to publish.\n",
+ "1. In `pipeline/pipeline.py` file, uncomment `# components.append(pusher)`\n",
+ "to add Pusher to the pipeline.\n",
+ "\n",
+ "You can update the pipeline and run again."
+ ]
+ },
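+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "For reference, the Pusher wiring looks roughly like the following sketch\n",
+      "(argument names follow the standard TFX `Pusher` API; `serving_model_dir` is\n",
+      "the directory you set via `SERVING_MODEL_DIR`).\n",
+      "\n",
+      "```python\n",
+      "from tfx.components import Pusher\n",
+      "from tfx.proto import pusher_pb2\n",
+      "\n",
+      "pusher = Pusher(\n",
+      "    model=trainer.outputs['model'],\n",
+      "    # Only models blessed by Evaluator will be pushed.\n",
+      "    model_blessing=evaluator.outputs['blessing'],\n",
+      "    push_destination=pusher_pb2.PushDestination(\n",
+      "        filesystem=pusher_pb2.PushDestination.Filesystem(\n",
+      "            base_directory=serving_model_dir)))\n",
+      "```"
+     ]
+    },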
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "QH81d9FsrSXS"
+ },
+ "outputs": [],
+ "source": [
+ "# Update and run the pipeline.\n",
+ "!tfx pipeline update --engine=local --pipeline_path=local_runner.py \\\n",
+ " && tfx run create --engine=local --pipeline_name={PIPELINE_NAME}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_K6Z18tC-IR-"
+ },
+ "source": [
+ "You should be able to find your new model at `SERVING_MODEL_DIR`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "20KRGsPX0ol3"
+ },
+ "source": [
+ "## Step 6. (Optional) Deploy your pipeline to Kubeflow Pipelines on GCP.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0X6vfy7s-IR-"
+ },
+ "source": [
+ "As mentioned earlier, `local_runner.py` is good for debugging or development\n",
+ "purpose but not a best solution for production workloads. In this step, we will\n",
+ "deploy the pipeline to Kubeflow Pipelines on Google Cloud.\n",
+ "\n",
+ "### Preparation\n",
+ "We need `kfp` python package and `skaffold` program to deploy a pipeline to a\n",
+ "Kubeflow Pipelines cluster."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Ge1bMUtU-IR_"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install --upgrade -q kfp\n",
+ "\n",
+ "# Download skaffold and set it executable.\n",
+ "!curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && chmod +x skaffold"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZsbnGq52-ISB"
+ },
+ "source": [
+ "You need to move `skaffold` binary to the place where your shell can find it.\n",
+ "Or you can specify the path to skaffold when you run `tfx` binary with\n",
+ "`--skaffold-cmd` flag."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4amQ0Elz-ISC"
+ },
+ "outputs": [],
+ "source": [
+ "# Move skaffold binary into your path\n",
+ "!mv skaffold /home/jupyter/.local/bin/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1rmyns-o-ISD"
+ },
+ "source": [
+ "You also need a Kubeflow Pipelines cluster to run the pipeline. Please\n",
+ "follow Step 1 and 2 in\n",
+ "[TFX on Cloud AI Platform Pipelines tutorial](/tutorials/tfx/cloud-ai-platform-pipelines).\n",
+ "\n",
+ "When your cluster is ready, open the pipeline dashboard by clicking\n",
+ "*Open Pipelines Dashboard* in the\n",
+ "[`Pipelines` page of the Google cloud console](http://console.cloud.google.com/ai-platform/pipelines).\n",
+ "The URL of this page is `ENDPOINT` to request a pipeline run. The endpoint\n",
+ "value is everything in the URL after the https://, up to, and including,\n",
+ "googleusercontent.com. Put your endpoint to following code block.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "hGyj-Qa3-ISD"
+ },
+ "outputs": [],
+ "source": [
+ "ENDPOINT='' # Enter your ENDPOINT here."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "igTo05YI-ISF"
+ },
+ "source": [
+ "To run our code in a Kubeflow Pipelines cluster, we need to pack our code into\n",
+ "a container image. The image will be built automatically while deploying our\n",
+ "pipeline, and you only need to set a name and an container registry for your\n",
+ "image. In our example, we will use\n",
+ "[Google Container registry](https://cloud.google.com/container-registry),\n",
+ "and name it `tfx-pipeline`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "5J3LrI0K-ISF"
+ },
+ "outputs": [],
+ "source": [
+ "# Read GCP project id from env.\n",
+ "shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null\n",
+ "GOOGLE_CLOUD_PROJECT=shell_output[0]\n",
+ "\n",
+ "# Docker image name for the pipeline image.\n",
+ "CUSTOM_TFX_IMAGE='gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Gg11pLmU-ISH"
+ },
+ "source": [
+ "### Set data location.\n",
+ "\n",
+ "Your data should be accessible from the Kubeflow Pipelines cluster. If you have\n",
+ "used data in your local environment, you might need to upload it to remote\n",
+ "storage like Google Cloud Storage. For example, we can upload penguin data to a\n",
+ "default bucket which is created automatically when a Kubeflow Pipelines cluster\n",
+ "is deployed like following."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "y8MmRIHi-ISH"
+ },
+ "outputs": [],
+ "source": [
+ "!gsutil cp data/data.csv gs://{GOOGLE_CLOUD_PROJECT}-kubeflowpipelines-default/tfx-template/data/penguin/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ASc1tDMm-ISJ"
+ },
+ "source": [
+ "Update the data location stored at `DATA_PATH` in `kubeflow_runner.py`.\n",
+ "\n",
+ "If you are using BigQueryExampleGen, there is no need to upload the data file,\n",
+ "but please make sure that `kubeflow_runner.py` uses the same `query` and\n",
+ "`beam_pipeline_args` argument for `pipeline.create_pipeline()` function."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q42gn3XS-ISK"
+ },
+ "source": [
+ "### Deploy the pipeline.\n",
+ "\n",
+ "If everything is ready, you can create a pipeline using `tfx pipeline create`\n",
+ "command.\n",
+ "> Note: When creating a pipeline for Kubeflow Pipelines, we need a container\n",
+ "image which will be used to run our pipeline. And `skaffold` will build the\n",
+ "image for us. Because `skaffold` pulls base images from the docker hub, it will\n",
+ "take 5~10 minutes when we build the image for the first time, but it will take\n",
+ "much less time from the second build.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ytZ0liBn-ISK"
+ },
+ "outputs": [],
+ "source": [
+ "!tfx pipeline create \\\n",
+ "--engine=kubeflow \\\n",
+ "--pipeline-path=kubeflow_runner.py \\\n",
+ "--endpoint={ENDPOINT} \\\n",
+ "--build-target-image={CUSTOM_TFX_IMAGE}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fFqUQxQG-ISM"
+ },
+ "source": [
+ "Now start an execution run with the newly created pipeline using the\n",
+ "`tfx run create` command."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "4ps-4RHz-ISM"
+ },
+ "outputs": [],
+ "source": [
+ "!tfx run create --engine=kubeflow --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Fx3LtAL0-ISN"
+ },
+ "source": [
+ "Or, you can also run the pipeline in the Kubeflow Pipelines dashboard. The new\n",
+ "run will be listed under `Experiments` in the Kubeflow Pipelines dashboard.\n",
+ "Clicking into the experiment will allow you to monitor progress and visualize\n",
+ "the artifacts created during the execution run."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mP8W6zjD-ISO"
+ },
+ "source": [
+ "If you are interested in running your pipeline on Kubeflow Pipelines,\n",
+ "find more instructions in\n",
+ "[TFX on Cloud AI Platform Pipelines tutorial](/tutorials/tfx/cloud-ai-platform-pipelines)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PTsgD_Kz-ISO"
+ },
+ "source": [
+ "### Cleaning up\n",
+ "\n",
+ "To clean up all Google Cloud resources used in this step, you can\n",
+ "[delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects)\n",
+ "you used for the tutorial.\n",
+ "\n",
+ "Alternatively, you can clean up individual resources by visiting each\n",
+ "consoles:\n",
+ "- [Google Cloud Storage](https://console.cloud.google.com/storage)\n",
+ "- [Google Container Registry](https://console.cloud.google.com/gcr)\n",
+ "- [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes)"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [
+ "DjUA6S30k52h"
+ ],
+ "name": "penguin_template.ipynb",
+ "provenance": [],
+ "toc_visible": true
+ },
+ "environment": {
+ "name": "tf2-gpu.2-1.m46",
+ "type": "gcloud",
+ "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m46"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}