Creating a complete TensorFlow 2 workflow in Amazon SageMaker

Managing the complete lifecycle of a deep learning project can be challenging, especially if you use multiple separate tools and services. For example, you may use different tools for data preprocessing, prototyping training and inference code, full-scale model training and tuning, model deployment, and workflow automation to orchestrate all of the above for production. Friction caused by switching tools can slow down projects and increase costs. This post shows how to efficiently manage the complete lifecycle of deep learning projects with Amazon SageMaker. TensorFlow 2 is the framework used in the example code, although the concepts described generally apply to other frameworks as well.

This post also has an associated sample notebook, which you can run in less than an hour to demonstrate all of the features discussed here. For more information, see the GitHub repo.

Overview of the Amazon SageMaker workflow

Every data science project using TensorFlow 2 or another framework begins with a dataset: obtaining, exploring, and preprocessing it. In the context of an Amazon SageMaker workflow, data exploration typically occurs within notebooks. Because notebooks run throughout most of the workday, they are best hosted on relatively small, less powerful, and inexpensive instance types.

Accordingly, unless the dataset is relatively small, a notebook isn’t the best place to perform full-scale data processing, model training, and inference; these tasks require parallel computing resources well beyond what a notebook instance provides. Instead, it’s much more practical and cost-effective to use Amazon SageMaker’s functionality for spinning up separate clusters of right-sized, more powerful instances that can complete these tasks promptly. These clusters are billed by the second, and Amazon SageMaker automatically shuts the instances down when the job completes. As a result, in a typical Amazon SageMaker workflow, the most frequent charges are for the relatively inexpensive notebooks used for data exploration and prototyping, rather than for the more powerful and expensive GPU and accelerated compute instances.

When prototyping is complete, you can move beyond notebooks with workflow automation. An automated pipeline is necessary for orchestrating the complete workflow through model deployment in a robust and repeatable way. Amazon SageMaker provides a native solution for this as well. The following sections of this post introduce various features of Amazon SageMaker that you can use to implement these project lifecycle stages.

Data transformation with Amazon SageMaker Processing

Amazon SageMaker Processing helps you preprocess large datasets in a right-sized, managed cluster separate from notebooks. Amazon SageMaker Processing includes off-the-shelf support for Scikit-learn and also supports any other containerized technology. For example, you can launch transient Apache Spark clusters for feature transformations within Amazon SageMaker Processing.

To use Amazon SageMaker Processing with Scikit-learn, supply a Python data preprocessing script with standard Scikit-learn code. There is only a minimal contract for the script: input and output data must be placed in specified locations. Amazon SageMaker Processing automatically loads the input data from Amazon Simple Storage Service (Amazon S3) and uploads transformed data back to Amazon S3 when the job is complete.
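For illustration, the following is a minimal sketch of what such a preprocessing.py script could look like. The only firm requirements are the container paths under /opt/ml/processing that Amazon SageMaker Processing uses for input and output; the file format and the Scikit-learn transformation shown here are assumptions, not the exact code from the sample notebook.

# preprocessing.py -- minimal sketch of an Amazon SageMaker Processing script.
# Only the /opt/ml/processing/... paths are part of the contract; the data
# format (.npy files) and the transformation below are illustrative.
import glob
import os

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

input_dir = '/opt/ml/processing/input'
train_dir = '/opt/ml/processing/train'
test_dir = '/opt/ml/processing/test'

if __name__ == '__main__':
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)

    # Load whatever shard of the raw data this instance received.
    raw = np.concatenate([np.load(f) for f in glob.glob(os.path.join(input_dir, '*.npy'))])
    x, y = raw[:, :-1], raw[:, -1]

    # Example transformation: standardize features, then split into train and test sets.
    x = StandardScaler().fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
    np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
    np.save(os.path.join(test_dir, 'x_test.npy'), x_test)
    np.save(os.path.join(test_dir, 'y_test.npy'), y_test)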

Before starting an Amazon SageMaker Processing job, instantiate a SKLearnProcessor object as shown in the following code example. In this object, specify the instance type to use in the job and the number of instances.

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=get_execution_role(),
                                     instance_type='ml.m5.xlarge',
                                     instance_count=2)

To distribute the data files equally among the cluster instances for processing, specify the ShardedByS3Key distribution type in the ProcessingInput object. This ensures that if there are n instances, each instance receives 1/n of the files from the specified S3 bucket. The ability to easily create a large cluster of instances for stateless data transformations is just one of the many benefits Amazon SageMaker Processing provides.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime 

# The bucket, s3_prefix, and raw_s3 (the S3 location of the raw input data)
# variables are defined earlier in the sample notebook.
processing_job_name = "tf-2-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}/data'.format(bucket, s3_prefix)

sklearn_processor.run(code='preprocessing.py',
                      job_name=processing_job_name,
                      inputs=[ProcessingInput(
                        source=raw_s3,
                        destination='/opt/ml/processing/input',
                        s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(output_name='train',
                                                destination='{}/train'.format(output_destination),
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test',
                                                destination='{}/test'.format(output_destination),
                                                source='/opt/ml/processing/test')])

Prototyping training and inference code with local mode

When the dataset is ready for training, the next step is to prototype the training code. For TensorFlow 2, the most convenient workflow is to provide a training script for ingestion by the Amazon SageMaker prebuilt TensorFlow 2 container. This feature is named script mode, and works seamlessly with the Amazon SageMaker local mode training feature.

Local mode is a convenient way to make sure code is working locally on a notebook before moving to full-scale, hosted training in a separate right-sized cluster that Amazon SageMaker manages. In local mode, you typically train for a short time for just a few epochs, possibly on only a sample of the full dataset, to confirm the code is working properly and avoid wasting full-scale training time. Also, specify the instance type as either local_gpu or local, depending on whether the notebook is on a GPU or CPU instance.

import sagemaker
from sagemaker.tensorflow import TensorFlow

git_config = {'repo': 'https://github.com/aws-samples/amazon-sagemaker-script-mode', 
              'branch': 'master'}

# Train for just a few epochs in local mode to confirm the code works.
model_dir = '/opt/ml/model'
train_instance_type = 'local'
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}
local_estimator = TensorFlow(git_config=git_config,
                             source_dir='tf-2-workflow/train_model',
                             entry_point='train.py',
                             model_dir=model_dir,
                             train_instance_type=train_instance_type,
                             train_instance_count=1,
                             hyperparameters=hyperparameters,
                             role=sagemaker.get_execution_role(),
                             base_job_name='tf-2-workflow',
                             framework_version='2.1',
                             py_version='py3',
                             script_mode=True)
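To run the local mode training job itself, call the estimator’s fit method. A minimal sketch follows; the channel names and file:// paths are assumptions for a copy of the transformed dataset stored on the notebook instance, not the exact paths used in the sample notebook.

# Launch training in a TensorFlow 2 container on the notebook instance itself.
# The channel names and local paths below are illustrative.
local_inputs = {'train': 'file:///tmp/data/train', 'test': 'file:///tmp/data/test'}
local_estimator.fit(local_inputs)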

Although local mode training is very useful to make sure training code is working before moving on to full-scale training, it’s also convenient to have an easy way to prototype inference code locally. One possibility is to fetch a TensorFlow SavedModel artifact or a model checkpoint saved in Amazon S3 and load it in a notebook for testing. However, an easier way to do this is to use local mode endpoints.

You can deploy a model in a local mode endpoint, which contains an Amazon SageMaker TensorFlow Serving container, by using the estimator object from the local mode training job. With one exception, this code is the same as the code for deploying a model to a separate hosted endpoint. Just invoke the local estimator’s deploy method, and again specify the instance type as either local_gpu or local, depending on whether the notebook is on a GPU or CPU instance.

local_predictor = local_estimator.deploy(initial_instance_count=1, instance_type='local')
local_results = local_predictor.predict(x_test[:10])['predictions']

Before using local mode, make sure that docker-compose or nvidia-docker-compose (for GPU instances) can be run on your instance. The GitHub repo for this blog post has a script you can use for this purpose.

Automatic Model Tuning

After prototyping is complete, the next step is to use Amazon SageMaker hosted training and automatic model tuning. Hosted training is preferred for full-scale training, especially large-scale, distributed training. Unlike local mode, for hosted training the actual training occurs not on the notebook itself, but on a separate cluster of machines that Amazon SageMaker manages. An estimator object for hosted training is similar to a local mode estimator, except that the training data is read from Amazon S3 rather than the local file system, and the instance type is set to an actual Amazon SageMaker ML instance type rather than local or local_gpu.

Also, because local mode prototyping proved the training code is working, you can modify the hosted training estimator to train for a larger number of epochs, and on the full dataset if you just used a sample in local mode.
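For reference, a hosted training estimator for this workflow might look like the following sketch. It reuses the same script, git_config, and model_dir as the local mode estimator; the specific instance type, hyperparameter values, and input channel definitions shown here are illustrative rather than taken verbatim from the sample notebook.

# Hosted training: same training script and container as local mode, but the job
# runs on a separate ML instance that Amazon SageMaker manages. The instance type
# and hyperparameter values below are illustrative.
train_instance_type = 'ml.c5.xlarge'
hyperparameters = {'epochs': 30, 'batch_size': 128, 'learning_rate': 0.01}

estimator = TensorFlow(git_config=git_config,
                       source_dir='tf-2-workflow/train_model',
                       entry_point='train.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-2-workflow',
                       framework_version='2.1',
                       py_version='py3',
                       script_mode=True)

# For hosted training and tuning, the input channels point at the transformed
# data that the Amazon SageMaker Processing job wrote to S3.
inputs = {'train': '{}/train'.format(output_destination),
          'test': '{}/test'.format(output_destination)}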

However, running individual hosted training jobs and manually tweaking hyperparameters in search of the best model is likely to be a daunting, time-consuming task. Selecting the right combination of hyperparameters depends on the dataset and algorithm. Some algorithms have many different hyperparameters that you can tweak, some are very sensitive to the hyperparameter values selected, and most have a non-linear relationship between model fit and hyperparameter values. Automatic model tuning speeds up the tuning process: it runs multiple training jobs with different hyperparameter combinations to find the set with the best model performance.

As shown in the following code example, to use automatic model tuning, first specify the hyperparameters to tune, their tuning ranges, and an objective metric to optimize. A HyperparameterTuner object takes these as parameters. Each tuning job also must specify a maximum number of training jobs within the tuning job, in this case 15, and how much parallelism to employ, in this case five jobs at a time. With these parameters, the tuning job is complete after three series of five jobs in parallel are run. For the default Bayesian Optimization tuning strategy, the results of previous groups of training jobs inform the tuning search, so it’s preferable to divide them into groups of parallel jobs instead of running all in parallel. There is a trade-off: using more parallel jobs finishes tuning sooner, but likely sacrifices tuning search accuracy.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
  'learning_rate': ContinuousParameter(0.001, 0.2, scaling_type="Logarithmic"),
  'epochs': IntegerParameter(10, 50),
  'batch_size': IntegerParameter(64, 256),
}

metric_definitions = [{'Name': 'loss',
                       'Regex': ' loss: ([0-9\\.]+)'},
                      {'Name': 'val_loss',
                       'Regex': ' val_loss: ([0-9\\.]+)'}]

objective_metric_name = 'val_loss'
objective_type = 'Minimize'

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=15,
                            max_parallel_jobs=5,
                            objective_type=objective_type)

tuning_job_name = "tf-2-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))
tuner.fit(inputs, job_name=tuning_job_name)
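After the tuning job finishes, you can examine its results before deploying the best model. The following minimal sketch assumes the tuning job has been allowed to run to completion:

# Block until the tuning job completes, then inspect the results.
tuner.wait()

# Name of the training job that achieved the best objective metric (lowest val_loss).
print(tuner.best_training_job())

# Results for all training jobs in the tuning job, as a pandas DataFrame.
results_df = tuner.analytics().dataframe()
print(results_df.sort_values('FinalObjectiveValue').head())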

Deployment and workflow automation with the AWS Step Functions Data Science SDK

A convenient option for deploying the best model from the tuning job is an Amazon SageMaker hosted endpoint, which serves real-time predictions (batch transform jobs are also available for asynchronous, offline predictions). The endpoint retrieves the TensorFlow SavedModel and deploys it to an Amazon SageMaker TensorFlow Serving container. You can accomplish this with one line of code by calling the HyperparameterTuner object’s deploy method:

tuning_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
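Invoking the hosted endpoint then works the same way as invoking the local mode endpoint shown earlier, for example (assuming the same x_test sample used previously):

# Get real-time predictions from the hosted endpoint for a sample of test data.
results = tuning_predictor.predict(x_test[:10])['predictions']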

However, although notebooks are great for prototyping, they aren’t typically used for deployment in a production environment. Instead, a workflow orchestrator is preferable for running a pipeline with multiple steps including training and deployment. For example, a simple pipeline in Amazon SageMaker consists of four steps:

  1. Training the model.
  2. Creating an Amazon SageMaker Model object that wraps the model artifact for serving.
  3. Creating an Amazon SageMaker endpoint configuration specifying how the model should be served (including instance type and number of instances).
  4. Deploying the trained model to the configured Amazon SageMaker endpoint.

The AWS Step Functions Data Science SDK automates the process of creating and running such pipelines using Amazon SageMaker and AWS Step Functions, a serverless workflow orchestration service. This SDK enables workflow creation using short, simple Python scripts that define workflow steps and chain them together. AWS Step Functions coordinates all the workflow steps without any need for you to manage the underlying infrastructure.

Although the AWS Step Functions Data Science SDK provides various primitives to build up complex pipelines from scratch, it also has prebuilt templates for common workflows, including a simple TrainingPipeline workflow for model training and deployment. The following code configures such a pipeline with just a few parameters, primarily the training estimator and input and output locations in Amazon S3:

import stepfunctions
from stepfunctions.template.pipeline import TrainingPipeline

workflow_execution_role = "<StepFunctions-execution-role-arn>"

pipeline = TrainingPipeline(
    estimator=estimator,
    role=workflow_execution_role,
    inputs=inputs,
    s3_bucket=bucket
)

After you define a pipeline, you can visualize it as a graph, instantiate it, and execute it as many times as needed; you can even run multiple workflow executions in parallel. While a workflow is running, you can check its progress either in the AWS Step Functions console or by calling the render_progress method of the execution object, which renders a diagram of the workflow execution showing its progress through each step, such as the training step.
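In code, these steps typically look like the following sketch, based on the usage shown in the Step Functions Data Science SDK examples (render_graph and create are called on the pipeline, while render_progress is called on the execution object that execute returns):

# Visualize the pipeline as a graph (renders inline in a Jupyter notebook).
pipeline.render_graph()

# Create the underlying AWS Step Functions state machine, then start an execution.
pipeline.create()
execution = pipeline.execute()

# Check on the progress of the running workflow; this also renders inline.
execution.render_progress()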

The AWS Step Functions Data Science SDK enables many other possible workflows for automating TensorFlow 2 and other machine learning projects. One example is a workflow to automate model retraining periodically. Such a workflow could include a test of model quality after training, with subsequent conditional branches for the cases of passing the quality test (model is deployed) or failing (no model deployment). Other possible workflow steps include automatic model tuning, ETL with AWS Glue, and more. For more information about retraining workflows, see Automating model retraining and deployment using the AWS Step Functions Data Science SDK for Amazon SageMaker.

Conclusion

This post discussed Amazon SageMaker features for data transformation, prototyping training and inference code, automatic model tuning, and hosted training and inference. Additionally, you learned how the AWS Step Functions Data Science SDK helps automate workflows after project prototyping is complete. All these features are central elements of projects involving TensorFlow 2 and other deep learning frameworks in Amazon SageMaker.

In addition to these features, many others may be applicable. For example, to handle common problems during model training such as vanishing or exploding gradients, Amazon SageMaker Debugger is useful. To manage common problems such as data drift for models deployed in production, you can apply Amazon SageMaker Model Monitor. For more information about the Amazon SageMaker workflow features covered in this post, see the related GitHub repo.


About the Author

Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.