TensorFlow SageMaker Estimators allow you to run your own TensorFlow training algorithms on SageMaker Learner, and to host your own TensorFlow models on SageMaker Hosting.
Documentation of the previous Legacy Mode versions: 1.4.1, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0
| WARNING |
|---|
| With TensorFlow version 1.11, we have added a new format for your TensorFlow training script. This new format, called Script Mode, gives your script more flexibility, as opposed to Legacy Mode, which is what we support with TensorFlow 1.11 and older versions. Script Mode also adds Python 3 support. Script Mode is available with TensorFlow version 1.11 and newer, and the last supported version of Legacy Mode will be TensorFlow 1.12. Make sure you refer to the correct version of this README when you prepare your script. You can find the Legacy Mode README here. |
Supported versions of TensorFlow for Elastic Inference: 1.11.0.
Training TensorFlow models using sagemaker.tensorflow.TensorFlow is a two-step process.
First, you prepare your training script, then second, you run it on
SageMaker Learner via the sagemaker.tensorflow.TensorFlow estimator.
Your TensorFlow training script must be a Python 2.7- or 3.6-compatible source file.
The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following:
- `SM_MODEL_DIR`: A string that represents the local path where the training job can write the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting. This is different from the `model_dir` argument passed to your training script, which is an S3 location. `SM_MODEL_DIR` is always set to `/opt/ml/model`.
- `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.
- `SM_OUTPUT_DATA_DIR`: A string that represents the path to the directory to write output artifacts to. Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. These artifacts are compressed and uploaded to an S3 bucket with the same prefix as the model artifacts.
- `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the TensorFlow estimator's `fit` call, named 'train' and 'test', the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
For the exhaustive list of available environment variables, see the SageMaker Containers documentation.
A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to SM_MODEL_DIR so that it can be deployed for inference later.
Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance.
For example, a training script might start with the following:
```python
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.1)

    # input data and model directories
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))

    args, _ = parser.parse_known_args()

    # ... load from args.train and args.test, train a model, write the model to args.model_dir.
```

Because SageMaker imports your training script, putting your training launching code in a main guard (`if __name__ == '__main__':`) is good practice.
Note that SageMaker doesn't support argparse actions.
If you want to use, for example, boolean hyperparameters, you need to specify type as bool in your script and provide an explicit True or False value for this hyperparameter when instantiating your TensorFlow estimator.
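One caveat when following this advice: plain `bool()` applied to a string treats any non-empty value (including `'False'`) as `True`, so training scripts commonly use a small string-to-bool converter as the argparse `type`. The `str2bool` helper and the `--use_augmentation` hyperparameter below are illustrative, not part of the SDK:

```python
import argparse

def str2bool(value):
    """Convert the string form of a boolean hyperparameter to a real bool."""
    text = str(value).lower()
    if text in ('true', '1'):
        return True
    if text in ('false', '0'):
        return False
    raise argparse.ArgumentTypeError('expected a boolean value, got %r' % value)

parser = argparse.ArgumentParser()
# hypothetical boolean hyperparameter, passed as an explicit 'True'/'False' value
parser.add_argument('--use_augmentation', type=str2bool, default=False)

args, _ = parser.parse_known_args(['--use_augmentation', 'False'])
print(args.use_augmentation)  # False (plain type=bool would give True here)
```

When instantiating the estimator, you would then pass the hyperparameter as an explicit `True` or `False` value, which reaches the script as the string the converter expects.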
If you have a TensorFlow training script that runs outside of SageMaker, follow these directions:

1. Make sure your script can handle `--model_dir` as an additional command-line argument. If you did not specify a location when constructing the TensorFlow estimator, an S3 location under the default training job bucket will be passed in here. Distributed training with parameter servers requires you to use the `tf.estimator.train_and_evaluate` API, and an S3 location is needed as the model directory during training. Here is an example:
```python
estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir=args.model_dir)
...
train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

2. Load input data from the input channels. The input channels are defined when `fit` is called. For example:
```python
estimator.fit({'train': 's3://my-bucket/my-training-data',
               'eval': 's3://my-bucket/my-evaluation-data'})
```

In your training script the channels will be stored in the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_EVAL`. You can add them to your argument parsing logic like this:
```python
parser = argparse.ArgumentParser()
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
parser.add_argument('--eval', type=str, default=os.environ.get('SM_CHANNEL_EVAL'))
```

3. Export your final model to the path stored in the environment variable `SM_MODEL_DIR`, which is always `/opt/ml/model`. At the end of training, SageMaker uploads the model files under `/opt/ml/model` to `output_path`.
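As a minimal sketch of that export convention (using a plain file and a temp-dir fallback instead of a real TensorFlow SavedModel export):

```python
import os
import tempfile

# Inside a SageMaker container SM_MODEL_DIR is /opt/ml/model; outside, fall
# back to a temporary directory so this sketch runs anywhere.
model_dir = os.environ.get('SM_MODEL_DIR') or tempfile.mkdtemp()

# A real script would export a TensorFlow SavedModel here (for example via an
# Estimator's export methods); a plain file stands in for the artifacts.
artifact_path = os.path.join(model_dir, 'model.txt')
with open(artifact_path, 'w') as f:
    f.write('trained-model-weights')

print(artifact_path)
```

Anything your script writes under this directory is what SageMaker packages and uploads when training ends.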
To use Script Mode, set at least one of these args:

- `py_version='py3'`
- `script_mode=True`
Please note that when using Script Mode, your training script needs to accept the following arg:

- `model_dir`
Please note that the following args are not permitted when using Script Mode:

- `checkpoint_path`
- `training_steps`
- `evaluation_steps`
- `requirements_file`
```python
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
                          train_instance_count=1, train_instance_type='ml.p2.xlarge',
                          framework_version='1.11', py_version='py3')
tf_estimator.fit('s3://bucket/path/to/training/data')
```

Where the S3 URL is a path to your training data within Amazon S3. The constructor keyword arguments define how SageMaker runs your training script, as discussed earlier.
You start your training script by calling fit on a TensorFlow estimator. fit takes
both required and optional arguments.
- `inputs`: The S3 location(s) of datasets to be used for training. This can take one of the following forms:
  - `str`: An S3 URI, for example `s3://my-bucket/my-training-data`, which indicates the dataset's location.
  - `dict[str, str]`: A dictionary mapping channel names to S3 locations, for example `{'train': 's3://my-bucket/my-training-data/train', 'test': 's3://my-bucket/my-training-data/test'}`.
  - `sagemaker.session.s3_input`: Channel configuration for S3 data sources that can provide additional information as well as the path to the training dataset. See the API docs for full details.
- `wait` (bool): Defaults to True; whether to block and wait for the training script to complete before returning. If set to False, `fit` returns immediately, and the training job can later be attached to.
- `logs` (bool): Defaults to True; whether to show logs produced by the training job in the Python session. Only meaningful when `wait` is True.
- `run_tensorboard_locally` (bool): Defaults to False. If set to True, a TensorBoard command will be printed out.
- `job_name` (str): Training job name. If not specified, the estimator generates a default job name based on the training image name and current timestamp.
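When you pass a channel dict to `fit`, each channel name is upper-cased into an `SM_CHANNEL_<NAME>` environment variable inside the training container. A small sketch of that naming convention (the helper function below is illustrative; the convention itself comes from SageMaker Containers):

```python
def channel_env_var(channel_name):
    """Map a fit() channel name to the environment variable set in the container."""
    return 'SM_CHANNEL_%s' % channel_name.upper()

inputs = {
    'train': 's3://my-bucket/my-training-data/train',
    'test': 's3://my-bucket/my-training-data/test',
}

for name in inputs:
    print(channel_env_var(name))  # SM_CHANNEL_TRAIN, then SM_CHANNEL_TEST
```

This is why the argument-parsing examples above default `--train` and `--test` to `os.environ.get('SM_CHANNEL_TRAIN')` and `os.environ.get('SM_CHANNEL_TEST')`.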
Calling `fit` starts a SageMaker training job. The training job does the following:

- Starts `train_instance_count` EC2 instances of the type `train_instance_type`.
- On each instance, performs the following steps:
  - starts a Docker container optimized for TensorFlow.
  - downloads the dataset.
  - sets up training-related environment variables.
  - sets up the distributed training environment if configured to use a parameter server.
  - starts asynchronous training.
If the `wait=False` flag is passed to `fit`, then it returns immediately. The training job continues running
asynchronously. At a later time, a TensorFlow estimator can be obtained by attaching to the existing training job. If
the training job is not finished, attaching shows the standard output of training and waits until it completes.
After attaching, the estimator can be deployed as usual.
```python
tf_estimator.fit(your_input_data, wait=False)
training_job_name = tf_estimator.latest_training_job.name

# after some time, or in a separate Python notebook, we can attach to it again.
tf_estimator = TensorFlow.attach(training_job_name=training_job_name)
```

The TensorFlow constructor takes both required and optional arguments.
Required:
- `entry_point` (str): Path (absolute or relative) to the Python file which should be executed as the entry point to training.
- `role` (str): An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it is accessing AWS resources.
- `train_instance_count` (int): Number of Amazon EC2 instances to use for training.
- `train_instance_type` (str): Type of EC2 instance to use for training, for example, 'ml.c4.xlarge'.
Optional:
- `source_dir` (str): Path (absolute or relative) to a directory with any other training source code dependencies, including the entry point file. Structure within this directory will be preserved when training on SageMaker.
- `dependencies` (list[str]): A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: `[]`). The library folders will be copied to SageMaker in the same folder where the entry point is copied. If `source_dir` points to S3, code will be uploaded and the S3 location will be used instead. For example, the following call

  ```python
  >>> TensorFlow(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env'])
  ```

  results in the following inside the container:

  ```
  >>> opt/ml/code
  >>> ├── train.py
  >>> ├── common
  >>> └── virtual-env
  ```

- `hyperparameters` (dict[str, ANY]): Hyperparameters that will be used for training. Will be made accessible as command-line arguments.
- `train_volume_size` (int): Size in GB of the EBS volume to use for storing input data during training. Must be large enough to store the training data.
- `train_max_run` (int): Timeout in seconds for training, after which Amazon SageMaker terminates the job regardless of its current status.
- `output_path` (str): S3 location where you want the training result (model artifacts and optional output files) saved. If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the `fit` method execution.
- `output_kms_key`: Optional KMS key ID to encrypt training output with.
- `base_job_name`: Name to assign for the training job that the `fit` method launches. If not specified, the estimator generates a default job name based on the training image name and current timestamp.
- `image_name`: An alternative Docker image to use for training and serving. If specified, the estimator will use this image for training and hosting, instead of selecting the appropriate SageMaker official image based on `framework_version` and `py_version`. Refer to SageMaker TensorFlow Docker Containers for details on what the official images support and where to find the source code to build your custom image.
- `script_mode` (bool): Whether to use Script Mode. Script Mode is the only available training mode in Python 3; setting `py_version` to `py3` automatically sets `script_mode` to True.
- `model_dir` (str): Location where model data, checkpoint data, and TensorBoard checkpoints should be saved during training. If not specified, an S3 location will be generated under the training job's default bucket, and `model_dir` will be passed to your training script as one of the command-line arguments.
- `distributions` (dict): Configure your distribution strategy with this argument. To launch a parameter server for distributed training, you must set `distributions` to `{'parameter_server': {'enabled': True}}`.
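The `hyperparameters` dict above is delivered to your entry point as command-line flags. Roughly, the mapping looks like the following simplified sketch (the real serialization lives in SageMaker Containers; this helper is illustrative only):

```python
def hyperparameters_to_args(hyperparameters):
    """Illustrative sketch: turn a hyperparameters dict into --key value pairs,
    the form in which a training script's argparse parser receives them."""
    args = []
    for key, value in sorted(hyperparameters.items()):
        args.extend(['--%s' % key, str(value)])
    return args

print(hyperparameters_to_args({'epochs': 10, 'learning_rate': 0.1}))
# ['--epochs', '10', '--learning_rate', '0.1']
```

This is why the training-script examples earlier retrieve hyperparameters with an `argparse.ArgumentParser` rather than reading them from a file.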
Amazon SageMaker allows users to create training jobs using Pipe input mode. With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means that your training jobs start sooner, finish quicker, and need less disk space.
SageMaker TensorFlow provides an implementation of tf.data.Dataset that makes it easy to take advantage of Pipe
input mode in SageMaker. You can replace your tf.data.Dataset with a sagemaker_tensorflow.PipeModeDataset to
read TFRecords as they are streamed to your training instances.
In your entry_point script, you can use PipeModeDataset like a Dataset. In this example, we create a
PipeModeDataset to read TFRecords from the 'training' channel:
```python
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

features = {
    'data': tf.FixedLenFeature([], tf.string),
    'labels': tf.FixedLenFeature([], tf.int64),
}

def parse(record):
    parsed = tf.parse_single_example(record, features)
    return ({
        'data': tf.decode_raw(parsed['data'], tf.float64)
    }, parsed['labels'])

def train_input_fn(training_dir, hyperparameters):
    ds = PipeModeDataset(channel='training', record_format='TFRecord')
    ds = ds.repeat(20)
    ds = ds.prefetch(10)
    ds = ds.map(parse, num_parallel_calls=10)
    ds = ds.batch(64)
    return ds
```

To run a training job with Pipe input mode, pass `input_mode='Pipe'` to your TensorFlow estimator:
```python
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='tf-train-with-pipemodedataset.py', role='SageMakerRole',
                          training_steps=10000, evaluation_steps=100,
                          train_instance_count=1, train_instance_type='ml.p2.xlarge',
                          framework_version='1.10.0', input_mode='Pipe')
tf_estimator.fit('s3://bucket/path/to/training/data')
```

If your TFRecords are compressed, you can train on Gzipped TFRecords by passing `compression='Gzip'` to the call to
`fit()`, and SageMaker will automatically unzip the records as data is streamed to your training instances:
```python
from sagemaker.session import s3_input

train_s3_input = s3_input('s3://bucket/path/to/training/data', compression='Gzip')
tf_estimator.fit(train_s3_input)
```

You can learn more about PipeModeDataset in the sagemaker-tensorflow-extensions repository: https://github.com/aws/sagemaker-tensorflow-extensions
SageMaker TensorFlow CPU images use TensorFlow built with Intel® MKL-DNN optimization.
In certain cases you might be able to get better performance by disabling this optimization (for example, when using small models).
You can disable MKL-DNN optimization for TensorFlow 1.8.0 and above by setting the two following environment variables:
```python
import os

os.environ['TF_DISABLE_MKL'] = '1'
os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1'
```

After a TensorFlow estimator has been fit, it saves a TensorFlow SavedModel in
the S3 location defined by output_path. You can call deploy on a TensorFlow
estimator to create a SageMaker Endpoint.
SageMaker provides two different options for deploying TensorFlow models to a SageMaker Endpoint:
The first option uses a Python-based server that allows you to specify your own custom input and output handling functions in a Python script. This is the default option.
See Deploying to Python-based Endpoints to learn how to use this option.
The second option uses a TensorFlow Serving-based server to provide a superset of the TensorFlow Serving REST API. This option does not require (or allow) a custom Python script.
See Deploying to TensorFlow Serving Endpoints to learn how to use this option.
The containers include the following Python packages:
| Dependencies | Script Mode | Legacy Mode |
|---|---|---|
| boto3 | Latest | Latest |
| botocore | Latest | Latest |
| CUDA (GPU image only) | 9.0 | 9.0 |
| numpy | Latest | Latest |
| Pillow | Latest | Latest |
| scipy | Latest | Latest |
| sklearn | Latest | Latest |
| h5py | Latest | Latest |
| pip | 18.1 | 18.1 |
| curl | Latest | Latest |
| tensorflow | 1.11.0 | 1.11.0 |
| tensorflow-serving-api | 1.11.0 | None |
| sagemaker-containers | >=2.3.5 | >=2.3.5 |
| sagemaker-tensorflow-container | 1.0 | 1.0 |
| Python | 2.7 or 3.6 | 2.7 |
Legacy Mode TensorFlow Docker images support Python 2.7. Script Mode TensorFlow Docker images support both Python 2.7 and Python 3.6. The Docker images extend Ubuntu 16.04.
You can select the version of TensorFlow by passing a `framework_version` keyword arg to the TensorFlow estimator constructor. Currently supported versions are listed in the table above. You can also set `framework_version` to specify only the major and minor version, e.g. '1.6', which causes your training script to run on the latest supported patch version of that minor version, which in this example would be 1.6.0.
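As an illustration of that patch-version resolution, here is a simplified sketch (the supported-versions list and the resolver function are hypothetical, not the SDK's actual logic):

```python
# Hypothetical supported-versions list, for illustration only.
SUPPORTED_VERSIONS = ['1.6.0', '1.7.0', '1.8.0', '1.11.0']

def resolve_framework_version(requested):
    """Expand a 'major.minor' version to the latest supported patch version."""
    if requested in SUPPORTED_VERSIONS:
        return requested
    matches = [v for v in SUPPORTED_VERSIONS if v.startswith(requested + '.')]
    if not matches:
        raise ValueError('unsupported framework_version: %s' % requested)
    # pick the highest patch number
    return max(matches, key=lambda v: tuple(int(p) for p in v.split('.')))

print(resolve_framework_version('1.6'))  # '1.6.0' with this list
```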
Alternatively, you can build your own image by following the instructions in the SageMaker TensorFlow containers
repository, and passing image_name to the TensorFlow Estimator constructor.
For more information on the contents of the images, see the SageMaker TensorFlow containers repository here: https://github.com/aws/sagemaker-tensorflow-containers/