This is still under development; any of the following instructions may change significantly in the future.
- You need to have a Kubernetes cluster configured to use GPUs. Currently tested with Kubernetes configured using the feature gate `Accelerators`. If you are using the device plugin, change `lcm.version` in `values.yaml` to `device-plugin` and redeploy FfDL.
- You need to have FfDL running on your cluster.
- Currently, TensorFlow and Caffe are tested with GPUs.
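
For the device plugin case mentioned in the prerequisites, the override might look like the fragment below. This is only a sketch: the nesting is inferred from the dotted key `lcm.version`, and every other key in `values.yaml` is omitted and should be left as it is.

```yaml
# values.yaml (fragment; nesting assumed from the dotted key lcm.version)
lcm:
  version: device-plugin   # use the device plugin instead of the Accelerators feature gate
```

FfDL has to be redeployed afterwards for the change to take effect.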
To run the TensorFlow job with GPU, go to the tf-model's manifest file and make the following changes:
- Change the framework version from `latest` to `latest-gpu` for the CUDA 9 driver or `1.3.0-gpu` for the CUDA 8 driver.
- Change the `gpus` section to be greater than 0, so the learner can get GPU resources to train the job.
The etc/examples/tf-model/gpu-manifest.yml is the example manifest file for running the TensorFlow example with GPU. Once you have made the above changes, you can follow the same testing instructions in the main README to run the sample TensorFlow job on GPU.
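
As a rough illustration, the two TensorFlow changes might look like the manifest fragment below. The field layout (a top-level `gpus` key and a `framework` block with `name` and `version`) is an assumption based on the wording above, and `gpus: 1` is just an example value; treat etc/examples/tf-model/gpu-manifest.yml as the authoritative version.

```yaml
# Fragment of a TensorFlow manifest (assumed layout; other fields unchanged)
gpus: 1                    # any value greater than 0 so the learner gets a GPU

framework:
  name: tensorflow
  version: "latest-gpu"    # CUDA 9 driver; use "1.3.0-gpu" for the CUDA 8 driver
```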
To run the Caffe job with GPU, go to the caffe-model's manifest file and make the following changes:
- Change the framework version from `cpu` to `gpu`.
- Change the `gpus` section to be greater than 0, so the learner can get GPU resources to train the job.
- Add the Caffe GPU flag in the `command` section (e.g. change the `command` from `caffe train -solver lenet_solver.prototxt` to `caffe train -gpu all -solver lenet_solver.prototxt`).
- Lastly, go to the `lenet_solver.prototxt` file and change `solver_mode` to GPU to enable Caffe to run on GPU.
The etc/examples/caffe-model/gpu-manifest.yml is the example manifest file for running the Caffe example with GPU. Once you have made the above changes, you can follow the same testing instructions in the main README to run the sample Caffe job on GPU.
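
The manifest side of the Caffe changes could be sketched as follows. As with any fragment of this kind, the layout (a top-level `gpus` key and a `framework` block holding `name`, `version` and `command`) is assumed from the description above, and `gpus: 1` is only an example value; etc/examples/caffe-model/gpu-manifest.yml remains the reference.

```yaml
# Fragment of a Caffe manifest (assumed layout; other fields unchanged)
gpus: 1                    # any value greater than 0 so the learner gets a GPU

framework:
  name: caffe
  version: "gpu"           # was "cpu"
  command: caffe train -gpu all -solver lenet_solver.prototxt
```

The `solver_mode` change lives in `lenet_solver.prototxt`, not in the manifest, so it is not shown here.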
You can go to the user guide to learn more about how to modify the model manifest file and run GPU jobs with your own settings. Note that to execute your job with GPU, the manifest file must select a framework version that supports GPU and set the `gpus` section to a value greater than 0.