diff --git a/.gitignore b/.gitignore index 0df9f2eb..ebd40607 100644 --- a/.gitignore +++ b/.gitignore @@ -1,6 +1,17 @@ - -aws-deepracer-workshops/ - -deepracer/ - +.vscode/ +custom_files/ +logs/ docker/volumes/ +recording/ +recording +/*.env +/*.bak +/*.tar +/*.json +DONE +data/ +tmp/ +autorun.s3url +nohup.out +/*.sh +_ diff --git a/LICENSE b/LICENSE new file mode 100644 index 00000000..39e20796 --- /dev/null +++ b/LICENSE @@ -0,0 +1,14 @@ +Copyright 2019-2023 AWS DeepRacer Community. All Rights Reserved. + +Permission is hereby granted, free of charge, to any person obtaining a copy of this +software and associated documentation files (the "Software"), to deal in the Software +without restriction, including without limitation the rights to use, copy, modify, +merge, publish, distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, +INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A +PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE +SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/README.md b/README.md index 2830428a..4d0acf2c 100644 --- a/README.md +++ b/README.md @@ -1,235 +1,70 @@ -# DeepRacer-For-Dummies -Provides a quick and easy way to get up and running with a local deepracer training environment using Docker Compose. -This repo just creates a wrapper around the amazing work done by Chris found here: https://github.com/crr0004/deepracer -Please refer to his repo to understand more about what's going on under the covers. 
+# DeepRacer-For-Cloud Provides a quick and easy way to get up and running with a DeepRacer training environment using a cloud virtual machine or a local computer, such as [AWS EC2 Accelerated Computing instances](https://aws.amazon.com/ec2/instance-types/?nc1=h_ls#Accelerated_Computing) or the Azure [N-Series Virtual Machines](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu). -# Video Instructions -[![Video Instructions](https://img.youtube.com/vi/CFNcKmtVRSI/0.jpg)](https://www.youtube.com/watch?v=CFNcKmtVRSI) +DRfC runs on Ubuntu 22.04 and 24.04. GPU acceleration requires an NVIDIA GPU, preferably with more than 8GB of VRAM. -# Getting Started +## Introduction ---- -#### Prerequisites +DeepRacer-For-Cloud (DRfC) started as an extension of the work done by Alex (https://github.com/alexschultz/deepracer-for-dummies), which is in turn a wrapper around the amazing work done by Chris (https://github.com/crr0004/deepracer). With the introduction of the second generation DeepRacer Console the repository has been split up. This repository contains the scripts needed to *run* the training, but depends on Docker Hub to provide pre-built docker images. All the under-the-hood building capabilities are in the [Deepracer Simapp](https://github.com/aws-deepracer-community/deepracer-simapp) repository. -* This project is specifically built to run on Ubuntu 18.04 with an **Nvidia GPU**. It is assumed you already have **CUDA/CUDNN** installed and configured. +As of December 2025 the original DeepRacer service in the AWS console is no longer available; it has been replaced by [DeepRacer-on-AWS](https://aws.amazon.com/solutions/implementations/deepracer-on-aws/), which you can install in your own AWS environment. DeepRacer-For-Cloud is independent of any AWS service, so it is not directly impacted by this change. -* You also need to have **Docker** installed as well as the **Nvidia-Docker** runtime. 
+## Main Features -* You should have an AWS account with the **AWS cli** installed. The credentials should be located in your home directory (~/.aws/credentials) +DRfC supports a wide set of features to ensure that you can focus on creating the best model: +* User-friendly + * Based on the continuously updated community [Robomaker](https://github.com/aws-deepracer-community/deepracer-simapp) container, supporting a wide range of CPU and GPU setups. + * Wide set of scripts (`dr-*`) enables effortless training. + * Detection of your AWS DeepRacer Console models; allows upload of a locally trained model to any of them. +* Modes + * Time Trial + * Object Avoidance + * Head-to-Bot +* Training + * Multiple Robomaker instances per Sagemaker (N:1) to improve training progress. + * Multiple training sessions in parallel - each being (N:1) if hardware supports it - to experiment with different configurations at once. + * Connect multiple nodes together (Swarm-mode only) to combine the power of multiple computers/instances. +* Evaluation + * Evaluate independently from training. + * Save evaluation run to MP4 file in S3. +* Logging + * Training metrics and trace files are stored to S3. + * Optional integration with AWS CloudWatch. + * Optional exposure of Robomaker internal log-files. +* Technology + * Supports both Docker Swarm (used for connecting multiple nodes together) and Docker Compose -* ensure you have **vncviewer** installed +## Tech Stack -#### NOTE: If you already have these prerequisites setup then you can simply run the init.sh script described in the **Initialization** section. If you are setting everything up for the first time, then the information provided here can help you to get your environment ready to use this repo. 
+DRfC is built on top of the [AWS DeepRacer Simapp](https://github.com/aws-deepracer-community/deepracer-simapp) — a single Docker image used for three purposes: +* **Robomaker** — one or more containers providing robotics simulation via ROS and Gazebo +* **Sagemaker** — container running the model training job +* **RL Coach** — container that bootstraps the Sagemaker container using the Sagemaker SDK and Sagemaker Local -#### Local Environment Setup +### Core Technologies -If you are running Windows and would like to use this repo, you will need to modify the process to get everything to run on Windows (not recommended as you will not be able to take advantage of the GPU during training) Many users have found it useful to dual-boot (Windows/Linux). There are many tutorials online for how to do this. You can follow the instructions provided below as guidance. +| Component | Version | +|-----------|---------| +| Ubuntu | 24.04 | +| Python | 3.12 | +| TensorFlow | 2.20 | +| CUDA | 12.6 (GPU only) | +| Redis | 8.0.4 | +| ROS | 2 Jazzy | +| Gazebo | Harmonic | -##### * Installing Ubuntu 18.04 with Windows 10 +### Images -https://medium.com/bigdatarepublic/dual-boot-windows-and-linux-aa281c3c01f9 +Pre-built images are available on [Docker Hub](https://hub.docker.com/repository/docker/awsdeepracercommunity/deepracer-simapp) as `awsdeepracercommunity/deepracer-simapp:-cpu` (CPU) and `awsdeepracercommunity/deepracer-simapp:-gpu` (CUDA GPU). Both support OpenGL acceleration. -When it gets to the Disk Management part, to make space for your Ubuntu installation, followed this guide and specifically look at the 2nd method (MiniTool Partition Wizard): +During installation DRfC will automatically pull the latest image based on whether you have a GPU or CPU installation. 
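For illustration, the full image reference combines the repository name, a release version, and the architecture suffix. A minimal sketch of how such a reference is composed; the version number below is purely hypothetical (the real value is pinned in `defaults/dependencies.json` and substituted by `init.sh`):

```shell
# Compose a simapp image reference (sketch; the version is a made-up example,
# init.sh fills in the real one from defaults/dependencies.json).
SIMAPP_SOURCE="awsdeepracercommunity/deepracer-simapp"
SIMAPP_VERSION="5.3.3"   # hypothetical version number
ARCH="cpu"               # "gpu" when an NVIDIA runtime is available
IMAGE="${SIMAPP_SOURCE}:${SIMAPP_VERSION}-${ARCH}"
echo "$IMAGE"            # prints awsdeepracercommunity/deepracer-simapp:5.3.3-cpu
```

Pulling the resulting reference manually would then be `docker pull "$IMAGE"`.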
-https://win10faq.com/shrink-partition-windows-10/?source=post_page--------------------------- +## Documentation -======= -##### * Installing the AWS CLI +Full documentation can be found on the [Deepracer-for-Cloud GitHub Pages](https://aws-deepracer-community.github.io/deepracer-for-cloud). - pip install -U awscli - -Then Follow this: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html +## Support -##### * Installing Docker-ce (steps from https://docs.docker.com/install/linux/docker-ce/ubuntu/ ) - - sudo apt-get remove docker docker-engine docker.io containerd runc - sudo apt-get update - - sudo apt-get install \ - apt-transport-https \ - ca-certificates \ - curl \ - gnupg-agent \ - software-properties-common - - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - - sudo apt-key fingerprint 0EBFCD88 - - sudo add-apt-repository \ - "deb [arch=amd64] https://download.docker.com/linux/ubuntu \ - $(lsb_release -cs) \ - stable" - - sudo apt-get update - sudo apt-get install docker-ce docker-ce-cli containerd.io - -Verify docker works - - sudo docker run hello-world - -##### 3. Installing Docker-compose (from https://docs.docker.com/compose/install/#install-compose ) - - curl -L https://github.com/docker/compose/releases/download/1.24.1/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose - sudo chmod +x /usr/local/bin/docker-compose - -Verify installation - - docker-compose --version - -###### NOTE: You can also choose to install docker-compose via another package manager (i.e. pip or conda), but if you do, make sure to do so in a virtual env. Many OS’s have python system packages that conflict with docker-compose dependencies. ###### - -Additionally, make sure your user-id can run docker without sudo (from https://docs.docker.com/install/linux/linux-postinstall/ ) - - sudo groupadd docker - sudo usermod -aG docker $USER - -Log out and log back in so that your group membership is re-evaluated. 
- -And configure Docker to start on boot. - - sudo systemctl enable docker - -##### * Preparing for nvidia-docker - -The NVIDIA Container Toolkit allows users to build and run GPU accelerated Docker containers. -Nvidia-docker essentially exposes the GPU to the containers to use: https://github.com/NVIDIA/nvidia-docker - -You may want to note what you have installed currently. - - sudo apt list --installed | grep nvidia - -Then prepare for clean installation of Nvidia drivers. - - sudo apt-get purge nvidia* - -##### Installing nvidia-docker runtime (from https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-2.0) ) - - distribution=$(. /etc/os-release;echo $ID$VERSION_ID) - curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - - curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list - sudo apt-get update - sudo apt-get install nvidia-docker2 - sudo pkill -SIGHUP dockerd - -##### * Installing the proper nvidia drivers - -Check for driver version here according to your GPU(s): https://www.nvidia.com/Download/index.aspx?lang=en-us -In the dropdown for OS, choose “show all OS’s” to see if there are Ubuntu specific choices. Otherwise choose Linux. -If you get a dropdown for “cuda toolkit”, choose 10.0) - - sudo add-apt-repository ppa:graphics-drivers - sudo apt-get update - sudo apt install nvidia-driver-410 && sudo reboot - -###### NOTE: 410 is a driver version that is compatible with the GPU I selected on the Nvidia website. 
###### - -Verify the driver installation: - - nvidia-smi - nvcc --version - -##### * Installing VNC viewer on your local machine - -This doc is straight forward: https://www.techspot.com/downloads/5760-vnc-viewer.html - -##### * Installing the Nvidia deep learning libraries (CUDA/CUDNN) for GPU hardware: - -This guide goes through how to install CUDA & CUDNN : https://medium.com/@zhanwenchen/install-cuda-and-cudnn-for-tensorflow-gpu-on-ubuntu-79306e4ac04e - -###### NOTE: You can apparently use Anaconda instead to install CUDA/CUDNN. I have not tried this, however some users have and have reported that this method is much easier. If you use this approach, you will need to first install Anaconda. Once installed you can then use the conda package manager to install the desired versions of CUDA and cuDNN. The following installation configuration has been reported to work together successfully ###### - -##### Downloading Anaconda - - sudo apt-get update -y && sudo apt-get upgrade -y - cd /tmp/ - sudo wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh - -##### Installing Anaconda - - bash Anaconda3-2019.03-Linux-x86_64.sh - "yes" for using the default directory location - “yes” for running conda init - -##### Activating Anaconda - - source ~/.bashrc - -##### Verifying the conda package manager works - - conda list - -##### Installing CUDA/CUDNN - - conda install cudnn==7.3.1 && conda install -c fragcolor cuda10.0 - - -#### Initialization (After all prerequisites have been installed) - - -##### 11. Run Init.sh from this repo (refer to the rest of this doc for script details) - -In a command prompt, simply run "./init.sh". -This will set everything up so you can run the deepracer local training environment. - - -**init.sh** performs these steps so you don't have to do them manually: -1. Clones Chris's repo: https://github.com/crr0004/deepracer.git -2. Does a mkdir -p ~/.sagemaker && cp config.yaml ~/.sagemaker -3. 
Sets the image name in rl_deepracer_coach_robomaker.py to "crr0004/sagemaker-rl-tensorflow:nvidia” -4. Also sets the instance_type in rl_deepracer_coach_robomaker.py to “local_gpu” -5. Copies the reward.py and model-metadata files into your Minio bucket - - -To start or stop the local deepracer training, use the scripts found in the scripts directory. - -Here is a brief overview of the available scripts: - -#### Scripts - -* training - * start.sh - * starts the whole environment using docker compose - * it will also open a terminal window where you can monitor the log output from the sagemaker training directory - * it will also automatically open vncviewer so you can watch the training happening in Gazebo - * stop.sh - * stops the whole environment - * automatically finds and stops the training container which was started from the sagemaker container - * upload-snapshot.sh - * uploads a specific snapshot to S3 in AWS. If no checkpoint is provided, it attempts to retrieve the latest snapshot - * set-last-run-to-pretrained.sh - * renames the last training run directory from ***rl-deepracer-sagemaker*** to ***rl-deepracer-pretrained*** so that you can use it as a starting point for a new training run. - * delete-last-run.sh - * (WARNING: this script deletes files on your system. I take no responsibility for any resulting actions by running this script. Please look at what the script is doing before running it so that you understand) - * deletes the last training run including all of the snapshots and log files. You will need sudo to run this command. 
- - -* evaluation - * start.sh - * starts the whole environment using docker compose to run an evaluation run - * it will also open a terminal window where you can monitor the log output from the sagemaker training directory - * it will also automatically open vncviewer so you can watch the training happening in Gazebo - * stop.sh - * stops the whole environment - * automatically finds and stops the training container which was started from the sagemaker container - -* log-analysis - * start.sh - * starts a container with Nvidia-Docker running jupyter labs with the log analysis notebooks which were originally provided by AWS and then extended by Tomasz Ptak - * the logs from robomaker are automatically mounted in the container so you don't have to move any files around - * in order to get to the container, look at the log output from when it starts. You need to grab the URL including the token query parameter and then paste it into the brower at **localhost:8888**. - * stop.sh - * stops the log-analysis container - - -#### Hyperparameters - -You can modify training hyperparameters from the file **rl_deepracer_coach_robomaker.py**. - -#### Action Space & Reward Function - -The action-space and reward function files are located in the **deepracer-for-dummies/docker/volumes/minio/bucket/custom_files** directory - -#### Track Selection - -The track selection is controled via an environment variable in the **.env** file located in the **deepracer-for-dummies/docker** directory \ No newline at end of file +* For general support it is suggested to join the [AWS DeepRacer Community](https://deepracing.io/). The Community Slack has a channel #dr-training-local where the community provides active support. +* Create a GitHub issue if you find an actual code issue, or where updates to documentation would be required. 
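DRfC is configured through plain `KEY=VALUE` files (`run.env` and `system.env`). As a standalone sketch of how such a file can be loaded into the shell; this is a simplified variant of the `dr-update-env` logic in `bin/activate.sh` (which uses `grep` and `cut`), and the file path and variable values are throwaway examples:

```shell
# Export non-comment KEY=VALUE lines from an env file (simplified sketch).
parse_env_file() {
  while IFS='=' read -r key val; do
    # skip blank lines and comment lines
    case "$key" in ''|'#'*) continue ;; esac
    export "$key=$val"
  done < "$1"
}

# Throwaway demo file, mimicking a minimal run.env
printf 'DR_RUN_ID=2\n# a comment\nDR_WORLD_NAME=reinvent_base\n' > /tmp/demo-run.env
parse_env_file /tmp/demo-run.env
echo "$DR_RUN_ID $DR_WORLD_NAME"   # prints: 2 reinvent_base
```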
diff --git a/bin/activate.sh b/bin/activate.sh new file mode 100644 index 00000000..3a719c4b --- /dev/null +++ b/bin/activate.sh @@ -0,0 +1,262 @@ +#!/bin/bash + +verlte() { + [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ] +} + +function dr-update-env { + + if [[ -f "$DIR/system.env" ]]; then + LINES=$(grep -v '^#' $DIR/system.env) + for l in $LINES; do + env_var=$(echo $l | cut -f1 -d\=) + env_val=$(echo $l | cut -f2 -d\=) + eval "export $env_var=$env_val" + done + else + echo "File system.env does not exist." + return 1 + fi + + if [[ -f "$DR_CONFIG" ]]; then + LINES=$(grep -v '^#' $DR_CONFIG) + for l in $LINES; do + env_var=$(echo $l | cut -f1 -d\=) + env_val=$(echo $l | cut -f2 -d\=) + eval "export $env_var=$env_val" + done + else + echo "File run.env does not exist." + return 1 + fi + + if [[ -z "${DR_RUN_ID}" ]]; then + export DR_RUN_ID=0 + fi + + if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + export DR_ROBOMAKER_TRAIN_PORT=$(expr 8080 + $DR_RUN_ID) + export DR_ROBOMAKER_EVAL_PORT=$(expr 8180 + $DR_RUN_ID) + export DR_ROBOMAKER_GUI_PORT=$(expr 5900 + $DR_RUN_ID) + else + export DR_ROBOMAKER_TRAIN_PORT="8080-8089" + export DR_ROBOMAKER_EVAL_PORT="8080-8089" + export DR_ROBOMAKER_GUI_PORT="5901-5920" + fi + + # Setting the default region to ensure that things work also in the + # non default regions. + export AWS_DEFAULT_REGION=${DR_AWS_APP_REGION} + +} + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)" +DIR="$(dirname $SCRIPT_DIR)" +export DR_DIR=$DIR + +if [[ -f "$1" ]]; then + export DR_CONFIG=$(readlink -f $1) + dr-update-env +elif [[ -f "$DIR/run.env" ]]; then + export DR_CONFIG="$DIR/run.env" + dr-update-env +else + echo "No configuration file." + return 1 +fi + +# Check if Docker runs -- if not, then start it. 
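In Swarm mode, `dr-update-env` above gives each concurrent run its own port block by adding `DR_RUN_ID` to a fixed base port. A standalone illustration of that arithmetic (same bases as in the script):

```shell
# Per-run port derivation as done in dr-update-env (swarm mode):
# base port + DR_RUN_ID keeps parallel runs from colliding.
DR_RUN_ID=3
DR_ROBOMAKER_TRAIN_PORT=$(expr 8080 + $DR_RUN_ID)
DR_ROBOMAKER_EVAL_PORT=$(expr 8180 + $DR_RUN_ID)
DR_ROBOMAKER_GUI_PORT=$(expr 5900 + $DR_RUN_ID)
echo "$DR_ROBOMAKER_TRAIN_PORT $DR_ROBOMAKER_EVAL_PORT $DR_ROBOMAKER_GUI_PORT"   # prints: 8083 8183 5903
```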
+if [[ "$(type service 2>/dev/null)" ]]; then + service docker status >/dev/null || sudo service docker start +fi + +## Check if WSL2 +if grep -qi Microsoft /proc/version && grep -q "WSL2" /proc/version; then + IS_WSL2="yes" +fi + +# Check if we will use Docker Swarm or Docker Compose +# If not defined then use Swarm +if [[ -z "${DR_DOCKER_STYLE}" ]]; then + export DR_DOCKER_STYLE="swarm" +fi + +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + export DR_DOCKER_FILE_SEP="-c" + SWARM_NODE=$(docker node inspect self | jq .[0].ID -r) + SWARM_NODE_UPDATE=$(docker node update --label-add Sagemaker=true $SWARM_NODE) +else + export DR_DOCKER_FILE_SEP="-f" +fi + +# Check if CUDA_VISIBLE_DEVICES is configured. +if [[ -n "${CUDA_VISIBLE_DEVICES}" ]]; then + echo "WARNING: You have CUDA_VISIBLE_DEVICES defined. This will no longer work as" + echo " expected. To control GPU assignment use DR_ROBOMAKER_CUDA_DEVICES" + echo " and DR_SAGEMAKER_CUDA_DEVICES and rlcoach v5.0.1 or later." +fi + +# Check if DR_MINIO_IMAGE is configured. +if [ "${DR_CLOUD,,}" == "local" ] && [ -z "${DR_MINIO_IMAGE}" ]; then + echo "WARNING: You have not configured DR_MINIO_IMAGE in system.env." 
+ echo " System will default to tag RELEASE.2022-10-24T18-35-07Z" + export DR_MINIO_IMAGE="RELEASE.2022-10-24T18-35-07Z" +fi + +# Prepare the docker compose files depending on parameters +if [[ "${DR_CLOUD,,}" == "azure" ]]; then + export DR_LOCAL_S3_ENDPOINT_URL="http://localhost:9000" + export DR_MINIO_URL="http://minio:9000" + DR_LOCAL_PROFILE_ENDPOINT_URL="--profile $DR_LOCAL_S3_PROFILE --endpoint-url $DR_LOCAL_S3_ENDPOINT_URL" + DR_TRAIN_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-training.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-endpoint.yml" + DR_EVAL_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-eval.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-endpoint.yml" + DR_MINIO_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-local.yml" +elif [[ "${DR_CLOUD,,}" == "local" ]]; then + export DR_LOCAL_S3_ENDPOINT_URL="http://localhost:9000" + export DR_MINIO_URL="http://minio:9000" + DR_LOCAL_PROFILE_ENDPOINT_URL="--profile $DR_LOCAL_S3_PROFILE --endpoint-url $DR_LOCAL_S3_ENDPOINT_URL" + DR_TRAIN_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-training.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-endpoint.yml" + DR_EVAL_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-eval.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-endpoint.yml" + DR_MINIO_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-local.yml" +elif [[ "${DR_CLOUD,,}" == "remote" ]]; then + export DR_LOCAL_S3_ENDPOINT_URL="$DR_REMOTE_MINIO_URL" + export DR_MINIO_URL="$DR_REMOTE_MINIO_URL" + DR_LOCAL_PROFILE_ENDPOINT_URL="--profile $DR_LOCAL_S3_PROFILE --endpoint-url $DR_LOCAL_S3_ENDPOINT_URL" + DR_TRAIN_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-training.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-endpoint.yml" + DR_EVAL_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-eval.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-endpoint.yml" + DR_MINIO_COMPOSE_FILE="" 
+elif [[ "${DR_CLOUD,,}" == "aws" ]]; then + DR_LOCAL_PROFILE_ENDPOINT_URL="" + DR_TRAIN_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-training.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-aws.yml" + DR_EVAL_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-eval.yml $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-aws.yml" +else + DR_LOCAL_PROFILE_ENDPOINT_URL="" + DR_TRAIN_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-training.yml" + DR_EVAL_COMPOSE_FILE="$DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-eval.yml" +fi + +# Add host X support for Linux and WSL2 +if [[ "${DR_HOST_X,,}" == "true" ]]; then + if [[ "$IS_WSL2" == "yes" ]]; then + + # Check if package x11-server-utils is installed + if ! command -v xset &> /dev/null; then + echo "WARNING: Package x11-server-utils is not installed. Please install it to enable X11 support." + fi + + if [[ "${DR_DOCKER_STYLE,,}" == "swarm" && "${DR_USE_GUI,,}" == "true" ]]; then + echo "WARNING: Cannot use GUI in Swarm mode. Please switch to Compose mode." 
+ fi + + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-local-xorg-wsl.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-local-xorg-wsl.yml" + else + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-local-xorg.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-local-xorg.yml" + fi +fi + +# Choose the rendering engine - OGRE2 if we have a GPU, otherwise OGRE +if [[ -z "${DR_GAZEBO_RENDER_ENGINE}" ]]; then + if [[ -f "/etc/docker/daemon.json" ]] && jq -e '.runtimes.nvidia // (."default-runtime" == "nvidia")' /etc/docker/daemon.json &>/dev/null; then + export DR_GAZEBO_RENDER_ENGINE="ogre2" + fi +fi + +# Prevent docker swarm services from restarting +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-training-swarm.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-eval-swarm.yml" +fi + +# Enable logs in CloudWatch +if [[ "${DR_CLOUD_WATCH_ENABLE,,}" == "true" ]]; then + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-cwlog.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-cwlog.yml" +fi + +# Enable local simapp mount +if [[ -d "${DR_ROBOMAKER_MOUNT_SIMAPP_DIR,,}" ]]; then + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-simapp.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-simapp.yml" +fi + +# Enable local scripts mount +if [[ -d "${DR_ROBOMAKER_MOUNT_SCRIPTS_DIR,,}" ]]; then + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-robomaker-scripts.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP 
$DIR/docker/docker-compose-robomaker-scripts.yml" +fi + +## Check if we have an AWS IAM assumed role, or if we need to set specific credentials. +if [ "${DR_CLOUD,,}" == "aws" ] && [ $(aws --output json sts get-caller-identity 2>/dev/null | jq '.Arn' | awk /assumed-role/ | wc -l) -gt 0 ]; then + export DR_LOCAL_S3_AUTH_MODE="role" +else + export DR_LOCAL_ACCESS_KEY_ID=$(aws --profile $DR_LOCAL_S3_PROFILE configure get aws_access_key_id | xargs) + export DR_LOCAL_SECRET_ACCESS_KEY=$(aws --profile $DR_LOCAL_S3_PROFILE configure get aws_secret_access_key | xargs) + DR_TRAIN_COMPOSE_FILE="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-keys.yml" + DR_EVAL_COMPOSE_FILE="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DIR/docker/docker-compose-keys.yml" + export DR_UPLOAD_PROFILE="--profile $DR_UPLOAD_S3_PROFILE" + export DR_LOCAL_S3_AUTH_MODE="profile" +fi + +export DR_TRAIN_COMPOSE_FILE +export DR_EVAL_COMPOSE_FILE +export DR_LOCAL_PROFILE_ENDPOINT_URL + +if [[ -n "${DR_MINIO_COMPOSE_FILE}" ]]; then + export MINIO_UID=$(id -u) + export MINIO_USERNAME=$(id -u -n) + export MINIO_GID=$(id -g) + export MINIO_GROUPNAME=$(id -g -n) + if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + + if [ "$DR_DOCKER_MAJOR_VERSION" -gt 24 ]; then + DETACH_FLAG="--detach=true" + fi + + docker stack deploy $DR_MINIO_COMPOSE_FILE $DETACH_FLAG s3 + else + docker compose $DR_MINIO_COMPOSE_FILE -p s3 up -d + fi + +fi + +## Version check +if [[ -z "$DR_SIMAPP_SOURCE" || -z "$DR_SIMAPP_VERSION" ]]; then + DEFAULT_SIMAPP_VERSION=$(jq -r '.containers.simapp | select (.!=null)' $DIR/defaults/dependencies.json) + echo "ERROR: Variable DR_SIMAPP_SOURCE or DR_SIMAPP_VERSION not defined." + echo "" + echo "As of version 5.3 the variables DR_SIMAPP_SOURCE and DR_SIMAPP_VERSION are required in system.env." + echo "To continue to use the separate Sagemaker, Robomaker and RL Coach images, run 'git checkout legacy'." 
+ echo "" + echo "Please add the following lines to your system.env file:" + echo "DR_SIMAPP_SOURCE=awsdeepracercommunity/deepracer-simapp" + echo "DR_SIMAPP_VERSION=${DEFAULT_SIMAPP_VERSION}-gpu" + return +fi + +DEPENDENCY_VERSION=$(jq -r '.master_version | select (.!=null)' $DIR/defaults/dependencies.json) + +SIMAPP_VER=$(docker inspect ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION} 2>/dev/null | jq -r .[].Config.Labels.version) +if [ -z "$SIMAPP_VER" ]; then SIMAPP_VER=$SIMAPP_VERSION; fi +if ! verlte $DEPENDENCY_VERSION $SIMAPP_VER; then + echo "WARNING: Incompatible version of Deepracer Simapp. Expected >$DEPENDENCY_VERSION. Got $SIMAPP_VER." +fi + +# Get Docker version +DOCKER_VERSION=$(docker --version | grep -oP '\d+\.\d+\.\d+' | head -1) +DR_DOCKER_MAJOR_VERSION=$(echo $DOCKER_VERSION | cut -d. -f1) +export DR_DOCKER_MAJOR_VERSION + +## Create a dr-local-aws command +alias dr-local-aws='aws $DR_LOCAL_PROFILE_ENDPOINT_URL' + +source $SCRIPT_DIR/scripts_wrapper.sh + +function dr-update { + dr-update-env +} + +function dr-reload { + source $DIR/bin/activate.sh $DR_CONFIG +} diff --git a/bin/autorun.sh b/bin/autorun.sh new file mode 100644 index 00000000..bcb2752c --- /dev/null +++ b/bin/autorun.sh @@ -0,0 +1,32 @@ +#!/usr/bin/env bash + +## this is the default autorun script +## file should run automatically after init.sh completes. +## this script downloads your configured run.env, system.env and any custom container requests + +INSTALL_DIR_TEMP="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." 
>/dev/null 2>&1 && pwd)" + +## retrieve the s3_location name you sent the instance in user data launch +## assumed to be the first line of the file +S3_LOCATION=$(awk 'NR==1 {print; exit}' $INSTALL_DIR_TEMP/autorun.s3url) + +source $INSTALL_DIR_TEMP/bin/activate.sh + +## get the updated run.env and system.env files and any others you stashed in s3 +aws s3 sync s3://$S3_LOCATION $INSTALL_DIR_TEMP + +## get the right docker containers, if needed +SYSENV="$INSTALL_DIR_TEMP/system.env" +SAGEMAKER_IMAGE=$(cat $SYSENV | grep DR_SAGEMAKER_IMAGE | sed 's/.*=//') +ROBOMAKER_IMAGE=$(cat $SYSENV | grep DR_ROBOMAKER_IMAGE | sed 's/.*=//') + +docker pull awsdeepracercommunity/deepracer-sagemaker:$SAGEMAKER_IMAGE +docker pull awsdeepracercommunity/deepracer-robomaker:$ROBOMAKER_IMAGE + +dr-reload + +date | tee $INSTALL_DIR_TEMP/DONE-AUTORUN + +## start training +cd $INSTALL_DIR_TEMP/scripts/training +./start.sh diff --git a/bin/detect.sh b/bin/detect.sh new file mode 100755 index 00000000..270e1284 --- /dev/null +++ b/bin/detect.sh @@ -0,0 +1,18 @@ +#!/usr/bin/env bash + +## What am I? +if [[ -f /var/run/cloud-init/instance-data.json ]]; then + # We have a cloud-init environment (Azure or AWS). + CLOUD_NAME=$(jq -r '.v1."cloud-name"' /var/run/cloud-init/instance-data.json) + if [[ "${CLOUD_NAME}" == "azure" ]]; then + export CLOUD_NAME + export CLOUD_INSTANCETYPE=$(jq -r '.ds."meta_data".imds.compute."vmSize"' /var/run/cloud-init/instance-data.json) + elif [[ "${CLOUD_NAME}" == "aws" ]]; then + export CLOUD_NAME + export CLOUD_INSTANCETYPE=$(jq -r '.ds."meta-data"."instance-type"' /var/run/cloud-init/instance-data.json) + else + export CLOUD_NAME=local + fi +else + export CLOUD_NAME=local +fi diff --git a/bin/init.sh b/bin/init.sh new file mode 100755 index 00000000..14fec1c1 --- /dev/null +++ b/bin/init.sh @@ -0,0 +1,249 @@ +#!/usr/bin/env bash + +trap ctrl_c INT + +function ctrl_c() { + echo "Requested to stop." 
+ exit 1 +} + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)" +INSTALL_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." >/dev/null 2>&1 && pwd)" + +if [[ "$INSTALL_DIR" == *\ * ]]; then + echo "Deepracer-for-Cloud cannot be installed in a path with spaces. Exiting." + exit 1 +fi + +OPT_ARCH="gpu" +OPT_CLOUD="" +OPT_STYLE="swarm" + +while getopts ":m:c:a:s:" opt; do + case $opt in + a) + OPT_ARCH="$OPTARG" + ;; + m) + OPT_MOUNT="$OPTARG" + ;; + c) + OPT_CLOUD="$OPTARG" + ;; + s) + OPT_STYLE="$OPTARG" + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + exit 1 + ;; + esac +done + +if [[ -z "$OPT_CLOUD" ]]; then + source $SCRIPT_DIR/detect.sh + OPT_CLOUD=$CLOUD_NAME + echo "Detected cloud type to be $CLOUD_NAME" +fi + +# Find CPU Level +CPU_LEVEL="cpu" + +if [[ -f /proc/cpuinfo ]] && [[ "$(cat /proc/cpuinfo | grep avx2 | wc -l)" > 0 ]]; then + CPU_LEVEL="cpu" +elif [[ "$(type sysctl 2>/dev/null)" ]] && [[ "$(sysctl -n hw.optional.avx2_0)" == 1 ]]; then + CPU_LEVEL="cpu" +fi + +# Check if Intel (to ensure MKL) +if [[ -f /proc/cpuinfo ]] && [[ "$(cat /proc/cpuinfo | grep GenuineIntel | wc -l)" > 0 ]]; then + CPU_INTEL="true" +elif [[ "$(type sysctl 2>/dev/null)" ]] && [[ "$(sysctl -n machdep.cpu.vendor)" == "GenuineIntel" ]]; then + CPU_INTEL="true" +fi + +# Check GPU +if [ "$OPT_ARCH" = "gpu" ]; then + if GPUS="$(docker run --rm --gpus all --pull=missing \ nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 \ bash -lc 'nvidia-smi -L | wc -l')" ; then + + if [ "${GPUS:-0}" -ge 1 ]; then + echo "Detected ${GPUS} GPU(s) inside docker." + else + echo "No GPU detected in docker. Using CPU" + OPT_ARCH="cpu" + fi + else + echo "Failed to run GPU test container. 
Using CPU" + OPT_ARCH="cpu" + fi +fi + +cd $INSTALL_DIR + +# create directory structure for docker volumes +mkdir -p $INSTALL_DIR/data $INSTALL_DIR/data/minio $INSTALL_DIR/data/minio/bucket +mkdir -p $INSTALL_DIR/data/logs $INSTALL_DIR/data/analysis $INSTALL_DIR/data/scripts $INSTALL_DIR/tmp +sudo mkdir -p /tmp/sagemaker +sudo chmod -R g+w /tmp/sagemaker + +# create symlink to current user's home .aws directory +# NOTE: AWS cli must be installed for this to work +# https://docs.aws.amazon.com/cli/latest/userguide/install-linux-al2017.html +mkdir -p $(eval echo "~${USER}")/.aws $INSTALL_DIR/docker/volumes/ +ln -sf $(eval echo "~${USER}")/.aws $INSTALL_DIR/docker/volumes/ + +# copy reward functions +mkdir -p $INSTALL_DIR/custom_files +cp $INSTALL_DIR/defaults/hyperparameters.json $INSTALL_DIR/custom_files/ +cp $INSTALL_DIR/defaults/model_metadata.json $INSTALL_DIR/custom_files/ +cp $INSTALL_DIR/defaults/reward_function.py $INSTALL_DIR/custom_files/ + +cp $INSTALL_DIR/defaults/template-system.env $INSTALL_DIR/system.env +cp $INSTALL_DIR/defaults/template-run.env $INSTALL_DIR/run.env +if [[ "${OPT_CLOUD}" == "aws" ]]; then + IMDS_TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") + AWS_EC2_AVAIL_ZONE=$(curl -s -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone) + AWS_REGION="$(echo $AWS_EC2_AVAIL_ZONE | sed 's/[a-z]$//')" + sed -i "s//not-defined/g" $INSTALL_DIR/system.env + sed -i "s//default/g" $INSTALL_DIR/system.env +elif [[ "${OPT_CLOUD}" == "remote" ]]; then + AWS_REGION="us-east-1" + sed -i "s//minio/g" $INSTALL_DIR/system.env + sed -i "s//not-defined/g" $INSTALL_DIR/system.env + echo "Please run 'aws configure --profile minio' to set the credentials" + echo "Please define DR_REMOTE_MINIO_URL in system.env to point to the remote minio instance." 
+else
+    AWS_REGION="us-east-1"
+    MINIO_PROFILE="minio"
+    sed -i "s//$MINIO_PROFILE/g" $INSTALL_DIR/system.env
+    sed -i "s//not-defined/g" $INSTALL_DIR/system.env
+
+    aws configure --profile $MINIO_PROFILE get aws_access_key_id >/dev/null 2>/dev/null
+
+    if [[ "$?" -ne 0 ]]; then
+        echo "Creating default minio credentials in AWS profile '$MINIO_PROFILE'"
+        aws configure --profile $MINIO_PROFILE set aws_access_key_id $(openssl rand -base64 12)
+        aws configure --profile $MINIO_PROFILE set aws_secret_access_key $(openssl rand -base64 12)
+        aws configure --profile $MINIO_PROFILE set region us-east-1
+    fi
+fi
+sed -i "s//to-be-defined/g" $INSTALL_DIR/system.env
+sed -i "s//$OPT_CLOUD/g" $INSTALL_DIR/system.env
+sed -i "s//$AWS_REGION/g" $INSTALL_DIR/system.env
+
+if [[ "${OPT_ARCH}" == "gpu" ]]; then
+    SAGEMAKER_TAG="gpu"
+elif [[ -n "${CPU_INTEL}" ]]; then
+    SAGEMAKER_TAG="cpu"
+else
+    SAGEMAKER_TAG="cpu"
+fi
+
+# set proxies if required
+for arg in "$@"; do
+    IFS='=' read -ra part <<<"$arg"
+    if [ "${part[0]}" == "--http_proxy" ] || [ "${part[0]}" == "--https_proxy" ] || [ "${part[0]}" == "--no_proxy" ]; then
+        var=${part[0]:2}=${part[1]}
+        args="${args} --build-arg ${var}"
+    fi
+done
+
+# Download docker images. Change to build statements if locally built images are desired.
+SIMAPP_VERSION=$(jq -r '.containers.simapp | select (.!=null)' $INSTALL_DIR/defaults/dependencies.json)
+sed -i "s//$SIMAPP_VERSION-$SAGEMAKER_TAG/g" $INSTALL_DIR/system.env
+docker pull awsdeepracercommunity/deepracer-simapp:$SIMAPP_VERSION-$SAGEMAKER_TAG
+
+# create the network sagemaker-local if it doesn't exist
+SAGEMAKER_NW='sagemaker-local'
+
+if [[ "${OPT_STYLE}" == "swarm" ]]; then
+
+    docker node ls >/dev/null 2>/dev/null
+    if [ $? -eq 0 ]; then
+        echo "Swarm exists. Exiting."
+        exit 1
+    fi
+
+    docker swarm init
+    if [ $? 
-ne 0 ]; then
+
+        DEFAULT_IFACE=$(ip route | grep default | awk '{print $5}')
+        DEFAULT_IP=$(ip addr show $DEFAULT_IFACE | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
+
+        if [ -z "$DEFAULT_IP" ]; then
+            echo "Could not determine default IP address. Exiting."
+            exit 1
+        fi
+
+        echo "Error when creating swarm, trying again with advertise address $DEFAULT_IP."
+        docker swarm init --advertise-addr $DEFAULT_IP
+        if [ $? -ne 0 ]; then
+            echo "Could not create swarm. Exiting."
+            exit 1
+        fi
+    fi
+
+    SWARM_NODE=$(docker node inspect self | jq .[0].ID -r)
+    docker node update --label-add Sagemaker=true $SWARM_NODE >/dev/null 2>/dev/null
+    docker node update --label-add Robomaker=true $SWARM_NODE >/dev/null 2>/dev/null
+    docker network ls | grep -q $SAGEMAKER_NW
+    if [ $? -ne 0 ]; then
+        docker network create $SAGEMAKER_NW -d overlay --attachable --scope swarm
+    else
+        docker network rm $SAGEMAKER_NW
+        docker network create $SAGEMAKER_NW -d overlay --attachable --scope swarm --subnet=192.168.2.0/24
+    fi
+
+elif [[ "${OPT_STYLE}" == "compose" ]]; then
+
+    docker network ls | grep -q $SAGEMAKER_NW
+    if [ $? -ne 0 ]; then
+        docker network create $SAGEMAKER_NW
+    fi
+
+else
+    echo "Unknown docker style ${OPT_STYLE}. Exiting."
+    exit 1
+fi
+sed -i "s//${OPT_STYLE}/g" $INSTALL_DIR/system.env
+
+# ensure our variables are set on startup - not for local setup.
+if [[ "${OPT_CLOUD}" != "local" ]]; then
+    NUM_IN_PROFILE=$(cat $HOME/.profile | grep "$INSTALL_DIR/bin/activate.sh" | wc -l)
+    if [ "$NUM_IN_PROFILE" -eq 0 ]; then
+        echo "source $INSTALL_DIR/bin/activate.sh" >>$HOME/.profile
+    fi
+fi
+
+# mark as done
+date | tee $INSTALL_DIR/DONE
+
+## Optional autorun feature
+# if using automation scripts to auto configure and run
+# you must pass s3_training_location.txt to this instance in order for this to work
+if [[ -f "$INSTALL_DIR/autorun.s3url" ]]; then
+    ## read in first line. 
first line always assumed to be the training location regardless of what else is in the file
+    TRAINING_LOC=$(awk 'NR==1 {print; exit}' $INSTALL_DIR/autorun.s3url)
+
+    # get bucket name
+    TRAINING_BUCKET=${TRAINING_LOC%%/*}
+    # get prefix. minor exception handling in case there is no prefix and a root bucket is passed
+    if [[ "$TRAINING_LOC" == *"/"* ]]; then
+        TRAINING_PREFIX=${TRAINING_LOC#*/}
+    else
+        TRAINING_PREFIX=""
+    fi
+
+    ## check if custom autorun script exists in s3 training bucket. If not, use default in this repo
+    aws s3api head-object --bucket $TRAINING_BUCKET --key $TRAINING_PREFIX/autorun.sh || not_exist=true
+    if [ -n "$not_exist" ]; then
+        echo "custom file does not exist, using local copy"
+    else
+        echo "custom script does exist, using it"
+        aws s3 cp s3://$TRAINING_LOC/autorun.sh $INSTALL_DIR/bin/autorun.sh
+    fi
+    chmod +x $INSTALL_DIR/bin/autorun.sh
+    bash -c "source $INSTALL_DIR/bin/autorun.sh"
+fi
diff --git a/bin/prepare.sh b/bin/prepare.sh
new file mode 100755
index 00000000..0c6ec8fb
--- /dev/null
+++ b/bin/prepare.sh
@@ -0,0 +1,155 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+trap ctrl_c INT
+
+function ctrl_c() {
+    echo "Requested to stop."
+    exit 1
+}
+
+export DEBIAN_FRONTEND=noninteractive
+DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
+
+# Only allow supported Ubuntu versions
+. /etc/os-release
+SUPPORTED_VERSIONS=("22.04" "24.04" "24.10" "25.04" "25.10")
+DISTRIBUTION=${ID}${VERSION_ID//./}
+UBUNTU_MAJOR_VERSION=$(echo $VERSION_ID | cut -d. -f1)
+UBUNTU_MINOR_VERSION=$(echo $VERSION_ID | cut -d. -f2)
+if [[ "$ID" == "ubuntu" ]]; then
+    VERSION_OK=false
+    for V in "${SUPPORTED_VERSIONS[@]}"; do
+        if [[ "$VERSION_ID" == "$V" ]]; then
+            VERSION_OK=true
+            break
+        fi
+    done
+    if [[ "$VERSION_OK" != true ]]; then
+        echo "ERROR: Ubuntu $VERSION_ID is not a supported version. 
Supported versions: ${SUPPORTED_VERSIONS[*]}" + exit 1 + fi +fi + +## Check if WSL2 +IS_WSL2="" +if grep -qi Microsoft /proc/version && grep -q "WSL2" /proc/version; then + IS_WSL2="yes" +fi + +# Remove needrestart in all Ubuntu 2x.04/2x.10+ (future-proof) +if [[ "${ID}" == "ubuntu" && ${UBUNTU_MAJOR_VERSION} -ge 22 && -z "${IS_WSL2}" ]]; then + sudo apt remove -y needrestart || true +fi + +## Patch system +sudo apt update && sudo apt-mark hold grub-pc && sudo apt -y -o \ + DPkg::options::="--force-confdef" -o DPkg::options::="--force-confold" -qq upgrade + +## Install required packages +sudo apt install --no-install-recommends -y jq python3-boto3 screen git curl + +## Install AWS CLI +if [[ "${ID}" == "ubuntu" && ( ${UBUNTU_MAJOR_VERSION} -eq 22 ) ]]; then + sudo apt install -y awscli +else + if command -v snap >/dev/null 2>&1; then + sudo snap install aws-cli --classic + else + echo "WARNING: snap not available, AWS CLI not installed" + fi +fi + +## Detect cloud +source $DIR/detect.sh +echo "Detected cloud type ${CLOUD_NAME}" + +## Do I have a GPU +GPUS=0 +if [[ -z "${IS_WSL2}" ]]; then + GPUS=$(lspci | awk '/NVIDIA/ && ( /VGA/ || /3D controller/ ) ' | wc -l) +else + if [[ -f /usr/lib/wsl/lib/nvidia-smi ]]; then + GPUS=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) + fi +fi +if [ $? -ne 0 ] || [ $GPUS -eq 0 ]; then + ARCH="cpu" + echo "No NVIDIA GPU detected. Will not install drivers." +else + ARCH="gpu" +fi + +## Adding Nvidia Drivers +if [[ "${ARCH}" == "gpu" && -z "${IS_WSL2}" ]]; then + DRIVER_OK=false + # Find all installed nvidia-driver-XXX packages (status 'ii'), extract version, and check if >= 525 + for PKG in $(dpkg -l | awk '$1 == "ii" && /nvidia-driver-[0-9]+/ {print $2}'); do + DRIVER_VER=$(echo "${PKG}" | sed -E 's/nvidia-driver-([0-9]+).*/\1/') + if [[ ${DRIVER_VER} -ge 560 ]]; then + echo "NVIDIA driver ${DRIVER_VER} already installed." 
+ DRIVER_OK=true + break + fi + done + if [[ "${DRIVER_OK}" != true ]]; then + # Try to install the highest available driver >= 560 + HIGHEST_DRIVER=$(apt-cache search --names-only '^nvidia-driver-[0-9]+$' | awk '{print $1}' | grep -oE '[0-9]+$' | awk '$1 >= 560' | sort -nr | head -n1) + if [[ -n "${HIGHEST_DRIVER}" ]]; then + sudo apt install -y "nvidia-driver-${HIGHEST_DRIVER}" --no-install-recommends -o Dpkg::Options::="--force-overwrite" + elif apt-cache show nvidia-driver-560-server &>/dev/null; then + sudo apt install -y nvidia-driver-560-server --no-install-recommends -o Dpkg::Options::="--force-overwrite" + else + echo "No supported NVIDIA driver >= 560 found for this Ubuntu version." + exit 1 + fi + fi +fi + +## Installing Docker +sudo apt install -y --no-install-recommends docker.io docker-buildx docker-compose-v2 + +## Install Nvidia Docker Container +if [[ "${ARCH}" == "gpu" ]]; then + curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && + curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | + sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | + sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list + + sudo apt update && sudo apt install -y --no-install-recommends nvidia-docker2 nvidia-container-runtime + if [ -f "/etc/docker/daemon.json" ]; then + echo "Altering /etc/docker/daemon.json with default-runtime nvidia." + cat /etc/docker/daemon.json | jq 'del(."default-runtime") + {"default-runtime": "nvidia"}' | sudo tee /etc/docker/daemon.json + else + echo "Creating /etc/docker/daemon.json with default-runtime nvidia." 
+ sudo cp "${DIR}/../defaults/docker-daemon.json" /etc/docker/daemon.json + fi +fi + +## Enable and start docker +if [[ -n "${IS_WSL2}" ]]; then + sudo service docker restart +else + sudo systemctl enable docker + sudo systemctl restart docker +fi + +## Ensure user can run docker +sudo usermod -a -G docker "$(id -un)" + +## Reboot to load driver -- continue install if in cloud-init +CLOUD_INIT=$(pstree -s $BASHPID | awk /cloud-init/ | wc -l) + +if [[ "${CLOUD_INIT}" -ne 0 ]]; then + echo "Rebooting in 5 seconds. Will continue with install." + cd "${DIR}" + ./runonce.sh "./init.sh -c ${CLOUD_NAME} -a ${ARCH}" + sleep 5s + sudo shutdown -r +1 +elif [[ -n "${IS_WSL2}" || "${ARCH}" == "cpu" ]]; then + echo "First stage done. Log out, then log back in and run init.sh -c ${CLOUD_NAME} -a ${ARCH}" + echo "Note: You may need to log out and back in for docker group membership to take effect." +else + echo "First stage done. Please reboot and run init.sh -c ${CLOUD_NAME} -a ${ARCH}" + echo "Note: Reboot is required for NVIDIA drivers and docker group membership to take effect." +fi diff --git a/bin/runonce.sh b/bin/runonce.sh new file mode 100755 index 00000000..167a2df6 --- /dev/null +++ b/bin/runonce.sh @@ -0,0 +1,49 @@ +#!/bin/bash + +if [[ $# -eq 0 ]]; then + echo "Schedules a command to be run after the next reboot." + echo "Usage: $(basename $0) " + echo " $(basename $0) -p " + echo " $(basename $0) -r " +else + REMOVE=0 + COMMAND=${!#} + SCRIPTPATH=$PATH + + while getopts ":r:p:" optionName; do + case "$optionName" in + r) + REMOVE=1 + COMMAND=$OPTARG + ;; + p) SCRIPTPATH=$OPTARG ;; + esac + done + + SCRIPT="${HOME}/.$(basename $0)_$(echo $COMMAND | sed 's/[^a-zA-Z0-9_]/_/g')" + + if [[ ! 
-f $SCRIPT ]]; then + echo "PATH=$SCRIPTPATH" >>$SCRIPT + echo "cd $(pwd)" >>$SCRIPT + echo "logger -t $(basename $0) -p local3.info \"COMMAND=$COMMAND ; USER=\$(whoami) ($(logname)) ; PWD=$(pwd) ; PATH=\$PATH\"" >>$SCRIPT + echo "$COMMAND | logger -t $(basename $0) -p local3.info" >>$SCRIPT + echo "$0 -r \"$(echo $COMMAND | sed 's/\"/\\\"/g')\"" >>$SCRIPT + chmod +x $SCRIPT + fi + + CRONTAB="${HOME}/.$(basename $0)_temp_crontab_$RANDOM" + ENTRY="@reboot $SCRIPT" + + echo "$(crontab -l 2>/dev/null)" | grep -v "$ENTRY" | grep -v "^# DO NOT EDIT THIS FILE - edit the master and reinstall.$" | grep -v "^# ([^ ]* installed on [^)]*)$" | grep -v "^# (Cron version [^$]*\$[^$]*\$)$" >$CRONTAB + + if [[ $REMOVE -eq 0 ]]; then + echo "$ENTRY" >>$CRONTAB + fi + + crontab $CRONTAB + rm $CRONTAB + + if [[ $REMOVE -ne 0 ]]; then + rm $SCRIPT + fi +fi diff --git a/bin/scripts_wrapper.sh b/bin/scripts_wrapper.sh new file mode 100644 index 00000000..b888d5d8 --- /dev/null +++ b/bin/scripts_wrapper.sh @@ -0,0 +1,331 @@ +#!/bin/bash + +function dr-upload-custom-files { + eval CUSTOM_TARGET=$(echo s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_CUSTOM_FILES_PREFIX/) + echo "Uploading files to $CUSTOM_TARGET" + aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 sync $DR_DIR/custom_files/ $CUSTOM_TARGET +} + +function dr-upload-model { + dr-update-env && ${DR_DIR}/scripts/upload/upload-model.sh "$@" +} + +function dr-download-model { + dr-update-env && ${DR_DIR}/scripts/upload/download-model.sh "$@" +} + +function dr-upload-car-zip { + dr-update-env && ${DR_DIR}/scripts/upload/upload-car.sh "$@" +} + +function dr-list-aws-models { + echo "Due to changes in AWS DeepRacer Console this command is no longer available." +} + +function dr-set-upload-model { + echo "Due to changes in AWS DeepRacer Console this command is no longer available." 
+} + +function dr-increment-upload-model { + dr-update-env && ${DR_DIR}/scripts/upload/increment.sh "$@" && dr-update-env +} + +function dr-download-custom-files { + eval CUSTOM_TARGET=$(echo s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_CUSTOM_FILES_PREFIX/) + echo "Downloading files from $CUSTOM_TARGET" + aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 sync $CUSTOM_TARGET $DR_DIR/custom_files/ +} + +function dr-start-training { + dr-update-env + $DR_DIR/scripts/training/start.sh "$@" +} + +function dr-increment-training { + dr-update-env && ${DR_DIR}/scripts/training/increment.sh "$@" && dr-update-env +} + +function dr-stop-training { + bash -c "cd $DR_DIR/scripts/training && ./stop.sh" +} + +function dr-start-evaluation { + dr-update-env + $DR_DIR/scripts/evaluation/start.sh "$@" +} + +function dr-stop-evaluation { + bash -c "cd $DR_DIR/scripts/evaluation && ./stop.sh" +} + +function dr-start-tournament { + echo "Tournaments are no longer supported. Use Head-to-Model evaluation instead." +} + +function dr-start-loganalysis { + bash -c "cd $DR_DIR/scripts/log-analysis && ./start.sh" +} + +function dr-stop-loganalysis { + eval LOG_ANALYSIS_ID=$(docker ps | awk ' /deepracer-analysis/ { print $1 }') + if [ -n "$LOG_ANALYSIS_ID" ]; then + bash -c "cd $DR_DIR/scripts/log-analysis && ./stop.sh" + else + echo "Log-analysis is not running." + fi + +} + +function dr-logs-sagemaker { + + local OPTIND + OPT_TIME="--since 5m" + + while getopts ":w:a" opt; do + case $opt in + w) + OPT_WAIT=$OPTARG + ;; + a) + OPT_TIME="" + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + ;; + esac + done + + SAGEMAKER_CONTAINER=$(dr-find-sagemaker) + + if [[ -z "$SAGEMAKER_CONTAINER" ]]; then + if [[ -n "$OPT_WAIT" ]]; then + WAIT_TIME=$OPT_WAIT + echo "Waiting up to $WAIT_TIME seconds for Sagemaker to start up..." + until [ -n "$SAGEMAKER_CONTAINER" ]; do + sleep 1 + ((WAIT_TIME--)) + if [ "$WAIT_TIME" -lt 1 ]; then + echo "Sagemaker is not running." 
+ return 1 + fi + SAGEMAKER_CONTAINER=$(dr-find-sagemaker) + done + else + echo "Sagemaker is not running." + return 1 + fi + fi + + if [[ "$TERM_PROGRAM" == "vscode" ]]; then + echo "VS Code terminal detected. Displaying Sagemaker logs inline." + docker logs $OPT_TIME -f $SAGEMAKER_CONTAINER + elif [[ "${DR_HOST_X,,}" == "true" && -n "$DISPLAY" ]]; then + if [ -x "$(command -v gnome-terminal)" ]; then + gnome-terminal --tab --title "DR-${DR_RUN_ID}: Sagemaker - ${SAGEMAKER_CONTAINER}" -- /usr/bin/bash -c "docker logs $OPT_TIME -f ${SAGEMAKER_CONTAINER}" 2>/dev/null + echo "Sagemaker container $SAGEMAKER_CONTAINER logs opened in separate gnome-terminal. " + elif [ -x "$(command -v x-terminal-emulator)" ]; then + x-terminal-emulator -e /bin/sh -c "docker logs $OPT_TIME -f ${SAGEMAKER_CONTAINER}" 2>/dev/null + echo "Sagemaker container $SAGEMAKER_CONTAINER logs opened in separate terminal. " + else + echo 'Could not find a terminal emulator. Displaying inline.' + docker logs $OPT_TIME -f $SAGEMAKER_CONTAINER + fi + else + docker logs $OPT_TIME -f $SAGEMAKER_CONTAINER + fi + +} + +function dr-find-sagemaker { + + STACK_NAME="deepracer-$DR_RUN_ID" + RUN_NAME=${DR_LOCAL_S3_MODEL_PREFIX} + + SAGEMAKER_CONTAINERS=$(docker ps | awk ' /simapp/ { print $1 } ' | xargs) + + if [[ -n "$SAGEMAKER_CONTAINERS" ]]; then + for CONTAINER in $SAGEMAKER_CONTAINERS; do + CONTAINER_NAME=$(docker ps --format '{{.Names}}' --filter id=$CONTAINER) + CONTAINER_PREFIX=$(echo $CONTAINER_NAME | perl -n -e'/(.*)-(algo-(.)-(.*))/; print $1') + COMPOSE_SERVICE_NAME=$(echo $CONTAINER_NAME | perl -n -e'/(.*)-(algo-(.)-(.*))/; print $2') + + if [[ -n "$COMPOSE_SERVICE_NAME" ]]; then + COMPOSE_FILES=$(sudo find /tmp/sagemaker -name docker-compose.yaml -exec grep -l "$COMPOSE_SERVICE_NAME" {} +) + for COMPOSE_FILE in $COMPOSE_FILES; do + if sudo grep -q "RUN_ID=${DR_RUN_ID}" $COMPOSE_FILE && sudo grep -q "${RUN_NAME}" $COMPOSE_FILE; then + echo $CONTAINER + fi + done + fi + done + fi + +} + +function 
dr-logs-robomaker { + + OPT_REPLICA=1 + OPT_EVAL="" + local OPTIND + OPT_TIME="--since 5m" + + while getopts ":w:n:ea" opt; do + case $opt in + w) + OPT_WAIT=$OPTARG + ;; + n) + OPT_REPLICA=$OPTARG + ;; + e) + OPT_EVAL="-e" + ;; + a) + OPT_TIME="" + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + ;; + esac + done + + ROBOMAKER_CONTAINER=$(dr-find-robomaker -n ${OPT_REPLICA} ${OPT_EVAL}) + + if [[ -z "$ROBOMAKER_CONTAINER" ]]; then + if [[ -n "$OPT_WAIT" ]]; then + WAIT_TIME=$OPT_WAIT + echo "Waiting up to $WAIT_TIME seconds for Robomaker #${OPT_REPLICA} to start up..." + until [ -n "$ROBOMAKER_CONTAINER" ]; do + sleep 1 + ((WAIT_TIME--)) + if [ "$WAIT_TIME" -lt 1 ]; then + echo "Robomaker #${OPT_REPLICA} is not running." + return 1 + fi + ROBOMAKER_CONTAINER=$(dr-find-robomaker -n ${OPT_REPLICA} ${OPT_EVAL}) + done + else + echo "Robomaker #${OPT_REPLICA} is not running." + return 1 + fi + fi + + if [[ "$TERM_PROGRAM" == "vscode" ]]; then + echo "VS Code terminal detected. Displaying Robomaker #${OPT_REPLICA} logs inline." + docker logs $OPT_TIME -f $ROBOMAKER_CONTAINER + elif [[ "${DR_HOST_X,,}" == "true" && -n "$DISPLAY" ]]; then + if [ -x "$(command -v gnome-terminal)" ]; then + gnome-terminal --tab --title "DR-${DR_RUN_ID}: Robomaker #${OPT_REPLICA} - ${ROBOMAKER_CONTAINER}" -- /usr/bin/bash -c "docker logs $OPT_TIME -f ${ROBOMAKER_CONTAINER}" 2>/dev/null + echo "Robomaker #${OPT_REPLICA} ($ROBOMAKER_CONTAINER) logs opened in separate gnome-terminal. " + elif [ -x "$(command -v x-terminal-emulator)" ]; then + x-terminal-emulator -e /bin/sh -c "docker logs $OPT_TIME -f ${ROBOMAKER_CONTAINER}" 2>/dev/null + echo "Robomaker #${OPT_REPLICA} ($ROBOMAKER_CONTAINER) logs opened in separate terminal. " + else + echo 'Could not find a terminal emulator. Displaying inline.' 
+ docker logs $OPT_TIME -f $ROBOMAKER_CONTAINER + fi + else + docker logs $OPT_TIME -f $ROBOMAKER_CONTAINER + fi + +} + +function dr-find-robomaker { + + local OPTIND + + OPT_PREFIX="deepracer" + + while getopts ":n:e" opt; do + case $opt in + n) + OPT_REPLICA=$OPTARG + ;; + e) + OPT_PREFIX="-eval" + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + ;; + esac + done + + if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + eval ROBOMAKER_ID=$(docker ps | grep "${OPT_PREFIX}-${DR_RUN_ID}_robomaker.${OPT_REPLICA}" | cut -f1 -d\ | head -1) + else + eval ROBOMAKER_ID=$(docker ps | grep "${OPT_PREFIX}-${DR_RUN_ID}-robomaker-${OPT_REPLICA}" | cut -f1 -d\ | head -1) + fi + + if [ -n "$ROBOMAKER_ID" ]; then + echo $ROBOMAKER_ID + fi +} + +function dr-get-robomaker-stats { + + local OPTIND + OPT_REPLICA=1 + + while getopts ":n:" opt; do + case $opt in + n) + OPT_REPLICA=$OPTARG + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + ;; + esac + done + + eval ROBOMAKER_ID=$(dr-find-robomaker -n $OPT_REPLICA) + if [ -n "$ROBOMAKER_ID" ]; then + echo "Showing statistics for Robomaker #$OPT_REPLICA - container $ROBOMAKER_ID" + docker exec -ti $ROBOMAKER_ID bash -c "gz stats" + else + echo "Robomaker #$OPT_REPLICA is not running." + fi +} + +function dr-logs-loganalysis { + eval LOG_ANALYSIS_ID=$(docker ps | awk ' /deepracer-analysis/ { print $1 }') + if [ -n "$LOG_ANALYSIS_ID" ]; then + docker logs -f $LOG_ANALYSIS_ID + else + echo "Log-analysis is not running." + fi + +} + +function dr-url-loganalysis { + eval LOG_ANALYSIS_ID=$(docker ps | awk ' /deepracer-analysis/ { print $1 }') + if [ -n "$LOG_ANALYSIS_ID" ]; then + docker exec "$LOG_ANALYSIS_ID" bash -c "jupyter server list" + else + echo "Log-analysis is not running." 
+ fi +} + +function dr-view-stream { + ${DR_DIR}/utils/start-local-browser.sh "$@" +} + +function dr-start-viewer { + $DR_DIR/scripts/viewer/start.sh "$@" +} + +function dr-stop-viewer { + $DR_DIR/scripts/viewer/stop.sh "$@" +} + +function dr-update-viewer { + $DR_DIR/scripts/viewer/stop.sh "$@" + $DR_DIR/scripts/viewer/start.sh "$@" +} + +function dr-start-metrics { + $DR_DIR/scripts/metrics/start.sh "$@" +} + +function dr-stop-metrics { + $DR_DIR/scripts/metrics/stop.sh "$@" +} \ No newline at end of file diff --git a/defaults/debug-reward_function.py b/defaults/debug-reward_function.py new file mode 100644 index 00000000..b4d82925 --- /dev/null +++ b/defaults/debug-reward_function.py @@ -0,0 +1,59 @@ +import math +import numpy +import time + +class Reward: + + ''' + Debugging reward function to be used to track performance of local training. + Will print out the Real-Time-Factor (RTF), as well as how many + steps-per-second (sim-time) that the system is able to deliver. + ''' + + def __init__(self, verbose=False, track_time=False): + self.verbose = verbose + self.track_time = track_time + + if track_time: + TIME_WINDOW=10 + self.time = numpy.zeros([TIME_WINDOW, 2]) + + if verbose: + print("Initializing Reward Class") + + def get_time(self): + + wall_time_incr = numpy.max(self.time[:,0]) - numpy.min(self.time[:,0]) + sim_time_incr = numpy.max(self.time[:,1]) - numpy.min(self.time[:,1]) + + rtf = sim_time_incr / wall_time_incr + fps = (self.time.shape[0] - 1) / sim_time_incr + + return rtf, fps + + def record_time(self, steps, sim_time=0.0): + + index = int(steps) % self.time.shape[0] + self.time[index,0] = time.time() + self.time[index,1] = sim_time + + def reward_function(self, params): + + # Read input parameters + steps = params["steps"] + + if self.track_time: + self.record_time(steps, sim_time=params.get("sim_time", 0.0)) + + if self.track_time: + if steps >= self.time.shape[0]: + rtf, fps = self.get_time() + print("TIME: s: {}, rtf: {}, 
fps:{}".format(int(steps), round(rtf, 2), round(fps, 2) )) + + return 1.0 + + +reward_object = Reward(verbose=False, track_time=True) + +def reward_function(params): + return reward_object.reward_function(params) diff --git a/defaults/dependencies.json b/defaults/dependencies.json new file mode 100644 index 00000000..90e714cd --- /dev/null +++ b/defaults/dependencies.json @@ -0,0 +1,6 @@ +{ + "master_version": "6.0", + "containers": { + "simapp": "6.0.2" + } +} diff --git a/defaults/docker-daemon.json b/defaults/docker-daemon.json new file mode 100644 index 00000000..c0fc2e4b --- /dev/null +++ b/defaults/docker-daemon.json @@ -0,0 +1,9 @@ +{ + "runtimes": { + "nvidia": { + "path": "nvidia-container-runtime", + "runtimeArgs": [] + } + }, + "default-runtime": "nvidia" +} \ No newline at end of file diff --git a/defaults/hyperparameters.json b/defaults/hyperparameters.json new file mode 100644 index 00000000..6f2730be --- /dev/null +++ b/defaults/hyperparameters.json @@ -0,0 +1,16 @@ +{ + "batch_size": 64, + "beta_entropy": 0.01, + "discount_factor": 0.99, + "e_greedy_value": 0.05, + "epsilon_steps": 10000, + "exploration_type": "categorical", + "loss_type": "huber", + "lr": 0.0003, + "num_episodes_between_training": 20, + "num_epochs": 5, + "stack_size": 1, + "term_cond_avg_score": 350.0, + "term_cond_max_episodes": 1000, + "sac_alpha": 0.2 + } \ No newline at end of file diff --git a/defaults/model_metadata.json b/defaults/model_metadata.json new file mode 100644 index 00000000..3d028c48 --- /dev/null +++ b/defaults/model_metadata.json @@ -0,0 +1,29 @@ +{ + "action_space": [ + { + "steering_angle": -30, + "speed": 0.6 + }, + { + "steering_angle": -15, + "speed": 0.6 + }, + { + "steering_angle": 0, + "speed": 0.6 + }, + { + "steering_angle": 15, + "speed": 0.6 + }, + { + "steering_angle": 30, + "speed": 0.6 + } + ], + "sensor": ["FRONT_FACING_CAMERA"], + "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW", + "training_algorithm": "clipped_ppo", + 
"action_space_type": "discrete",
+    "version": "5"
+}
diff --git a/defaults/model_metadata_cont.json b/defaults/model_metadata_cont.json
new file mode 100644
index 00000000..aa20d314
--- /dev/null
+++ b/defaults/model_metadata_cont.json
@@ -0,0 +1,19 @@
+{
+    "action_space": {
+        "speed": {
+            "high": 2,
+            "low": 1
+        },
+        "steering_angle": {
+            "high": 30,
+            "low": -30
+        }
+    },
+    "sensor": [
+        "FRONT_FACING_CAMERA"
+    ],
+    "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
+    "training_algorithm": "clipped_ppo",
+    "action_space_type": "continuous",
+    "version": "5"
+}
\ No newline at end of file
diff --git a/defaults/model_metadata_sac.json b/defaults/model_metadata_sac.json
new file mode 100644
index 00000000..07c8ac04
--- /dev/null
+++ b/defaults/model_metadata_sac.json
@@ -0,0 +1,8 @@
+{
+    "action_space": {"speed": {"high": 2, "low": 1}, "steering_angle": {"high": 30, "low": -30}},
+    "sensor": ["FRONT_FACING_CAMERA"],
+    "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
+    "training_algorithm": "sac",
+    "action_space_type": "continuous",
+    "version": "4"
+}
diff --git a/defaults/reward_function.py b/defaults/reward_function.py
new file mode 100644
index 00000000..3a022ade
--- /dev/null
+++ b/defaults/reward_function.py
@@ -0,0 +1,33 @@
+def reward_function(params):
+    '''
+    Example of penalizing steering, which helps mitigate zig-zag behaviors
+    '''
+
+    # Read input parameters
+    distance_from_center = params['distance_from_center']
+    track_width = params['track_width']
+    steering = abs(params['steering_angle'])  # Only need the absolute steering angle
+
+    # Calculate 3 markers that are farther and farther away from the center line
+    marker_1 = 0.1 * track_width
+    marker_2 = 0.25 * track_width
+    marker_3 = 0.5 * track_width
+
+    # Give a higher reward if the car is closer to the center line and vice versa
+    if distance_from_center <= marker_1:
+        reward = 1
+    elif distance_from_center <= marker_2:
+        reward = 0.5
+    elif distance_from_center <= marker_3:
+        reward = 0.1
+    else:
+        reward = 1e-3  # likely crashed / close to off track
+
+    # Steering penalty threshold, change the number based on your action space setting
+    ABS_STEERING_THRESHOLD = 15
+
+    # Penalize reward if the car is steering too much
+    if steering > ABS_STEERING_THRESHOLD:
+        reward *= 0.8
+
+    return float(reward)
diff --git a/defaults/template-run.env b/defaults/template-run.env
new file mode 100644
index 00000000..654802ec
--- /dev/null
+++ b/defaults/template-run.env
@@ -0,0 +1,62 @@
+DR_RUN_ID=0
+DR_WORLD_NAME=reinvent_base
+DR_RACE_TYPE=TIME_TRIAL
+DR_CAR_NAME=FastCar
+DR_CAR_BODY_SHELL_TYPE=deepracer
+DR_CAR_COLOR=Red
+DR_DISPLAY_NAME=$DR_CAR_NAME
+DR_RACER_NAME=$DR_CAR_NAME
+DR_ENABLE_DOMAIN_RANDOMIZATION=False
+DR_EVAL_NUMBER_OF_TRIALS=3
+DR_EVAL_IS_CONTINUOUS=True
+DR_EVAL_MAX_RESETS=100
+DR_EVAL_OFF_TRACK_PENALTY=5.0
+DR_EVAL_COLLISION_PENALTY=5.0
+DR_EVAL_SAVE_MP4=False
+DR_EVAL_CHECKPOINT=last
+DR_EVAL_OPP_S3_MODEL_PREFIX=rl-deepracer-sagemaker
+DR_EVAL_OPP_CAR_BODY_SHELL_TYPE=deepracer
+DR_EVAL_OPP_CAR_NAME=FasterCar
+DR_EVAL_OPP_DISPLAY_NAME=$DR_EVAL_OPP_CAR_NAME
+DR_EVAL_OPP_RACER_NAME=$DR_EVAL_OPP_CAR_NAME
+DR_EVAL_DEBUG_REWARD=False
+DR_EVAL_RESET_BEHIND_DIST=1.0
+DR_EVAL_REVERSE_DIRECTION=False
+#DR_EVAL_RTF=1.0
+DR_TRAIN_CHANGE_START_POSITION=True
+DR_TRAIN_REVERSE_DIRECTION=False
+DR_TRAIN_ALTERNATE_DRIVING_DIRECTION=False
+DR_TRAIN_START_POSITION_OFFSET=0.0
+DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST=0.05
+DR_TRAIN_MULTI_CONFIG=False
+DR_TRAIN_MIN_EVAL_TRIALS=5
+DR_TRAIN_BEST_MODEL_METRIC=progress
+#DR_TRAIN_RTF=1.0
+#DR_TRAIN_MAX_STEPS_PER_ITERATION=10000
+DR_LOCAL_S3_MODEL_PREFIX=rl-deepracer-sagemaker
+DR_LOCAL_S3_PRETRAINED=False
+DR_LOCAL_S3_PRETRAINED_PREFIX=rl-sagemaker-pretrained
+DR_LOCAL_S3_PRETRAINED_CHECKPOINT=last
+DR_LOCAL_S3_CUSTOM_FILES_PREFIX=custom_files
+DR_LOCAL_S3_TRAINING_PARAMS_FILE=training_params.yaml
+DR_LOCAL_S3_EVAL_PARAMS_FILE=evaluation_params.yaml
+DR_LOCAL_S3_MODEL_METADATA_KEY=$DR_LOCAL_S3_CUSTOM_FILES_PREFIX/model_metadata.json 
+DR_LOCAL_S3_HYPERPARAMETERS_KEY=$DR_LOCAL_S3_CUSTOM_FILES_PREFIX/hyperparameters.json +DR_LOCAL_S3_REWARD_KEY=$DR_LOCAL_S3_CUSTOM_FILES_PREFIX/reward_function.py +DR_LOCAL_S3_METRICS_PREFIX=$DR_LOCAL_S3_MODEL_PREFIX/metrics +DR_UPLOAD_S3_PREFIX=$DR_LOCAL_S3_MODEL_PREFIX-1 +DR_OA_NUMBER_OF_OBSTACLES=6 +DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES=2.0 +DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS=False +DR_OA_IS_OBSTACLE_BOT_CAR=False +DR_OA_OBSTACLE_TYPE=box_obstacle +DR_OA_OBJECT_POSITIONS= +DR_H2B_IS_LANE_CHANGE=False +DR_H2B_LOWER_LANE_CHANGE_TIME=3.0 +DR_H2B_UPPER_LANE_CHANGE_TIME=5.0 +DR_H2B_LANE_CHANGE_DISTANCE=1.0 +DR_H2B_NUMBER_OF_BOT_CARS=3 +DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS=2.0 +DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS=False +DR_H2B_BOT_CAR_SPEED=0.2 +DR_H2B_BOT_CAR_PENALTY=5.0 \ No newline at end of file diff --git a/defaults/template-system.env b/defaults/template-system.env new file mode 100644 index 00000000..14f3348e --- /dev/null +++ b/defaults/template-system.env @@ -0,0 +1,32 @@ +DR_CLOUD= +DR_AWS_APP_REGION= +DR_UPLOAD_S3_PROFILE=default +DR_UPLOAD_S3_BUCKET= +DR_UPLOAD_S3_ROLE= +DR_LOCAL_S3_BUCKET=bucket +DR_LOCAL_S3_PROFILE= +DR_GUI_ENABLE=False +DR_KINESIS_STREAM_NAME= +DR_CAMERA_MAIN_ENABLE=True +DR_CAMERA_SUB_ENABLE=False +DR_CAMERA_KVS_ENABLE=True +DR_ENABLE_EXTRA_KVS_OVERLAY=False +DR_SIMAPP_SOURCE=awsdeepracercommunity/deepracer-simapp +DR_SIMAPP_VERSION= +DR_MINIO_IMAGE=latest +DR_ANALYSIS_IMAGE=cpu +DR_WORKERS=1 +DR_ROBOMAKER_MOUNT_LOGS=False +# DR_ROBOMAKER_MOUNT_SIMAPP_DIR= +# DR_ROBOMAKER_MOUNT_SCRIPTS_DIR=${DR_DIR}/data/scripts +DR_CLOUD_WATCH_ENABLE=False +DR_CLOUD_WATCH_LOG_STREAM_PREFIX= +DR_DOCKER_STYLE= +DR_HOST_X=False +DR_WEBVIEWER_PORT=8100 +# DR_DISPLAY=:99 +# DR_REMOTE_MINIO_URL=http://mynas:9000 +# DR_ROBOMAKER_CUDA_DEVICES=0 +# DR_SAGEMAKER_CUDA_DEVICES=0 +# DR_TELEGRAF_HOST=telegraf +# DR_TELEGRAF_PORT=8092 \ No newline at end of file diff --git a/defaults/template-worker.env b/defaults/template-worker.env new file mode 100644 index 
00000000..863ae773 --- /dev/null +++ b/defaults/template-worker.env @@ -0,0 +1,22 @@ +DR_WORLD_NAME=reInvent2019_track +DR_RACE_TYPE=TIME_TRIAL +DR_CAR_COLOR=Blue +DR_ENABLE_DOMAIN_RANDOMIZATION=False +DR_TRAIN_CHANGE_START_POSITION=True +DR_TRAIN_ALTERNATE_DRIVING_DIRECTION=False +DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST=0.05 +DR_TRAIN_START_POSITION_OFFSET=0.0 +DR_OA_NUMBER_OF_OBSTACLES=6 +DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES=2.0 +DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS=False +DR_OA_IS_OBSTACLE_BOT_CAR=False +DR_OA_OBSTACLE_TYPE=box_obstacle +DR_OA_OBJECT_POSITIONS= +DR_H2B_IS_LANE_CHANGE=False +DR_H2B_LOWER_LANE_CHANGE_TIME=3.0 +DR_H2B_UPPER_LANE_CHANGE_TIME=5.0 +DR_H2B_LANE_CHANGE_DISTANCE=1.0 +DR_H2B_NUMBER_OF_BOT_CARS=3 +DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS=2.0 +DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS=False +DR_H2B_BOT_CAR_SPEED=0.2 diff --git a/docker/.env b/docker/.env deleted file mode 100644 index da93ce80..00000000 --- a/docker/.env +++ /dev/null @@ -1,34 +0,0 @@ -WORLD_NAME=AWS_track -LOCAL_ENV_VAR_JSON_PATH=env_vars.json -MINIO_ACCESS_KEY=minio -MINIO_SECRET_KEY=miniokey -AWS_ACCESS_KEY_ID=minio -AWS_SECRET_ACCESS_KEY=miniokey -AWS_DEFAULT_REGION=us-east-1 -S3_ENDPOINT_URL=http://minio:9000 -ROS_AWS_REGION=us-east-1 -AWS_REGION=us-east-1 -MODEL_S3_PREFIX=rl-deepracer-sagemaker -MODEL_S3_BUCKET=bucket -LOCAL=True -MARKOV_PRESET_FILE=deepracer.py -XAUTHORITY=/root/.Xauthority -DISPLAY_N=:0 -METRIC_NAME=reward -METRIC_NAMESPACE=deepracer -APP_REGION=us-east-1 -SAGEMAKER_SHARED_S3_PREFIX=rl-deepracer-sagemaker -SAGEMAKER_SHARED_S3_BUCKET=bucket -TRAINING_JOB_ARN=aaa -METRICS_S3_BUCKET=bucket -METRICS_S3_OBJECT_KEY=custom_files/metric.json -ROBOMAKER_RUN_TYPE=distributed_training -TARGET_REWARD_SCORE=100000 -NUMBER_OF_EPISODES=20000 -ROBOMAKER_SIMULATION_JOB_ACCOUNT_ID=aaa -AWS_ROBOMAKER_SIMULATION_JOB_ID=aaa -MODEL_METADATA_FILE_S3_KEY=custom_files/model_metadata.json -REWARD_FILE_S3_KEY=custom_files/reward.py -BUNDLE_CURRENT_PREFIX=/app/robomaker-deepracer/simulation_ws/ 
-GPU_AVAILABLE=True -NUMBER_OF_TRIALS=5 \ No newline at end of file diff --git a/docker/docker-compose-aws.yml b/docker/docker-compose-aws.yml new file mode 100644 index 00000000..4d2137c8 --- /dev/null +++ b/docker/docker-compose-aws.yml @@ -0,0 +1,11 @@ +version: '3.7' + +services: + rl_coach: + environment: + - AWS_METADATA_SERVICE_TIMEOUT=3 + - AWS_METADATA_SERVICE_NUM_ATTEMPTS=5 + robomaker: + environment: + - AWS_METADATA_SERVICE_TIMEOUT=3 + - AWS_METADATA_SERVICE_NUM_ATTEMPTS=5 diff --git a/docker/docker-compose-cwlog.yml b/docker/docker-compose-cwlog.yml new file mode 100644 index 00000000..48f2c0a6 --- /dev/null +++ b/docker/docker-compose-cwlog.yml @@ -0,0 +1,19 @@ +version: '3.7' + +services: + rl_coach: + logging: + driver: awslogs + options: + awslogs-group: '/deepracer-for-cloud' + awslogs-create-group: 'true' + awslogs-region: ${DR_AWS_APP_REGION} + tag: "${DR_CLOUD_WATCH_LOG_STREAM_PREFIX}{{.Name}}" + robomaker: + logging: + driver: awslogs + options: + awslogs-group: '/deepracer-for-cloud' + awslogs-create-group: 'true' + awslogs-region: ${DR_AWS_APP_REGION} + tag: "${DR_CLOUD_WATCH_LOG_STREAM_PREFIX}{{.Name}}" \ No newline at end of file diff --git a/docker/docker-compose-endpoint.yml b/docker/docker-compose-endpoint.yml new file mode 100644 index 00000000..83d5eb4c --- /dev/null +++ b/docker/docker-compose-endpoint.yml @@ -0,0 +1,9 @@ +version: '3.7' + +services: + rl_coach: + environment: + - S3_ENDPOINT_URL=${DR_MINIO_URL} + robomaker: + environment: + - S3_ENDPOINT_URL=${DR_MINIO_URL} diff --git a/docker/docker-compose-eval-swarm.yml b/docker/docker-compose-eval-swarm.yml new file mode 100644 index 00000000..753dd99b --- /dev/null +++ b/docker/docker-compose-eval-swarm.yml @@ -0,0 +1,18 @@ +version: '3.7' + +services: + rl_coach: + deploy: + restart_policy: + condition: none + placement: + constraints: [node.labels.Sagemaker == true ] + robomaker: + deploy: + restart_policy: + condition: none + replicas: 1 + placement: + constraints: 
[node.labels.Robomaker == true ] + environment: + - DOCKER_REPLICA_SLOT={{.Task.Slot}} \ No newline at end of file diff --git a/docker/docker-compose-eval.yml b/docker/docker-compose-eval.yml new file mode 100644 index 00000000..c8b9b715 --- /dev/null +++ b/docker/docker-compose-eval.yml @@ -0,0 +1,36 @@ +version: '3.7' + +networks: + default: + external: true + name: sagemaker-local + +services: + rl_coach: + image: ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION} + command: ["/bin/bash", "-c", "echo No work for coach in Evaluation Mode"] + robomaker: + image: ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION} + command: ["${ROBOMAKER_COMMAND:-}"] + ports: + - "${DR_ROBOMAKER_EVAL_PORT}:8080" + environment: + - CUDA_VISIBLE_DEVICES=${DR_ROBOMAKER_CUDA_DEVICES:-} + - DEBUG_REWARD=${DR_EVAL_DEBUG_REWARD} + - WORLD_NAME=${DR_WORLD_NAME} + - MODEL_S3_PREFIX=${DR_LOCAL_S3_MODEL_PREFIX} + - MODEL_S3_BUCKET=${DR_LOCAL_S3_BUCKET} + - APP_REGION=${DR_AWS_APP_REGION} + - S3_YAML_NAME=${DR_CURRENT_PARAMS_FILE} + - KINESIS_VIDEO_STREAM_NAME=${DR_KINESIS_STREAM_NAME} + - ENABLE_KINESIS=${DR_CAMERA_KVS_ENABLE} + - ENABLE_GUI=${DR_GUI_ENABLE} + - ROLLOUT_IDX=0 + - RTF_OVERRIDE=${DR_EVAL_RTF:-} + - ROS_MASTER_URI=http://localhost:11311/ + - ROS_IP=127.0.0.1 + - GAZEBO_ARGS=${DR_GAZEBO_ARGS:-} + - GAZEBO_RENDER_ENGINE=${DR_GAZEBO_RENDER_ENGINE:-ogre} + - TELEGRAF_HOST=${DR_TELEGRAF_HOST:-} + - TELEGRAF_PORT=${DR_TELEGRAF_PORT:-} + init: true \ No newline at end of file diff --git a/docker/docker-compose-keys.yml b/docker/docker-compose-keys.yml new file mode 100644 index 00000000..2fb3aebe --- /dev/null +++ b/docker/docker-compose-keys.yml @@ -0,0 +1,11 @@ +version: '3.7' + +services: + rl_coach: + environment: + - AWS_ACCESS_KEY_ID=${DR_LOCAL_ACCESS_KEY_ID} + - AWS_SECRET_ACCESS_KEY=${DR_LOCAL_SECRET_ACCESS_KEY} + robomaker: + environment: + - AWS_ACCESS_KEY_ID=${DR_LOCAL_ACCESS_KEY_ID} + - AWS_SECRET_ACCESS_KEY=${DR_LOCAL_SECRET_ACCESS_KEY} diff --git a/docker/docker-compose-local-xorg-wsl.yml 
b/docker/docker-compose-local-xorg-wsl.yml new file mode 100644 index 00000000..06d9fb4a --- /dev/null +++ b/docker/docker-compose-local-xorg-wsl.yml @@ -0,0 +1,15 @@ +version: '3.7' + +services: + robomaker: + environment: + - DISPLAY + - USE_EXTERNAL_X=${DR_HOST_X} + - QT_X11_NO_MITSHM=1 + - LD_LIBRARY_PATH=/usr/lib/wsl/lib + volumes: + - '/tmp/.X11-unix/:/tmp/.X11-unix' + - '/mnt/wslg:/mnt/wslg' + - '/usr/lib/wsl:/usr/lib/wsl' + devices: + - /dev/dxg diff --git a/docker/docker-compose-local-xorg.yml b/docker/docker-compose-local-xorg.yml new file mode 100644 index 00000000..a1605421 --- /dev/null +++ b/docker/docker-compose-local-xorg.yml @@ -0,0 +1,13 @@ +version: '3.7' + +services: + robomaker: + environment: + - DISPLAY + - USE_EXTERNAL_X=${DR_HOST_X} + - XAUTHORITY=/root/.Xauthority + - QT_X11_NO_MITSHM=1 + - NVIDIA_DRIVER_CAPABILITIES=all + volumes: + - '/tmp/.X11-unix/:/tmp/.X11-unix' + - '${XAUTHORITY}:/root/.Xauthority' \ No newline at end of file diff --git a/docker/docker-compose-local.yml b/docker/docker-compose-local.yml new file mode 100644 index 00000000..5d29115a --- /dev/null +++ b/docker/docker-compose-local.yml @@ -0,0 +1,24 @@ + +version: '3.7' + +networks: + default: + external: true + name: sagemaker-local + +services: + minio: + image: minio/minio:${DR_MINIO_IMAGE} + ports: + - "9000:9000" + - "9001:9001" + command: server /data --console-address ":9001" + environment: + - MINIO_ROOT_USER=${DR_LOCAL_ACCESS_KEY_ID} + - MINIO_ROOT_PASSWORD=${DR_LOCAL_SECRET_ACCESS_KEY} + - MINIO_UID + - MINIO_GID + - MINIO_USERNAME + - MINIO_GROUPNAME + volumes: + - ${DR_DIR}/data/minio:/data diff --git a/docker/docker-compose-metrics.yml b/docker/docker-compose-metrics.yml new file mode 100644 index 00000000..68c40840 --- /dev/null +++ b/docker/docker-compose-metrics.yml @@ -0,0 +1,45 @@ + +version: '3.7' + +networks: + default: + external: true + name: sagemaker-local + +services: + telegraf: + image: telegraf:1.18-alpine + volumes: + - 
./metrics/telegraf/etc/telegraf.conf:/etc/telegraf/telegraf.conf:ro + depends_on: + - influxdb + links: + - influxdb + ports: + - '127.0.0.1:8125:8125/udp' + - '127.0.0.1:8092:8092/udp' + + influxdb: + image: influxdb:1.8-alpine + env_file: ./metrics/configuration.env + ports: + - '127.0.0.1:8886:8086' + volumes: + - influxdb_data:/var/lib/influxdb + + grafana: + image: grafana/grafana:10.4.2 + depends_on: + - influxdb + env_file: ./metrics/configuration.env + links: + - influxdb + ports: + - '3000:3000' + volumes: + - grafana_data:/var/lib/grafana + - ./metrics/grafana/provisioning/:/etc/grafana/provisioning/ + +volumes: + grafana_data: {} + influxdb_data: {} diff --git a/docker/docker-compose-mount.yml b/docker/docker-compose-mount.yml new file mode 100644 index 00000000..20ccc9f0 --- /dev/null +++ b/docker/docker-compose-mount.yml @@ -0,0 +1,6 @@ +version: '3.7' + +services: + robomaker: + volumes: + - "${DR_MOUNT_DIR}:/root/.ros/log" diff --git a/docker/docker-compose-robomaker-multi.yml b/docker/docker-compose-robomaker-multi.yml new file mode 100644 index 00000000..62718412 --- /dev/null +++ b/docker/docker-compose-robomaker-multi.yml @@ -0,0 +1,6 @@ +version: '3.7' + +services: + robomaker: + volumes: + - "${DR_DIR}/tmp/comms.${DR_RUN_ID}:/mnt/comms" diff --git a/docker/docker-compose-robomaker-scripts.yml b/docker/docker-compose-robomaker-scripts.yml new file mode 100644 index 00000000..e90d5b16 --- /dev/null +++ b/docker/docker-compose-robomaker-scripts.yml @@ -0,0 +1,6 @@ +version: '3.7' + +services: + robomaker: + volumes: + - '${DR_ROBOMAKER_MOUNT_SCRIPTS_DIR}:/scripts' \ No newline at end of file diff --git a/docker/docker-compose-simapp.yml b/docker/docker-compose-simapp.yml new file mode 100644 index 00000000..7533ac2b --- /dev/null +++ b/docker/docker-compose-simapp.yml @@ -0,0 +1,6 @@ +version: '3.7' + +services: + robomaker: + volumes: + - '${DR_ROBOMAKER_MOUNT_SIMAPP_DIR}:/opt/simapp' \ No newline at end of file diff --git 
a/docker/docker-compose-training-swarm.yml b/docker/docker-compose-training-swarm.yml new file mode 100644 index 00000000..57650970 --- /dev/null +++ b/docker/docker-compose-training-swarm.yml @@ -0,0 +1,18 @@ +version: '3.7' + +services: + rl_coach: + deploy: + restart_policy: + condition: none + placement: + constraints: [node.labels.Sagemaker == true ] + robomaker: + deploy: + restart_policy: + condition: none + replicas: ${DR_WORKERS} + placement: + constraints: [node.labels.Robomaker == true ] + environment: + - DOCKER_REPLICA_SLOT={{.Task.Slot}} \ No newline at end of file diff --git a/docker/docker-compose-training.yml b/docker/docker-compose-training.yml new file mode 100644 index 00000000..b000fa59 --- /dev/null +++ b/docker/docker-compose-training.yml @@ -0,0 +1,57 @@ +version: "3.7" + +networks: + default: + external: true + name: sagemaker-local + +services: + rl_coach: + image: ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION} + command: ["source /root/sagemaker-venv/bin/activate && python3 /opt/ml/code/rl_coach/start.py"] + working_dir: "/opt/ml/code/" + environment: + - RUN_ID=${DR_RUN_ID} + - AWS_REGION=${DR_AWS_APP_REGION} + - SAGEMAKER_IMAGE=${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION} + - PRETRAINED=${DR_LOCAL_S3_PRETRAINED} + - PRETRAINED_S3_PREFIX=${DR_LOCAL_S3_PRETRAINED_PREFIX} + - PRETRAINED_S3_BUCKET=${DR_LOCAL_S3_BUCKET} + - PRETRAINED_CHECKPOINT=${DR_LOCAL_S3_PRETRAINED_CHECKPOINT} + - MODEL_S3_PREFIX=${DR_LOCAL_S3_MODEL_PREFIX} + - MODEL_S3_BUCKET=${DR_LOCAL_S3_BUCKET} + - HYPERPARAMETER_FILE_S3_KEY=${DR_LOCAL_S3_HYPERPARAMETERS_KEY} + - MODELMETADATA_FILE_S3_KEY=${DR_LOCAL_S3_MODEL_METADATA_KEY} + - CUDA_VISIBLE_DEVICES=${DR_SAGEMAKER_CUDA_DEVICES:-} + - MAX_MEMORY_STEPS=${DR_TRAIN_MAX_STEPS_PER_ITERATION:-} + - TELEGRAF_HOST=${DR_TELEGRAF_HOST:-} + - TELEGRAF_PORT=${DR_TELEGRAF_PORT:-} + + volumes: + - "/var/run/docker.sock:/var/run/docker.sock" + - "/tmp/sagemaker:/tmp/sagemaker" + robomaker: + image: ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION} + 
command: ["${ROBOMAKER_COMMAND:-}"] + ports: + - "${DR_ROBOMAKER_TRAIN_PORT}:8080" + - "${DR_ROBOMAKER_GUI_PORT}:5900" + environment: + - WORLD_NAME=${DR_WORLD_NAME} + - SAGEMAKER_SHARED_S3_PREFIX=${DR_LOCAL_S3_MODEL_PREFIX} + - SAGEMAKER_SHARED_S3_BUCKET=${DR_LOCAL_S3_BUCKET} + - APP_REGION=${DR_AWS_APP_REGION} + - S3_YAML_NAME=${DR_CURRENT_PARAMS_FILE} + - KINESIS_VIDEO_STREAM_NAME=${DR_KINESIS_STREAM_NAME} + - ENABLE_KINESIS=${DR_CAMERA_KVS_ENABLE} + - ENABLE_GUI=${DR_GUI_ENABLE} + - CUDA_VISIBLE_DEVICES=${DR_ROBOMAKER_CUDA_DEVICES:-} + - MULTI_CONFIG + - RTF_OVERRIDE=${DR_TRAIN_RTF:-} + - ROS_MASTER_URI=http://localhost:11311/ + - ROS_IP=127.0.0.1 + - GAZEBO_ARGS=${DR_GAZEBO_ARGS:-} + - GAZEBO_RENDER_ENGINE=${DR_GAZEBO_RENDER_ENGINE:-ogre} + - TELEGRAF_HOST=${DR_TELEGRAF_HOST:-} + - TELEGRAF_PORT=${DR_TELEGRAF_PORT:-} + init: true diff --git a/docker/docker-compose-webviewer-swarm.yml b/docker/docker-compose-webviewer-swarm.yml new file mode 100644 index 00000000..bec31188 --- /dev/null +++ b/docker/docker-compose-webviewer-swarm.yml @@ -0,0 +1,15 @@ +version: '3.7' + +networks: + default: + external: true + name: sagemaker-local + +services: + proxy: + deploy: + restart_policy: + condition: none + replicas: 1 + placement: + constraints: [node.labels.Sagemaker == true ] diff --git a/docker/docker-compose-webviewer.yml b/docker/docker-compose-webviewer.yml new file mode 100644 index 00000000..8b148ae8 --- /dev/null +++ b/docker/docker-compose-webviewer.yml @@ -0,0 +1,16 @@ +version: '3.7' + +networks: + default: + external: true + name: sagemaker-local + +services: + proxy: + image: nginx + ports: + - "${DR_WEBVIEWER_PORT}:80" + volumes: + - ${DR_VIEWER_HTML}:/usr/share/nginx/html/index.html + - ${DR_NGINX_CONF}:/etc/nginx/conf.d/default.conf + diff --git a/docker/docker-compose.yml b/docker/docker-compose.yml deleted file mode 100644 index 0b9b0740..00000000 --- a/docker/docker-compose.yml +++ /dev/null @@ -1,42 +0,0 @@ -version: '3.7' - -networks: - default: - 
external: - name: sagemaker-local - -services: - minio: - image: minio/minio - ports: - - "9000:9000" - container_name: minio - command: server /data - volumes: - - ./volumes/minio:/data - restart: unless-stopped - env_file: .env - rl_coach: - image: aschu/rl_coach - env_file: .env - container_name: rl_coach - volumes: - - '//var/run/docker.sock:/var/run/docker.sock' - - '../deepracer/sagemaker-python-sdk:/deepracer/sagemaker-python-sdk' - - '../deepracer/rl_coach:/deepracer/rl_coach' - - '/robo/container:/robo/container' - depends_on: - - minio - robomaker: - image: crr0004/deepracer_robomaker:console - command: ["${ROBOMAKER_COMMAND}"] - volumes: - - ../deepracer/simulation/aws-robomaker-sample-application-deepracer/simulation_ws/src:/app/robomaker-deepracer/simulation_ws/src - - ./volumes/robo/checkpoint:/root/.ros/ - ports: - - "8080:5900" - container_name: robomaker - restart: unless-stopped - env_file: .env - depends_on: - - rl_coach diff --git a/docker/dockerfiles/rl_coach/Dockerfile b/docker/dockerfiles/rl_coach/Dockerfile deleted file mode 100644 index 19fbfcae..00000000 --- a/docker/dockerfiles/rl_coach/Dockerfile +++ /dev/null @@ -1,31 +0,0 @@ -FROM python:3.7.3-stretch - -# install docker -RUN apt-get update -RUN apt-get -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common -RUN curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add - -RUN add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable" - -RUN apt-get update -RUN apt-get -y install docker-ce - -# add required deepracer directories to the container -RUN mkdir /deepracer -RUN mkdir /deepracer/rl_coach -RUN mkdir /deepracer/sagemaker-python-sdk -WORKDIR /deepracer -ADD rl_coach rl_coach -ADD sagemaker-python-sdk sagemaker-python-sdk - -# create sagemaker configuration -RUN mkdir /root/.sagemaker -COPY config.yaml /root/.sagemaker/config.yaml - -RUN mkdir /robo -RUN mkdir /robo/container - -# install 
dependencies -RUN pip install -U sagemaker-python-sdk/ awscli ipython pandas "urllib3==1.22" "pyyaml==3.13" - -# set command -CMD (cd rl_coach; ipython rl_deepracer_coach_robomaker.py) \ No newline at end of file diff --git a/docker/metrics/configuration.env b/docker/metrics/configuration.env new file mode 100644 index 00000000..7da5a24a --- /dev/null +++ b/docker/metrics/configuration.env @@ -0,0 +1,9 @@ +# Grafana options +GF_SECURITY_ADMIN_USER=admin +GF_SECURITY_ADMIN_PASSWORD=admin +GF_INSTALL_PLUGINS= + +# InfluxDB options +INFLUXDB_DB=influx +INFLUXDB_ADMIN_USER=admin +INFLUXDB_ADMIN_PASSWORD=admin diff --git a/docker/metrics/grafana/provisioning/dashboards/dashboard.yml b/docker/metrics/grafana/provisioning/dashboards/dashboard.yml new file mode 100644 index 00000000..024acd63 --- /dev/null +++ b/docker/metrics/grafana/provisioning/dashboards/dashboard.yml @@ -0,0 +1,7 @@ +apiVersion: 1 + +providers: +- name: 'Default' + folder: '' + options: + path: /etc/grafana/provisioning/dashboards diff --git a/docker/metrics/grafana/provisioning/dashboards/deepracer-training-template.json b/docker/metrics/grafana/provisioning/dashboards/deepracer-training-template.json new file mode 100644 index 00000000..5de2df71 --- /dev/null +++ b/docker/metrics/grafana/provisioning/dashboards/deepracer-training-template.json @@ -0,0 +1,1259 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "datasource", + "uid": "grafana" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "limit": 100, + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 2, + "id": 1, + "links": [], + "panels": [ + { + "datasource": {}, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + 
"barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 9, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "reward" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 11, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 6, + "options": { + "legend": { + "calcs": [ + "min", + "mean", + "max", + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "sortBy": "Max", + "sortDesc": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "alias": "$tag_model training reward", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "A", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "reward" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + } + ] + ], + "tags": [ + { + "key": "phase", + "operator": "=", + "value": "training" + } + ] + }, + { + "alias": "$tag_model complete lap reward", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ 
+ "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "B", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "reward" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + } + ] + ], + "tags": [ + { + "key": "status", + "operator": "=", + "value": "Lap complete" + } + ] + }, + { + "alias": "$tag_model eval reward", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "C", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "reward" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + } + ] + ], + "tags": [ + { + "key": "phase", + "operator": "=", + "value": "evaluation" + } + ] + } + ], + "title": "Reward", + "type": "timeseries" + }, + { + "datasource": {}, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "points", + "fillOpacity": 3, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 4, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": 
"absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": ".*eval progress moving average" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.showPoints", + "value": "never" + }, + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byRegexp", + "options": ".*training progress moving average" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.showPoints", + "value": "never" + }, + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 11 + }, + "id": 4, + "options": { + "legend": { + "calcs": [ + "min", + "mean", + "max" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "alias": "$tag_model training progress", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "A", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "progress" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + } + ] + ], + "tags": [ + { + "key": "phase", + "operator": "=", + "value": "training" + } + ] + }, + { + "alias": "$tag_model eval progress", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, 
+ { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "B", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "progress" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + } + ] + ], + "tags": [ + { + "key": "phase", + "operator": "=", + "value": "evaluation" + } + ] + }, + { + "alias": "$tag_model eval progress moving average", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "C", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "progress" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + }, + { + "params": [ + 30 + ], + "type": "moving_average" + } + ] + ], + "tags": [ + { + "key": "phase", + "operator": "=", + "value": "evaluation" + } + ] + }, + { + "alias": "$tag_model training progress moving average", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "D", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "progress" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + }, + { + "params": [ + 30 + ], + "type": "moving_average" + } + ] + ], + "tags": [ + { + "key": "phase", + "operator": "=", + "value": "training" + } + ] + } + ], + "title": "Progress", + "type": "timeseries" + }, + { + 
"datasource": {}, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "points", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 4, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 3, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": ".*eval lap moving average" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.showPoints", + "value": "never" + }, + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + }, + { + "id": "custom.lineWidth", + "value": 2 + } + ] + }, + { + "matcher": { + "id": "byRegexp", + "options": ".*training lap moving average" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.showPoints", + "value": "never" + }, + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + }, + { + "id": "custom.lineWidth", + "value": 2 + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 21 + }, + "id": 2, + "options": { + "legend": { + "calcs": [ + "min", + "mean", + "max", + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "alias": "$tag_model 
training lap ", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "A", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "elapsed_time" + ], + "type": "field" + }, + { + "params": [], + "type": "min" + } + ] + ], + "tags": [ + { + "key": "status", + "operator": "=", + "value": "Lap complete" + }, + { + "condition": "AND", + "key": "phase", + "operator": "=", + "value": "training" + } + ] + }, + { + "alias": "$tag_model eval lap ", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "B", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "elapsed_time" + ], + "type": "field" + }, + { + "params": [], + "type": "min" + } + ] + ], + "tags": [ + { + "key": "status", + "operator": "=", + "value": "Lap complete" + }, + { + "condition": "AND", + "key": "phase", + "operator": "=", + "value": "evaluation" + } + ] + }, + { + "alias": "$tag_model eval lap moving average", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "C", + "resultFormat": "time_series", + "select": [ + [ + { + 
"params": [ + "elapsed_time" + ], + "type": "field" + }, + { + "params": [], + "type": "min" + }, + { + "params": [ + 30 + ], + "type": "moving_average" + } + ] + ], + "tags": [ + { + "key": "status", + "operator": "=", + "value": "Lap complete" + }, + { + "condition": "AND", + "key": "phase", + "operator": "=", + "value": "evaluation" + } + ] + }, + { + "alias": "$tag_model training lap moving average", + "datasource": { + "type": "influxdb", + "uid": "${DS_INFLUXDB}" + }, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "hide": false, + "measurement": "dr_training_episodes", + "orderByTime": "ASC", + "policy": "default", + "refId": "D", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "elapsed_time" + ], + "type": "field" + }, + { + "params": [], + "type": "min" + }, + { + "params": [ + 30 + ], + "type": "moving_average" + } + ] + ], + "tags": [ + { + "key": "status", + "operator": "=", + "value": "Lap complete" + }, + { + "condition": "AND", + "key": "phase", + "operator": "=", + "value": "training" + } + ] + } + ], + "title": "Training Complete Lap times", + "type": "timeseries" + }, + { + "datasource": {}, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "points", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + 
"mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": ".*entropy moving average" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.showPoints", + "value": "never" + }, + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 11, + "w": 24, + "x": 0, + "y": 31 + }, + "id": 7, + "options": { + "legend": { + "calcs": [ + "min", + "mean", + "max", + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "alias": "$tag_model entropy", + "datasource": {}, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + }, + { + "params": [ + "none" + ], + "type": "fill" + } + ], + "measurement": "dr_sagemaker_epochs", + "orderByTime": "ASC", + "policy": "default", + "refId": "A", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "entropy" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + } + ] + ], + "tags": [] + }, + { + "alias": "$tag_model entropy moving average", + "datasource": {}, + "groupBy": [ + { + "params": [ + "$__interval" + ], + "type": "time" + }, + { + "params": [ + "model" + ], + "type": "tag" + } + ], + "hide": false, + "measurement": "dr_sagemaker_epochs", + "orderByTime": "ASC", + "policy": "default", + "refId": "B", + "resultFormat": "time_series", + "select": [ + [ + { + "params": [ + "entropy" + ], + "type": "field" + }, + { + "params": [], + "type": "mean" + }, + { + "params": [ + 10 + ], + "type": "moving_average" + } + ] + ], + "tags": [] + } + ], + "title": "Epoch", + "type": "timeseries" + } + ], + "refresh": "10s", + "schemaVersion": 
39, + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "DeepRacer Training template", + "uid": "adke0lwv5zwg0e", + "version": 1, + "weekStart": "" +} \ No newline at end of file diff --git a/docker/metrics/grafana/provisioning/datasources/influxdb.yml b/docker/metrics/grafana/provisioning/datasources/influxdb.yml new file mode 100644 index 00000000..8a254bfd --- /dev/null +++ b/docker/metrics/grafana/provisioning/datasources/influxdb.yml @@ -0,0 +1,46 @@ +# config file version +apiVersion: 1 + +# list of datasources that should be deleted from the database +deleteDatasources: + - name: Influxdb + orgId: 1 + +# list of datasources to insert/update depending on +# what's available in the database +datasources: + # name of the datasource. Required +- name: InfluxDB + # datasource type. Required + type: influxdb + # access mode. direct or proxy. Required + access: proxy + # org id. will default to orgId 1 if not specified + orgId: 1 + # url + url: http://influxdb:8086 + # database password, if used + password: "admin" + # database user, if used + user: "admin" + # database name, if used + database: "influx" + # enable/disable basic auth + basicAuth: false +# withCredentials: + # mark as default datasource. Max one per org + isDefault: true + # fields that will be converted to json and stored in json_data + jsonData: + timeInterval: "5s" +# graphiteVersion: "1.1" +# tlsAuth: false +# tlsAuthWithCACert: false +# # json object of data that will be encrypted. +# secureJsonData: +# tlsCACert: "..." +# tlsClientCert: "..." +# tlsClientKey: "..." + version: 1 + # allow users to edit datasources from the UI.
+ editable: false diff --git a/docker/metrics/telegraf/etc/telegraf.conf b/docker/metrics/telegraf/etc/telegraf.conf new file mode 100644 index 00000000..6eb80aea --- /dev/null +++ b/docker/metrics/telegraf/etc/telegraf.conf @@ -0,0 +1,215 @@ +# Telegraf configuration + +# Telegraf is entirely plugin driven. All metrics are gathered from the +# declared inputs, and sent to the declared outputs. + +# Plugins must be declared in here to be active. +# To deactivate a plugin, comment out the name and any variables. + +# Use 'telegraf -config telegraf.conf -test' to see what metrics a config +# file would generate. + +# Global tags can be specified here in key="value" format. +[global_tags] + # dc = "us-east-1" # will tag all metrics with dc=us-east-1 + # rack = "1a" + +# Configuration for telegraf agent +[agent] + ## Default data collection interval for all inputs + interval = "5s" + ## Rounds collection interval to 'interval' + ## ie, if interval="10s" then always collect on :00, :10, :20, etc. + round_interval = true + + ## Telegraf will cache metric_buffer_limit metrics for each output, and will + ## flush this buffer on a successful write. + metric_buffer_limit = 10000 + ## Flush the buffer whenever full, regardless of flush_interval. + flush_buffer_when_full = true + + ## Collection jitter is used to jitter the collection by a random amount. + ## Each plugin will sleep for a random time within jitter before collecting. + ## This can be used to avoid many plugins querying things like sysfs at the + ## same time, which can have a measurable effect on the system. + collection_jitter = "0s" + + ## Default flushing interval for all outputs. You shouldn't set this below + ## interval. Maximum flush_interval will be flush_interval + flush_jitter + flush_interval = "1s" + ## Jitter the flush interval by a random amount. This is primarily to avoid + ## large write spikes for users running a large number of telegraf instances. 
+ ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s + flush_jitter = "0s" + + ## Run telegraf in debug mode + debug = false + ## Run telegraf in quiet mode + quiet = false + ## Override default hostname, if empty use os.Hostname() + hostname = "" + + +############################################################################### +# OUTPUTS # +############################################################################### + +# Configuration for influxdb server to send metrics to +[[outputs.influxdb]] + # The full HTTP or UDP endpoint URL for your InfluxDB instance. + # Multiple urls can be specified but it is assumed that they are part of the same + # cluster, this means that only ONE of the urls will be written to each interval. + # urls = ["udp://localhost:8089"] # UDP endpoint example + urls = ["http://influxdb:8086"] # required + # The target database for metrics (telegraf will create it if not exists) + database = "influx" # required + # Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h". + # note: using second precision greatly helps InfluxDB compression + precision = "s" + + ## Write timeout (for the InfluxDB client), formatted as a string. + ## If not provided, will default to 5s. 0s means no timeout (not recommended). 
+ timeout = "5s" + # username = "telegraf" + # password = "metricsmetricsmetricsmetrics" + # Set the user agent for HTTP POSTs (can be useful for log differentiation) + # user_agent = "telegraf" + # Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes) + # udp_payload = 512 + + +############################################################################### +# INPUTS # +############################################################################### +# Statsd Server +[[inputs.statsd]] + ## Protocol, must be "tcp", "udp4", "udp6" or "udp" (default=udp) + protocol = "udp" + + ## MaxTCPConnection - applicable when protocol is set to tcp (default=250) + max_tcp_connections = 250 + + ## Enable TCP keep alive probes (default=false) + tcp_keep_alive = false + + ## Specifies the keep-alive period for an active network connection. + ## Only applies to TCP sockets and will be ignored if tcp_keep_alive is false. + ## Defaults to the OS configuration. + # tcp_keep_alive_period = "2h" + + ## Address and port to host UDP listener on + service_address = ":8125" + + ## The following configuration options control when telegraf clears it's cache + ## of previous values. If set to false, then telegraf will only clear it's + ## cache when the daemon is restarted. 
+ ## Reset gauges every interval (default=true) + delete_gauges = true + ## Reset counters every interval (default=true) + delete_counters = true + ## Reset sets every interval (default=true) + delete_sets = true + ## Reset timings & histograms every interval (default=true) + delete_timings = true + + ## Percentiles to calculate for timing & histogram stats + percentiles = [90] + + ## separator to use between elements of a statsd metric + metric_separator = "_" + + ## Parses tags in the datadog statsd format + ## http://docs.datadoghq.com/guides/dogstatsd/ + parse_data_dog_tags = false + + ## Statsd data translation templates, more info can be read here: + ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite + # templates = [ + # "cpu.* measurement*" + # ] + + ## Number of UDP messages allowed to queue up, once filled, + ## the statsd server will start dropping packets + allowed_pending_messages = 10000 + + ## Number of timing/histogram values to track per-measurement in the + ## calculation of percentiles. Raising this limit increases the accuracy + ## of percentiles but also increases the memory usage and cpu time. + percentile_limit = 1000 + + ## Maximum socket buffer size in bytes, once the buffer fills up, metrics + ## will start dropping. Defaults to the OS default. + # read_buffer_size = 65535 + +# Read metrics about cpu usage +[[inputs.cpu]] + ## Whether to report per-cpu stats or not + percpu = true + ## Whether to report total system cpu stats or not + totalcpu = true + ## Comment this line if you want the raw CPU time metrics + fielddrop = ["time_*"] + + +# Read metrics about disk usage by mount point +[[inputs.disk]] + ## By default, telegraf gather stats for all mountpoints. + ## Setting mountpoints will restrict the stats to the specified mountpoints. + # mount_points = ["/"] + + ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually + ## present on /run, /var/run, /dev/shm or /dev). 
+ ignore_fs = ["tmpfs", "devtmpfs"]
+
+
+# Read metrics about disk IO by device
+[[inputs.diskio]]
+ ## By default, telegraf will gather stats for all devices including
+ ## disk partitions.
+ ## Setting devices will restrict the stats to the specified devices.
+ # devices = ["sda", "sdb"]
+ ## Uncomment the following line if you need disk serial numbers.
+ # skip_serial_number = false
+
+
+# Get kernel statistics from /proc/stat
+[[inputs.kernel]]
+ # no configuration
+
+
+# Read metrics about memory usage
+[[inputs.mem]]
+ # no configuration
+
+
+# Get the number of processes and group them by status
+[[inputs.processes]]
+ # no configuration
+
+
+# Read metrics about swap memory usage
+[[inputs.swap]]
+ # no configuration
+
+
+# Read metrics about system load & uptime
+[[inputs.system]]
+ # no configuration
+
+# Read metrics about network interface usage
+[[inputs.net]]
+ # collect data only about specific interfaces
+ # interfaces = ["eth0"]
+
+
+[[inputs.netstat]]
+ # no configuration
+
+[[inputs.interrupts]]
+ # no configuration
+
+[[inputs.linux_sysctl_fs]]
+ # no configuration
+
+[[inputs.socket_listener]]
+ service_address = "udp://:8092"
\ No newline at end of file
diff --git a/docs/_config.yml b/docs/_config.yml
new file mode 100644
index 00000000..5c24e7b9
--- /dev/null
+++ b/docs/_config.yml
@@ -0,0 +1,9 @@
+---
+theme: jekyll-theme-slate
+markdown: GFM
+name: Deepracer-for-Cloud
+plugins:
+ - jekyll-relative-links
+relative_links:
+ enabled: true
+ collections: false
\ No newline at end of file
diff --git a/docs/docker.md b/docs/docker.md
new file mode 100644
index 00000000..8a3ce105
--- /dev/null
+++ b/docs/docker.md
@@ -0,0 +1,49 @@
+# About the Docker setup
+
+DRfC supports running Docker in two modes, `swarm` and `compose` - this behaviour is configured in `system.env` through `DR_DOCKER_STYLE`.
+
+## Swarm Mode
+
+Docker Swarm mode is the default. Docker Swarm makes it possible to connect multiple hosts together to spread the load -- esp.
useful if one wants to run multiple Robomaker workers, but can also be useful locally if one has two computers that each are not powerful enough to run DeepRacer.
+
+In Swarm mode DRfC creates Stacks, using `docker stack`. During operations one can check running stacks through `docker stack ls`, and running services through `docker service ls`.
+
+DRfC is installed only on the manager (the first installed host). Swarm workers are 'dumb' and do not need to have DRfC installed.
+
+### Key features
+
+* Allows the user to connect multiple computers on the same network. (In AWS the instances must be connected to the same VPC, and must be allowed to communicate.)
+* Supports [multiple Robomaker workers](multi_worker.md)
+* Supports [running multiple parallel experiments](multi_run.md)
+
+### Limitations
+
+* The Sagemaker container can only be run on the manager.
+* Docker images are downloaded from Docker Hub. Locally built images are allowed only if they have a unique tag that is not in Docker Hub. If you have multiple Docker nodes ensure that they all have the image available.
+
+### Connecting Workers
+
+* On the manager run `docker swarm join-token worker`.
+* On the worker run the command that was displayed on the manager: `docker swarm join --token <token> <manager-ip>:<port>`.
+
+### Ports
+
+Docker Swarm will automatically put a load-balancer in front of all replicas in a service. This means that the ROS Web View, which provides a video stream of the DeepRacer during training, will be load balanced - sharing one port (`8080`). If you have multiple workers (even across multiple hosts) then press F5 to cycle through them.
+
+## Compose Mode
+
+In Compose mode DRfC creates Services, using `docker compose`. During operations one can check running projects through `docker compose ls`, and running containers through `docker compose ps` or `docker ps`.
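Regardless of mode, switching between the two styles is a single setting; a minimal `system.env` sketch (the value shown here selects Compose, `swarm` is the default):

```shell
# system.env -- Docker orchestration style (swarm is the default)
DR_DOCKER_STYLE=compose
```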
+
+### Key features
+
+* Supports [multiple Robomaker workers](multi_worker.md)
+* Supports [running multiple parallel experiments](multi_run.md)
+* Supports [GPU Accelerated OpenGL for Robomaker](opengl.md)
+
+### Limitations
+
+* Workload cannot be spread across multiple hosts.
+
+### Ports
+
+In the case of using Docker Compose, each Robomaker worker will require unique ports for the ROS Web View and VNC. Docker will assign these dynamically. Use `docker ps` to see which container has been assigned which ports.
diff --git a/docs/head-to-head.md b/docs/head-to-head.md
new file mode 100644
index 00000000..f8ea6ba1
--- /dev/null
+++ b/docs/head-to-head.md
@@ -0,0 +1,26 @@
+# Head-to-Head Race (Beta)
+
+It is possible to run a head-to-head race, similar to the bracket races
+run by AWS in the Virtual Circuit to determine the winner of the head-to-bot races.
+
+This replaces the "Tournament Mode".
+
+## Introduction
+
+The concept is that you have two models racing each other: one Purple and one Orange car. One car
+is powered by the primary configured model, and the second car is powered by the model in `DR_EVAL_OPP_S3_MODEL_PREFIX`.
+
+## Configuration
+
+### run.env
+
+Configure `run.env` with the following parameters:
+* `DR_RACE_TYPE` should be `HEAD_TO_MODEL`.
+* `DR_EVAL_OPP_S3_MODEL_PREFIX` will be the S3 prefix for the secondary model.
+* `DR_EVAL_OPP_CAR_NAME` is the display name of this model.
+
+Metrics, traces and videos will be stored in each model's prefix.
+
+## Run
+
+Run the race with `dr-start-evaluation`; one race will be run.
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 00000000..a65ef1cb
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,48 @@
+# Introduction
+
+Provides a quick and easy way to get up and running with a DeepRacer training environment in AWS or Azure, using either the Azure [N-Series Virtual Machines](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [AWS EC2 Accelerated Computing instances](https://aws.amazon.com/ec2/instance-types/?nc1=h_ls#Accelerated_Computing), or locally on your own desktop or server.
+
+DeepRacer-For-Cloud (DRfC) started as an extension of the work done by Alex (https://github.com/alexschultz/deepracer-for-dummies), which is again a wrapper around the amazing work done by Chris (https://github.com/crr0004/deepracer). With the introduction of the second generation DeepRacer Console the repository has been split up. This repository contains the scripts needed to *run* the training, but depends on Docker Hub to provide pre-built docker images. All the under-the-hood building capabilities have been moved to the [Deepracer Build](https://github.com/aws-deepracer-community/deepracer) repository.
+
+# Main Features
+
+DRfC supports a wide set of features to ensure that you can focus on creating the best model:
+* User-friendly
+  * Based on the continuously updated community [Robomaker](https://github.com/aws-deepracer-community/deepracer-simapp) and [Sagemaker](https://github.com/aws-deepracer-community/deepracer-sagemaker-container) containers, supporting a wide range of CPU and GPU setups.
+  * A wide set of scripts (`dr-*`) enables effortless training.
+  * Detection of your AWS DeepRacer Console models; allows upload of a locally trained model to any of them.
+* Modes
+  * Time Trial
+  * Object Avoidance
+  * Head-to-Bot
+* Training
+  * Multiple Robomaker instances per Sagemaker (N:1) to improve training progress.
+  * Multiple training sessions in parallel - each being (N:1) if hardware supports it - to test out multiple things at once.
+  * Connect multiple nodes together (Swarm-mode only) to combine the powers of multiple computers/instances.
+* Evaluation
+  * Evaluate independently from training.
+  * Save evaluation run to MP4 file in S3.
+* Logging
+  * Training metrics and trace files are stored to S3.
+  * Optional integration with AWS CloudWatch.
+  * Optional exposure of Robomaker internal log-files.
+* Technology
+  * Supports both Docker Swarm (used for connecting multiple nodes together) and Docker Compose (used to support OpenGL).
+
+# Documentation
+
+* [Initial Installation](installation.md)
+* [Upload Model to Console](upload.md)
+* [Reference](reference.md)
+* [Using multiple Robomaker workers](multi_worker.md)
+* [Running multiple parallel experiments](multi_run.md)
+* [GPU Accelerated OpenGL for Robomaker](opengl.md)
+* [Having multiple GPUs in one Computer](multi_gpu.md)
+* [Installing on Windows](windows.md)
+* [Run a Head-to-Head Race](head-to-head.md)
+* [Watching the car](video.md)
+
+# Support
+
+* For general support it is suggested to join the [AWS DeepRacer Community](https://deepracing.io/). The Community Slack has a channel #dr-training-local where the community provides active support.
+* Create a GitHub issue if you find an actual code issue, or where the documentation needs updating.
diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 00000000..3de6c365
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,154 @@
+# Installing Deepracer-for-Cloud
+
+## Requirements
+
+Depending on your needs, as well as the specifics of the cloud platform, you can configure your VM to your liking. Both CPU-only and GPU systems are supported.
+
+**AWS**:
+
+* EC2 instance of type G3, G4, P2 or P3 - recommendation is g4dn.2xlarge - for GPU-enabled training. C5 or M6 types - recommendation is c5.2xlarge - for CPU training.
+  * Ubuntu 22.04 or 24.04
+  * Minimum 30 GB, preferred 40 GB of OS disk.
+  * Ephemeral drive connected.
+  * Minimum of 8 GB GPU-RAM if running with GPU.
+  * Recommended at least 6 vCPUs.
+* S3 bucket. Preferably in the same region as the EC2 instance.
+* The internal `sagemaker-local` docker network runs by default on `192.168.2.0/24`. Ensure that your AWS VPC does not overlap with this subnet.
+
+**Azure**:
+
+* N-Series VM that comes with an NVIDIA graphics adapter - recommendation is NC6_Standard
+  * Ubuntu 22.04 or 24.04
+  * Standard 30 GB OS drive is sufficient to get started.
+  * Recommended to add an additional 32 GB data disk if you want to use the Log Analysis container.
+  * Minimum 8 GB GPU-RAM
+  * Recommended at least 6 vCPUs
+* Storage Account with one Blob container configured for Access Key authentication.
+
+**Local**:
+
+* A modern, comparatively powerful, Intel-based system.
+  * Ubuntu 22.04 or 24.04; other Linux distros are likely to work.
+  * 4-core CPU, equivalent to 8 vCPUs; the more the better.
+  * NVIDIA graphics adapter with minimum 8 GB RAM for Sagemaker to run on GPU. GPU-enabled Robomaker instances need ~1 GB each.
+  * System RAM + GPU RAM should be at least 32 GB.
+* Running DRfC on Ubuntu under Windows Subsystem for Linux 2 is possible. See [Installing on Windows](windows.md).
+
+## Installation
+
+The package comes with preparation and setup scripts that allow a turn-key setup for a fresh virtual machine.
+
+```shell
+git clone https://github.com/aws-deepracer-community/deepracer-for-cloud.git
+```
+
+**For cloud setup** execute:
+
+```shell
+cd deepracer-for-cloud && ./bin/prepare.sh
+```
+
+This will prepare the VM by partitioning additional drives as well as installing all prerequisites. After a reboot it will continue to run `./bin/init.sh`, setting up the full repository and downloading the core Docker images. Depending on your environment this may take up to 30 minutes. The scripts will create a file `DONE` once completed.
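The `DONE` marker makes it easy to script around the long-running setup. A minimal sketch for checking it (the `DRFC_DIR` location is an assumption -- point it at your actual clone):

```shell
#!/usr/bin/env bash
# Report whether the DRfC setup scripts have finished, based on the DONE
# marker file they create on completion. DRFC_DIR is an assumed location.
DRFC_DIR="${DRFC_DIR:-$HOME/deepracer-for-cloud}"
if [ -f "$DRFC_DIR/DONE" ]; then
    echo "setup complete"
else
    echo "setup still running (no DONE marker yet)"
fi
```

Wrap the check in a loop with a `sleep` if you want to wait until setup finishes.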
+
+The installation script will adapt `.profile` to ensure that all settings are applied on login. Otherwise run the activation with `source bin/activate.sh`.
+
+**For local install** it is recommended *not* to run the `bin/prepare.sh` script; it may make more changes than you want. Rather ensure that all prerequisites are set up and run `bin/init.sh` directly.
+
+See also the [following article](https://awstip.com/deepracer-for-cloud-drfc-local-setup-3c6418b2c75a) for guidance.
+
+The init script takes a few parameters:
+
+| Variable | Description |
+|----------|-------------|
+| `-c <cloud>` | Sets the cloud version to be configured; automatically updates the `DR_CLOUD` parameter in `system.env`. Options are `azure`, `aws` or `local`. Default is `local`. |
+| `-a <arch>` | Sets the architecture to be configured. Either `cpu` or `gpu`. Default is `gpu`. |
+
+## Environment Setup
+
+The initialization script will attempt to auto-detect your environment (`Azure`, `AWS` or `Local`), and store the outcome in the `DR_CLOUD` parameter in `system.env`. You can also pass in a `-c <cloud>` parameter to override it, e.g. if you want to run the minio-based `local` mode in the cloud.
+
+The main difference between the modes lies in the authentication mechanisms and the type of storage being configured. The next sections review each type of environment on its own.
+
+### AWS
+
+In AWS it is possible to set up authentication to S3 in two ways: integrated sign-on using [IAM Roles](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) or using access keys.
+
+#### IAM Role
+
+To use IAM Roles you need:
+
+* An empty S3 bucket in the same region as the EC2 instance.
+* An IAM Role that has permissions to:
+  * Access both the *new* S3 bucket as well as the DeepRacer bucket.
+  * AmazonVPCReadOnlyAccess
+  * AmazonKinesisVideoStreamsFullAccess if you want to stream to Kinesis
+  * CloudWatch
+* An EC2 instance with the defined IAM Role assigned.
+* Configure `system.env` as follows:
+  * `DR_LOCAL_S3_PROFILE=default`
+  * `DR_LOCAL_S3_BUCKET=<bucket_name>`
+  * `DR_UPLOAD_S3_PROFILE=default`
+  * `DR_UPLOAD_S3_BUCKET=<bucket_name>`
+* Run `dr-update` for the configuration to take effect.
+
+#### Manual setup
+
+For access with an IAM user you need:
+
+* An empty S3 bucket in the same region as the EC2 instance.
+* A real AWS IAM user set up with access keys:
+  * The user should have permissions to access the *new* bucket as well as the dedicated DeepRacer S3 bucket.
+  * Use `aws configure` to configure this into the default profile.
+* Configure `system.env` as follows:
+  * `DR_LOCAL_S3_PROFILE=default`
+  * `DR_LOCAL_S3_BUCKET=<bucket_name>`
+  * `DR_UPLOAD_S3_PROFILE=default`
+  * `DR_UPLOAD_S3_BUCKET=<bucket_name>`
+* Run `dr-update` for the configuration to take effect.
+
+### Azure
+
+Minio has deprecated the gateway feature that exposed an Azure Blob Storage as an S3 bucket. Azure mode now sets up minio in the same way as in local mode.
+
+If you want to use awscli (`aws`) to manually move files then use `aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 ...`, as this will set both the `--profile` and `--endpoint-url` parameters to match your configuration.
+
+### Local
+
+Local mode runs a minio server that hosts the data in the `docker/volumes` directory. It is otherwise command-compatible with the Azure setup, as the data is accessible via Minio and not via native S3.
+
+In Local mode the script-set requires the following:
+
+* Configure the Minio credentials with `aws configure --profile minio`. The default configuration will use the `minio` profile to configure Minio. You can choose any username and password, but the username needs to be at least 3 characters and the password at least 8.
+* A real AWS IAM user configured with `aws configure` to enable upload of models into AWS DeepRacer.
+* Configure `system.env` as follows:
+  * `DR_LOCAL_S3_PROFILE=minio`
+  * `DR_LOCAL_S3_BUCKET=<bucket_name>`
+  * `DR_UPLOAD_S3_PROFILE=default`
+  * `DR_UPLOAD_S3_BUCKET=<bucket_name>`
+* Run `dr-update` for the configuration to take effect.
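Pulling the Local-mode bullets above together, the relevant part of `system.env` might look like the following sketch (the bucket names are hypothetical examples):

```shell
# system.env excerpt for Local mode -- bucket names are hypothetical examples
DR_CLOUD=local
DR_LOCAL_S3_PROFILE=minio               # matches the `aws configure --profile minio` step above
DR_LOCAL_S3_BUCKET=my-drfc-bucket       # hosted by the local minio server
DR_UPLOAD_S3_PROFILE=default            # real AWS credentials, for console uploads
DR_UPLOAD_S3_BUCKET=my-deepracer-bucket # real S3 bucket in AWS
```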
+
+## First Run
+
+For the first run the following final steps are needed. This creates a training run with all default values.
+
+* Define your custom files in `custom_files/` - samples can be found in `defaults`, which you must copy over:
+  * `hyperparameters.json` - defining the training hyperparameters
+  * `model_metadata.json` - defining the action space and sensors
+  * `reward_function.py` - defining the reward function
+* Upload the files into the bucket with `dr-upload-custom-files`. This will also start minio if required.
+* Start training with `dr-start-training`.
+
+After a while you will see the Sagemaker logs on the screen.
+
+## Troubleshooting
+
+Here are some hints for troubleshooting specific issues you may encounter.
+
+### Local training troubleshooting
+
+| Issue | Troubleshooting hint |
+|------------- | ---------------------|
+Get messages like "Sagemaker is not running" | Run `docker ps -a` to see if the containers are running or if they stopped due to some errors. If running after a fresh install, try restarting the system.
+Check docker errors for specific container | Run `docker logs -f <container_name>`
+Get message "Error response from daemon: could not choose an IP address to advertise since this system has multiple addresses on interface ..." when running `./bin/init.sh -c local -a cpu` | It means you have multiple IP addresses and you need to specify one within `./bin/init.sh`.
If you don't care which one to use, you can get the first one by running ```ifconfig \| grep $(route \| awk '/^default/ {print $8}') -a1 \| grep -o -P '(?<=inet ).*(?= netmask)'```.
Edit `./bin/init.sh`, locate the line `docker swarm init` and change it to `docker swarm init --advertise-addr <your_ip_address>`.
Rerun `./bin/init.sh -c local -a cpu`
+I don't have any of the `dr-*` commands | Run `source bin/activate.sh`.
diff --git a/docs/metrics.md b/docs/metrics.md
new file mode 100644
index 00000000..180ce811
--- /dev/null
+++ b/docs/metrics.md
@@ -0,0 +1,40 @@
+# Realtime Metrics
+
+It is possible to collect and visualise real-time metrics using the optional telegraf/influxdb/grafana stack.
+
+```mermaid
+flowchart TD
+ A(Robomaker) --> B(Telegraf)
+ B --> C(InfluxDB)
+ C --> D(Grafana)
+```
+
+When enabled the Robomaker containers will send UDP metrics to Telegraf, which enriches and stores the metrics in the InfluxDB timeseries database container.
+
+Grafana provides a presentation layer for interactive dashboards.
+
+## Initial config and start-up
+
+To enable the feature simply uncomment the lines in `system.env` for `DR_TELEGRAF_HOST` and `DR_TELEGRAF_PORT`. In most cases the default values should work without modification.
+
+Start the metrics docker stack using `dr-start-metrics`.
+
+Once running, Grafana should be accessible via a web browser on port 3000, e.g. http://localhost:3000.
+The default username is `admin`, password `admin`. You will be prompted to set your own password on first login.
+
+*Note: Grafana can take 60-90 seconds to perform initial internal setup the first time it is started. The web UI will not be available until this is complete. You can check the status by viewing the grafana container logs if necessary.*
+
+The metrics stack will remain running until stopped (`dr-stop-metrics`) or the machine is rebooted. It does not need to be restarted in between training runs and should automatically pick up metrics from new models.
+
+## Using the dashboards
+
+A template dashboard is provided to show how to access basic DeepRacer metrics. You can use this dashboard as a base to build your own more customised dashboards.
+
+After connecting to the Grafana Web UI with a browser, use the menu to browse to the Dashboards section.
+
+The template dashboard called `DeepRacer Training template` should be visible, showing graphs of reward, progress, and completed lap times.
+
+As this is an automatically provisioned dashboard you are not able to save changes to it. However, you can copy it by clicking on the small cog icon to enter the dashboard settings page, and then clicking `Save as` to make an editable copy.
+
+A full user guide on how to work with dashboards is available on the [Grafana website](https://grafana.com/docs/grafana/latest/dashboards/use-dashboards/).
+
diff --git a/docs/multi_gpu.md b/docs/multi_gpu.md
new file mode 100644
index 00000000..037a4de5
--- /dev/null
+++ b/docs/multi_gpu.md
@@ -0,0 +1,53 @@
+# Training on a Computer with more than one GPU
+
+In some cases you might end up with a computer with more than one GPU. This is common on a workstation
+that has one GPU for general graphics (e.g. GTX 10-series, RTX 20-series), as well as a data center GPU
+like a Tesla K40, K80 or M40.
+
+In this setting it can get a bit chaotic, as DeepRacer will 'greedily' put any workload on any GPU - which will
+lead to out-of-memory errors somewhere down the road.
+
+## Checking available GPUs
+
+You can use Tensorflow to give you an overview of available devices by running `utils/cuda-check.sh`.
+ +It will say something like: +``` +2020-07-04 12:25:55.179580: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA +2020-07-04 12:25:55.547206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: +name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.68 +pciBusID: 0000:04:00.0 +totalMemory: 3.82GiB freeMemory: 3.30GiB +2020-07-04 12:25:55.732066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 1 with properties: +name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate(GHz): 1.112 +pciBusID: 0000:81:00.0 +totalMemory: 22.41GiB freeMemory: 22.30GiB +2020-07-04 12:25:55.732141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0, 1 +2020-07-04 12:25:56.745647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix: +2020-07-04 12:25:56.745719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0 1 +2020-07-04 12:25:56.745732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N N +2020-07-04 12:25:56.745743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1: N N +2020-07-04 12:25:56.745973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 195 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:04:00.0, compute capability: 7.5) +2020-07-04 12:25:56.750352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 1147 MB memory) -> physical GPU (device: 1, name: Tesla M40 24GB, pci bus id: 0000:81:00.0, compute capability: 5.2) +2020-07-04 12:25:56.774305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0, 1 +2020-07-04 12:25:56.774408: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
+2020-07-04 12:25:56.774425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0 1
+2020-07-04 12:25:56.774436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N N
+2020-07-04 12:25:56.774446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1: N N
+2020-07-04 12:25:56.774551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:0 with 195 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:04:00.0, compute capability: 7.5)
+2020-07-04 12:25:56.774829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:1 with 1147 MB memory) -> physical GPU (device: 1, name: Tesla M40 24GB, pci bus id: 0000:81:00.0, compute capability: 5.2)
+['/device:GPU:0', '/device:GPU:1']
+```
+In this case CUDA device #0 is the GTX 1650 and CUDA device #1 is the Tesla M40.
+
+### Selecting Device
+
+To control the CUDA assignment for Sagemaker and Robomaker, set the following two variables in `system.env`:
+
+```
+DR_ROBOMAKER_CUDA_DEVICES=0
+DR_SAGEMAKER_CUDA_DEVICES=1
+```
+
+The number is the CUDA number of the GPU you want the containers to use.
+
diff --git a/docs/multi_run.md b/docs/multi_run.md
new file mode 100644
index 00000000..d0e6de38
--- /dev/null
+++ b/docs/multi_run.md
@@ -0,0 +1,17 @@
+# Running Multiple Experiments
+
+It is possible to run multiple experiments on one computer in parallel. This works both in `swarm` and `compose` mode, and is controlled by `DR_RUN_ID` in `run.env`.
+
+The feature works by creating unique prefixes for the container names:
+* In Swarm mode this is done through defining a stack name (default: deepracer-0)
+* In Compose mode this is done through adding a project name.
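As a sketch of the feature, a second experiment could be driven by a copy of `run.env` (the file name and values below are hypothetical examples):

```shell
# run-2.env -- hypothetical configuration for a second, parallel experiment
DR_RUN_ID=2                               # containers get the prefix deepracer-2
DR_LOCAL_S3_MODEL_PREFIX=rl-deepracer-2   # hypothetical model prefix
DR_WORLD_NAME=reinvent_base               # hypothetical track choice
```

Loaded in its own shell via `source bin/activate.sh run-2.env`, the `dr-*` commands in that shell then target this experiment.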
+
+## Suggested way to use the feature
+
+By default `run.env` is loaded when DRfC is activated - but it is possible to load a separate configuration through `source bin/activate.sh <run-file>`.
+
+The best way to use this feature is to have one bash shell per experiment, and to load a separate configuration per shell.
+
+After activating, one can control each experiment independently using the `dr-*` commands.
+
+If using local or Azure, the S3 / Minio instance will be shared, and is running only once.
\ No newline at end of file
diff --git a/docs/multi_worker.md b/docs/multi_worker.md
new file mode 100644
index 00000000..23f09094
--- /dev/null
+++ b/docs/multi_worker.md
@@ -0,0 +1,25 @@
+# Using multiple Robomaker workers
+
+One way to accelerate training is to launch multiple Robomaker workers that feed into one Sagemaker instance.
+
+The number of workers is configured by setting `DR_WORKERS` in `system.env` to the desired number of workers. The result is that the number of episodes (hyperparameter `num_episodes_between_training`) will be divided over the number of workers. The theoretical maximum number of workers equals `num_episodes_between_training`.
+
+The training can be started as normal.
+
+## How many workers do I need?
+
+One Robomaker worker requires 2-4 vCPUs. Tests show that a `c5.4xlarge` instance can run 3 workers plus Sagemaker without a drop in performance. Using OpenGL images reduces the number of vCPUs required per worker.
+
+To avoid issues with the position from which evaluations are run, ensure that `(num_episodes_between_training / DR_WORKERS) * DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST = 1.0`.
+
+Example: With 3 workers set `num_episodes_between_training: 30` and `DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST=0.1`.
+
+Note: Sagemaker will stop collecting experiences once you have reached 10,000 steps (3-layer CNN) in an iteration.
For longer tracks with 600-1000 steps per completed episode this will define the upper bound for the number of workers and episodes per iteration. + +## Training with different parameters for each worker + +It is also possible to use different configurations between workers, such as different tracks (WORLD_NAME). To enable, set DR_TRAIN_MULTI_CONFIG=True inside run.env, then make copies of defaults/template-worker.env in the main deepracer-for-cloud directory with format worker-2.env, worker-3.env, etc. (So alongside run.env, you should have worker-2.env, worker-3.env, etc. run.env is still used for worker 1.) Modify the worker env files with your desired changes, which can be more than just the world_name. These additional worker env files are only used if you are training with multiple workers. + +## Watching the streams + +If you want to watch the streams and are in `compose` mode, you can use the script `utils/start-local-browser.sh` to dynamically create an HTML page that streams the KVS stream from all workers at once. diff --git a/docs/opengl.md b/docs/opengl.md new file mode 100644 index 00000000..6f3a94c2 --- /dev/null +++ b/docs/opengl.md @@ -0,0 +1,62 @@ +# GPU Accelerated OpenGL for Robomaker + +One way to improve performance, especially of Robomaker, is to enable GPU-accelerated OpenGL. OpenGL can significantly improve Gazebo performance, even where the GPU does not have enough GPU RAM, or is too old, to support Tensorflow. + +## Desktop + +On an Ubuntu desktop running Unity there are hardly any additional steps required. + +* Ensure that a recent Nvidia driver is installed and running. +* Ensure that nvidia-docker is installed; review `bin/prepare.sh` for the steps if you do not want to run the script directly. +* Configure DRfC using the following settings in `system.env`: + * `DR_HOST_X=True`; uses the local X server rather than starting one within the docker container.
+ + * `DR_DISPLAY`; set to the value of your running X server; if not set then `DISPLAY` will be used. + +Before running `dr-start-training`/`dr-start-evaluation` ensure that `DR_DISPLAY`/`DISPLAY` and `XAUTHORITY` are defined. + +Check that OpenGL is working by looking for `gzserver` in `nvidia-smi`. + +If `DR_GUI_ENABLE=True` then the Gazebo UI, rviz and rqt will open up in separate windows. (With multiple workers it can get crowded...) + +### Remote connection to Desktop + +If you want to start training or evaluation via SSH (e.g. to increment the training whilst you are on the go) there are a few steps to take: +* Ensure that you are actually logged in to the local machine (a desktop session is running). +* In the SSH terminal: + * Ensure `DR_DISPLAY` is configured in `system.env`. Otherwise run `export DISPLAY=:1`. [*] + * Run `export XAUTHORITY=/run/user/$(id -u)/gdm/Xauthority` to let X know where the X magic cookie is. + * Run `source bin/activate.sh` as normal. + * Run your `dr-start-training` or `dr-start-evaluation` command. + +*Remark*: Setting `DISPLAY` will lead to certain commands (e.g. `dr-logs-sagemaker`) starting in a terminal window on the desktop, rather than the output being shown in the SSH terminal. +Use of `DR_DISPLAY` is recommended to avoid this. + +## Headless Server + +OpenGL acceleration also works on a headless server with a GPU, e.g. an EC2 instance, or a local computer with a display-less GPU (e.g. Tesla K40, K80, M40). + +This also applies to a desktop computer where you are not logged in. In that case also disconnect any monitor cables to avoid conflicts. + +* Ensure that an Nvidia driver and nvidia-docker are installed; review `bin/prepare.sh` for the steps if you do not want to run the script directly. +* Set up an X server on the host. `utils/setup-xorg.sh` is a basic installation script. +* Configure DRfC using the following settings in `system.env`: + * `DR_HOST_X=True`; uses the local X server rather than starting one within the docker container.
+ + * `DR_DISPLAY`; the X display that the headless X server will start on. (Default is `:99`; avoid using `:0` or `:1` as they may conflict with other X servers.) + +Start up the X server with `utils/start-xorg.sh`. + +If `DR_GUI_ENABLE=True` then a VNC server will be started on port 5900 so that you can connect and interact with the Gazebo UI. + +Check that OpenGL is working by looking for `gzserver` in `nvidia-smi`. + +## WSL2 on Windows 11 + +OpenGL is also supported in WSL2 on Windows 11. By default an Xwayland server is started in Ubuntu 22.04. + +To enable OpenGL acceleration perform the following steps: +* Install x11-xserver-utils with `sudo apt install x11-xserver-utils`. +* Configure DRfC using the following settings in `system.env`: + * `DR_HOST_X=True`; uses the local X server rather than starting one within the docker container. + * `DR_DISPLAY=:0`; Xwayland starts on :0 by default. + +If you want to interact with the Gazebo UI, set `DR_DOCKER_STYLE=compose` and `DR_GUI_ENABLE=True` in `system.env`. diff --git a/docs/reference.md b/docs/reference.md new file mode 100644 index 00000000..93318f7e --- /dev/null +++ b/docs/reference.md @@ -0,0 +1,101 @@ +# Deepracer-for-Cloud Reference + +## Environment Variables + +The scripts assume that two files - `system.env` containing constant configuration values and `run.env` with run-specific values - are populated with the required values. Which values go into which file is not really important. + +| Variable | Description | +|----------|-------------| +| `DR_RUN_ID` | Used if you have multiple independent training jobs on a single DRfC instance.
This is an advanced configuration and generally you should just leave this as the default `0`.| +| `DR_WORLD_NAME` | Defines the track to be used.| +| `DR_RACE_TYPE` | Valid options are `TIME_TRIAL`, `OBJECT_AVOIDANCE`, and `HEAD_TO_BOT`.| +| `DR_CAR_COLOR` | Valid options are `Black`, `Grey`, `Blue`, `Red`, `Orange`, `White`, and `Purple`.| +| `DR_CAR_NAME` | Display name of car; shows in the Deepracer Console when uploading.| +| `DR_ENABLE_DOMAIN_RANDOMIZATION` | If `True`, this cycles through different environment colors and lighting each episode. This is typically used to make your model more robust and generalized instead of tightly aligned with the simulator.| +| `DR_UPLOAD_S3_PREFIX` | Prefix of the target location. (Typically starts with `DeepRacer-SageMaker-RoboMaker-comm-`.)| +| `DR_EVAL_NUMBER_OF_TRIALS` | How many laps to complete for evaluation simulations.| +| `DR_EVAL_IS_CONTINUOUS` | If False, your evaluation trial will end if your car goes off track or is in a collision. If True, your car will take the penalty times as configured in those parameters, but continue evaluating the trial.| +| `DR_EVAL_OFF_TRACK_PENALTY` | Number of seconds penalty time added for going off track during evaluation. Only takes effect if `DR_EVAL_IS_CONTINUOUS` is set to True.| +| `DR_EVAL_COLLISION_PENALTY` | Number of seconds penalty time added for a collision during evaluation. Only takes effect if `DR_EVAL_IS_CONTINUOUS` is set to True.| +| `DR_EVAL_SAVE_MP4` | Set to `True` to save an MP4 of an evaluation run. | +| `DR_EVAL_REVERSE_DIRECTION` | Set to `True` to reverse the direction in which the car traverses the track.| +| `DR_TRAIN_CHANGE_START_POSITION` | Determines if the racer shall round-robin the starting position during training sessions. (Recommended to be `True` for initial training.)| +| `DR_TRAIN_ALTERNATE_DRIVING_DIRECTION` | `True` or `False`.
If `True`, the car will alternate driving between clockwise and counter-clockwise each episode.| +| `DR_TRAIN_START_POSITION_OFFSET` | Used to control where on the track the training starts in the first episode.| +| `DR_TRAIN_ROUND_ROBIN_ADVANCE_DISTANCE` | How far to progress each episode in round robin. 0.05 is 5% of the track. Generally best to keep this to numbers that divide evenly over your total number of episodes to allow for even distribution around the track. For example, with 20 episodes per iteration, .05, .10 or .20 would be good.| +| `DR_TRAIN_MULTI_CONFIG` | `True` or `False`. This is used if you want to use a different run.env configuration for each worker in a multi-worker training run. See the multi config documentation for more details on how to set this up.| +| `DR_TRAIN_MIN_EVAL_TRIALS` | The minimum number of evaluation trials run between each training iteration. Evaluations will continue as long as policy training is occurring and may be more than this number. This establishes the minimum, and is generally useful if you want to speed up training, especially when using GPU sagemaker containers.| +| `DR_TRAIN_REVERSE_DIRECTION` | Set to `True` to reverse the direction in which the car traverses the track. | +| `DR_TRAIN_BEST_MODEL_METRIC` | Can be used to control which model is kept as the "best" model. Set to `progress` to select the model with the highest evaluation completion percentage, or to `reward` to select the model with the highest evaluation reward.| +| `DR_TRAIN_MAX_STEPS_PER_ITERATION` | Can be used to control the max number of steps per iteration to use for learning; the excess steps will be discarded to avoid out-of-memory situations. Default is 10000.
| +| `DR_LOCAL_S3_PRETRAINED` | Determines if training or evaluation shall be based on the model created in a previous session, held in `s3://{DR_LOCAL_S3_BUCKET}/{LOCAL_S3_PRETRAINED_PREFIX}`, accessible by credentials held in profile `{DR_LOCAL_S3_PROFILE}`.| +| `DR_LOCAL_S3_PRETRAINED_PREFIX` | Prefix of pretrained model within S3 bucket.| +| `DR_LOCAL_S3_MODEL_PREFIX` | Prefix of model within S3 bucket.| +| `DR_LOCAL_S3_BUCKET` | Name of S3 bucket which will be used during the session.| +| `DR_LOCAL_S3_CUSTOM_FILES_PREFIX` | Prefix of configuration files within S3 bucket.| +| `DR_LOCAL_S3_TRAINING_PARAMS_FILE` | Name of YAML file that holds parameters sent to robomaker container for configuration during training. Filename is relative to `s3://{DR_LOCAL_S3_BUCKET}/{LOCAL_S3_PRETRAINED_PREFIX}`.| +| `DR_LOCAL_S3_EVAL_PARAMS_FILE` | Name of YAML file that holds parameters sent to robomaker container for configuration during evaluations. Filename is relative to `s3://{DR_LOCAL_S3_BUCKET}/{LOCAL_S3_PRETRAINED_PREFIX}`.| +| `DR_LOCAL_S3_MODEL_METADATA_KEY` | Location where the `model_metadata.json` file is stored.| +| `DR_LOCAL_S3_HYPERPARAMETERS_KEY` | Location where the `hyperparameters.json` file is stored.| +| `DR_LOCAL_S3_REWARD_KEY` | Location where the `reward_function.py` file is stored.| +| `DR_LOCAL_S3_METRICS_PREFIX` | Location where the metrics will be stored.| +| `DR_OA_NUMBER_OF_OBSTACLES` | For Object Avoidance, the number of obstacles on the track.| +| `DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES` | Minimum distance in meters between obstacles.| +| `DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS` | If True, obstacle locations will randomly change after each episode.| +| `DR_OA_IS_OBSTACLE_BOT_CAR` | If True, obstacles will appear as a stationary car instead of a box.| +| `DR_OA_OBJECT_POSITIONS` | Positions of boxes on the track. Tuples consisting of progress (fraction [0..1]) and inside or outside lane (-1 or 1). 
Example: `"0.23,-1;0.46,1"`| +| `DR_H2B_IS_LANE_CHANGE` | If True, bot cars will change lanes based on configuration.| +| `DR_H2B_LOWER_LANE_CHANGE_TIME` | Minimum time in seconds before a bot car will change lanes.| +| `DR_H2B_UPPER_LANE_CHANGE_TIME` | Maximum time in seconds before a bot car will change lanes.| +| `DR_H2B_LANE_CHANGE_DISTANCE` | Distance in meters over which a bot car will change lanes.| +| `DR_H2B_NUMBER_OF_BOT_CARS` | Number of bot cars on the track.| +| `DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS` | Minimum distance between bot cars.| +| `DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS` | If True, bot car locations will randomly change after each episode.| +| `DR_H2B_BOT_CAR_SPEED` | How fast the bot cars go in meters per second.| +| `DR_CLOUD` | Can be `azure`, `aws`, `local` or `remote`; determines how the storage will be configured.| +| `DR_AWS_APP_REGION` | (AWS only) Region for other AWS resources (e.g. Kinesis) | +| `DR_UPLOAD_S3_PROFILE` | AWS CLI profile to be used that holds the 'real' S3 credentials needed to upload a model into AWS DeepRacer.| +| `DR_UPLOAD_S3_BUCKET` | Name of the AWS DeepRacer bucket where models will be uploaded. (Typically starts with `aws-deepracer-`.)| +| `DR_LOCAL_S3_PROFILE` | Name of AWS profile with credentials to be used. Stored in `~/.aws/credentials` unless AWS IAM Roles are used.| +| `DR_GUI_ENABLE` | Enable or disable the Gazebo GUI in Robomaker | +| `DR_KINESIS_STREAM_NAME` | Kinesis stream name. Used if you actually publish to the AWS KVS service. Leave blank if you do not want this. | +| `DR_KINESIS_STREAM_ENABLE` | Enable or disable the 'Kinesis Stream'. True both publishes to an AWS KVS stream (if the name is not None), and to the topic `/racecar/deepracer/kvs_stream`. Leave True if you want to watch the car racing.
| +| `DR_SAGEMAKER_IMAGE` | Determines which sagemaker image will be used for training.| +| `DR_ROBOMAKER_IMAGE` | Determines which robomaker image will be used for training or evaluation.| +| `DR_MINIO_IMAGE` | Determines which Minio image will be used. | +| `DR_COACH_IMAGE` | Determines which coach image will be used for training.| +| `DR_WORKERS` | Number of Robomaker workers to be used for training. See the additional documentation for more information about this feature.| +| `DR_ROBOMAKER_MOUNT_LOGS` | True to get logs mounted to `$DR_DIR/data/logs/robomaker/$DR_LOCAL_S3_MODEL_PREFIX`| +| `DR_ROBOMAKER_MOUNT_SIMAPP_DIR` | Path to the altered Robomaker bundle, e.g. `/home/ubuntu/deepracer-simapp/bundle`.| +| `DR_CLOUD_WATCH_ENABLE` | Send log files to AWS CloudWatch.| +| `DR_CLOUD_WATCH_LOG_STREAM_PREFIX` | Add a prefix to the CloudWatch log stream name.| +| `DR_DOCKER_STYLE` | Valid options are `Swarm` and `Compose`. Use Compose for OpenGL-optimized containers.| +| `DR_HOST_X` | Uses the host X-windows server, rather than starting one inside of Robomaker. Required for OpenGL images.| +| `DR_WEBVIEWER_PORT` | Port for the web-viewer proxy which enables the streaming of all robomaker workers at once.| +| `CUDA_VISIBLE_DEVICES` | Used in multi-GPU configurations. See the additional documentation for more information about this feature.| +| `DR_TELEGRAF_HOST` | The hostname to send real-time metrics to. Uncommenting this will enable real-time metrics collection using Telegraf. The telegraf/influxdb/grafana compose stack must already be running (use `dr-start-metrics`) for this to work, and it should usually be set to `telegraf` to send metrics to the telegraf container.| +| `DR_TELEGRAF_PORT` | Defines the UDP port to send real-time metrics to. Should usually remain set as 8092.|
+ +## Commands + +| Command | Description | +|---------|-------------| +| `dr-update` | Loads in all scripts and environment variables again.| +| `dr-update-env` | Loads in all environment variables from `system.env` and `run.env`.| +| `dr-upload-custom-files` | Uploads changed configuration files from `custom_files/` into `s3://{DR_LOCAL_S3_BUCKET}/custom_files`.| +| `dr-download-custom-files` | Downloads changed configuration files from `s3://{DR_LOCAL_S3_BUCKET}/custom_files` into `custom_files/`.| +| `dr-start-training` | Starts a training session in the local VM based on the current configuration.| +| `dr-increment-training` | Updates the configuration, setting the current model prefix to pretrained, and incrementing a serial.| +| `dr-stop-training` | Stops the current local training session. Uploads log files.| +| `dr-start-evaluation` | Starts an evaluation session in the local VM based on the current configuration.| +| `dr-stop-evaluation` | Stops the current local evaluation session. Uploads log files.| +| `dr-start-loganalysis` | Starts a Jupyter log-analysis container, available on port 8888.| +| `dr-stop-loganalysis` | Stops the Jupyter log-analysis container.| +| `dr-start-viewer` | Starts an NGINX proxy to stream all the robomaker streams; accessible remotely.| +| `dr-stop-viewer` | Stops the NGINX proxy.| +| `dr-logs-sagemaker` | Displays the logs from the running Sagemaker container.| +| `dr-logs-robomaker` | Displays the logs from the running Robomaker container.| +| `dr-list-aws-models` | Lists the models that are currently stored in your AWS DeepRacer S3 bucket. | +| `dr-set-upload-model` | Updates the `run.env` with the prefix and name of your selected model. | +| `dr-upload-model` | Uploads the model defined in `DR_LOCAL_S3_MODEL_PREFIX` to the AWS DeepRacer S3 prefix defined in `DR_UPLOAD_S3_PREFIX` | +| `dr-download-model` | Downloads a file from a 'real' S3 location into a local prefix of choice.
| diff --git a/docs/upload.md b/docs/upload.md new file mode 100644 index 00000000..07a2a657 --- /dev/null +++ b/docs/upload.md @@ -0,0 +1,42 @@ +# Upload Model to AWS Console + +At the end of July 2020 the AWS DeepRacer Console was re-designed, which changed the way +that models need to be uploaded to enable them to be evaluated or submitted to the AWS hosted Summit or Virtual League events. + +## Create Upload Bucket + +The recommendation is to create a unique bucket in `us-east-1` which is used as 'transit' between your training bucket, whether local or in an AWS region close to your EC2 instances. + +The bucket needs to be defined so that 'Objects can be public'; AWS will create a specific IAM policy to access the data in your bucket as part of the import. + +## Configure Upload Bucket + +In `system.env` set `DR_UPLOAD_S3_BUCKET` to the name of your created bucket. + +In `run.env` set `DR_UPLOAD_S3_PREFIX` to any prefix of your choice. + +## Upload Model + +After configuring the system you can run `dr-upload-model`; it will copy the required parts of `s3://DR_LOCAL_S3_BUCKET/DR_LOCAL_S3_PREFIX` into `s3://DR_UPLOAD_S3_BUCKET/DR_UPLOAD_S3_PREFIX`. + +Once uploaded you can use the [Import model](https://console.aws.amazon.com/deepracer/home?region=us-east-1#models/importModel) feature of the AWS DeepRacer console to load the model into the model store. + +## Things to know + +### Upload switches +There are several useful switches to the upload command: + * f - force upload; no confirmation question asking if you want to proceed with the upload + * w - wipes the target AWS DeepRacer model structure in the designated bucket/prefix before upload + * d - dry-run mode; does not perform any write or delete operations on the target + * b - uploads the best checkpoint instead of the default, which is the last checkpoint + * p prefix - uploads the model into the specified S3 prefix + +### Import +As the AWS service is no longer available, import is no longer possible.
Upload now merely serves to upload to a different S3 bucket. + +### Managing your models +You should decide how you're going to manage your models. Upload to AWS does not preserve all the files created locally, so if you delete your local files you will find it hard to go back to a previous model and resume training. + +### Create file formatted for physical car, and upload to S3 +You can also create the file in the format necessary to run on the physical car directly from DRfC, without going through the AWS console. +This is executed by running `dr-upload-car-zip`; it will copy files out of the running sagemaker container, format them into the proper .tar.gz file, and upload that file to `s3://DR_LOCAL_S3_BUCKET/DR_LOCAL_S3_PREFIX`. One limitation of this approach is that it only uses the latest checkpoint, and does not have the option to use the "best" checkpoint, or an earlier checkpoint. Another limitation is that the sagemaker container must be running at the time this command is executed. diff --git a/docs/video.md b/docs/video.md new file mode 100644 index 00000000..c4a1f882 --- /dev/null +++ b/docs/video.md @@ -0,0 +1,36 @@ +# Watching the car + +There are multiple ways to watch the car during training and evaluation. The ports and 'features' differ between the docker modes (swarm vs. compose) as well as between training and evaluation. + +## Training using Viewer + +DRfC has a built-in viewer that supports showing the video stream from up to 6 workers on one webpage. + +The viewer can be started with `dr-start-viewer` and is available on `http://localhost:8100` or `http://127.0.0.1:8100`. The viewer must be updated with `dr-update-viewer` if training is restarted, as it needs to connect to the new containers. + +It is also possible to automatically start/update the viewer using the `-v` flag to `dr-start-training`. + +## ROS Stream Viewer + +The ROS Stream Viewer is a built-in ROS feature that will stream any ROS topic that is publishing ROSImg messages.
The viewer starts automatically. + +### Ports + +| Docker Mode | Training | Evaluation | Comment +| -------- | -------- | -------- | -------- | +| swarm | 8080 + `DR_RUN_ID` | 8180 + `DR_RUN_ID` | Default 8080/8180. Multiple workers share one port, press F5 to cycle between them. +| compose | 8080-8089 | 8080-8089 | Each worker gets a unique port. + +### Topics + +| Topic | Description | +| -------- | -------- | +| `/racecar/camera/zed/rgb/image_rect_color` | In-car video stream. This is used for inference. | +| `/racecar/main_camera/zed/rgb/image_rect_color` | Camera following the car. Stream without overlay | +| `/sub_camera/zed/rgb/image_rect_color` | Top-view of the track | +| `/racecar/deepracer/kvs_stream` | Camera following the car. Stream with overlay. Different overlay in Training and Evaluation | +| `/racecar/deepracer/main_camera_stream` | Same as `kvs_stream`, topic used for MP4 production. Only active in Evaluation if `DR_EVAL_SAVE_MP4=True` | + +## Saving Evaluation to File + +During evaluation (`dr-start-evaluation`), if `DR_EVAL_SAVE_MP4=True` then three MP4 files are created in the S3 bucket's MP4 folder. They contain the in-car camera, the top camera and the camera following the car. \ No newline at end of file diff --git a/docs/windows.md b/docs/windows.md new file mode 100644 index 00000000..f6da6ae1 --- /dev/null +++ b/docs/windows.md @@ -0,0 +1,76 @@ +# Installing on Windows + +## Prerequisites + +The basic installation steps to get an NVIDIA GPU / CUDA enabled Ubuntu subsystem on Windows can be found in the [CUDA on WSL User Guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html). Ensure your Windows installation has an updated [NVIDIA CUDA-enabled driver](https://developer.nvidia.com/cuda/wsl/download) that will work with WSL. + +The further instructions assume that you have a basic working WSL setup using the default Ubuntu distribution.
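As a quick sanity check of the WSL setup before proceeding (a sketch; the fallback message is illustrative and output depends on your driver and GPU):

```shell
# Verify GPU passthrough into WSL before installing DRfC.
# If nvidia-smi is missing or fails here, recheck the Windows-side driver.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name --format=csv,noheader
else
    echo "nvidia-smi not found - recheck the Windows NVIDIA driver"
fi
```

If this prints your GPU name, the CUDA driver is reachable from the WSL distribution and you can continue with the steps below.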
+ + +## Additional steps + +The typical `bin/prepare.sh` script will not work for an Ubuntu WSL installation, so alternative steps are required. + +### Adding required packages + +Install additional packages with the following command: + +``` +sudo apt-get install jq awscli python3-boto3 docker-compose +``` + +### Install and configure docker and nvidia-docker +``` +curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - +sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" +sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce docker-ce-cli containerd.io + +distribution=$(. /etc/os-release;echo $ID$VERSION_ID) +curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - +curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list +sudo apt-get update && sudo apt-get install -y nvidia-docker2 + +cat /etc/docker/daemon.json | jq 'del(."default-runtime") + {"default-runtime": "nvidia"}' | sudo tee /etc/docker/daemon.json +sudo usermod -a -G docker $(id -un) +``` + + +### Install DRfC + +You can now run `bin/init.sh -a gpu -c local` to set up DRfC, and then follow the typical DRfC startup instructions. + +## Known Issues + +* `init.sh` is not able to detect the GPU given differences between the Nvidia drivers and the WSL2 Linux kernel. You need to manually set the GPU image in `system.env`. +* Docker does not start automatically when you launch Ubuntu. Start it manually with `sudo service docker start`. + + You can also configure the service to start automatically using the Windows Task Scheduler: + + *1)* Create a new file at /etc/init-wsl (sudo vi /etc/init-wsl) with the following contents.
+ + ``` + #!/bin/sh + service docker start + ``` + + *2)* Make the script executable: `sudo chmod +x /etc/init-wsl` + + *3)* Open Task Scheduler in Windows 10 + + - On the left, click the **Task Scheduler Library** option, and then on the right, click **Create Task** + + - In the **General** tab, enter the name **WSL Startup**, and select the **Run whether user is logged on or not** and **Run with highest privileges** options. + + - In the **Trigger** tab, click New ... > Begin the task: **At startup** > OK + + - In the **Actions** tab, click New ... > Action: **Start a program** + + program/script: **wsl** + + add arguments: **-u root /etc/init-wsl** + + - Click OK to exit + + *4)* You can run the task manually to confirm, or after a Windows reboot Docker should now start automatically. + +* Video streams may not load using the localhost address. To access the HTML video streams from your Windows browser, you may need to use the IP address of the WSL VM. From a WSL terminal, determine your IP address with the command `ip addr` and look for **eth0** then **inet** (e.g. ip = 172.29.38.21). Then from your Windows browser (Edge, Chrome, etc.) navigate to **ip:8080** (e.g.
172.29.38.21:8080) + diff --git a/init.sh b/init.sh deleted file mode 100755 index 2a63272e..00000000 --- a/init.sh +++ /dev/null @@ -1,38 +0,0 @@ -#!/usr/bin/env bash - -# create directory structure for docker volumes -mkdir -p docker/volumes/minio/bucket/custom_files \ - docker/volumes/robo/checkpoint - -# create symlink to current user's home .aws directory -# NOTE: AWS cli must be installed for this to work -# https://docs.aws.amazon.com/cli/latest/userguide/install-linux-al2017.html -ln -s $(eval echo "~${USER}")/.aws docker/volumes/ - -# grab local training deepracer repo from crr0004 and log analysis repo from vreadcentric -git clone --recurse-submodules https://github.com/crr0004/deepracer.git - -git clone https://github.com/breadcentric/aws-deepracer-workshops.git && cd aws-deepracer-workshops && git checkout enhance-log-analysis && cd .. - -ln -s ../../aws-deepracer-workshops/log-analysis ./docker/volumes/log-analysis - -# setup symlink to rl-coach config file -ln -s deepracer/rl_coach/rl_deepracer_coach_robomaker.py rl_deepracer_coach_robomaker.py - -# replace the contents of the rl_deepracer_coach_robomaker.py file with the gpu specific version (this is also where you can edit the hyperparameters) -# TODO this file should be genrated from a gui before running training -cat overrides/rl_deepracer_coach_robomaker.py > rl_deepracer_coach_robomaker.py - -# build rl-coach image with latest code from crr0004's repo -docker build -f ./docker/dockerfiles/rl_coach/Dockerfile -t aschu/rl_coach deepracer/ - -# copy reward function and model-metadata files to bucket -cp deepracer/custom_files/* docker/volumes/minio/bucket/custom_files/ - -# create the network sagemaker-local if it doesn't exit -SAGEMAKER_NW='sagemaker-local' -docker network ls | grep -q $SAGEMAKER_NW -if [ $? 
-ne 0 ] -then - docker network create $SAGEMAKER_NW -fi diff --git a/overrides/rl_deepracer_coach_robomaker.py b/overrides/rl_deepracer_coach_robomaker.py deleted file mode 100644 index 9c96c7c6..00000000 --- a/overrides/rl_deepracer_coach_robomaker.py +++ /dev/null @@ -1,140 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - - -import sagemaker -import boto3 -import sys -import os -import glob -import re -import subprocess -from IPython.display import Markdown -from time import gmtime, strftime -sys.path.append("common") -from misc import get_execution_role, wait_for_s3_object -from sagemaker.rl import RLEstimator, RLToolkit, RLFramework -from markdown_helper import * - - - -# S3 bucket -boto_session = boto3.session.Session( - aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID", "minio"), - aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "miniokey"), - region_name=os.environ.get("AWS_REGION", "us-east-1")) -s3Client = boto_session.resource("s3", use_ssl=False, -endpoint_url=os.environ.get("S3_ENDPOINT_URL", "http://127.0.0.1:9000")) - -sage_session = sagemaker.local.LocalSession(boto_session=boto_session, s3_client=s3Client) -s3_bucket = os.environ.get("MODEL_S3_BUCKET", "bucket") #sage_session.default_bucket() -s3_output_path = 's3://{}/'.format(s3_bucket) # SDK appends the job name and output folder - -# ### Define Variables - -# We define variables such as the job prefix for the training jobs and s3_prefix for storing metadata required for synchronization between the training and simulation jobs - - -job_name_prefix = 'rl-deepracer' # this should be MODEL_S3_PREFIX, but that already ends with "-sagemaker" - -# create unique job name -tm = gmtime() -job_name = s3_prefix = job_name_prefix + "-sagemaker"#-" + strftime("%y%m%d-%H%M%S", tm) #Ensure S3 prefix contains SageMaker -s3_prefix_robomaker = job_name_prefix + "-robomaker"#-" + strftime("%y%m%d-%H%M%S", tm) #Ensure that the S3 prefix contains the keyword 'robomaker' - - -# Duration of job in 
seconds (5 hours) -job_duration_in_seconds = 24 * 60 * 60 - -aws_region = sage_session.boto_region_name - -if aws_region not in ["us-west-2", "us-east-1", "eu-west-1"]: - raise Exception("This notebook uses RoboMaker which is available only in US East (N. Virginia), US West (Oregon) and EU (Ireland). Please switch to one of these regions.") -print("Model checkpoints and other metadata will be stored at: {}{}".format(s3_output_path, job_name)) - - -s3_location = "s3://%s/%s" % (s3_bucket, s3_prefix) -print("Uploading to " + s3_location) - - -metric_definitions = [ - # Training> Name=main_level/agent, Worker=0, Episode=19, Total reward=-102.88, Steps=19019, Training iteration=1 - {'Name': 'reward-training', - 'Regex': '^Training>.*Total reward=(.*?),'}, - - # Policy training> Surrogate loss=-0.32664725184440613, KL divergence=7.255815035023261e-06, Entropy=2.83156156539917, training epoch=0, learning_rate=0.00025 - {'Name': 'ppo-surrogate-loss', - 'Regex': '^Policy training>.*Surrogate loss=(.*?),'}, - {'Name': 'ppo-entropy', - 'Regex': '^Policy training>.*Entropy=(.*?),'}, - - # Testing> Name=main_level/agent, Worker=0, Episode=19, Total reward=1359.12, Steps=20015, Training iteration=2 - {'Name': 'reward-testing', - 'Regex': '^Testing>.*Total reward=(.*?),'}, -] - - -# We use the RLEstimator for training RL jobs. -# -# 1. Specify the source directory which has the environment file, preset and training code. -# 2. Specify the entry point as the training code -# 3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container. -# 4. Define the training parameters such as the instance count, instance type, job name, s3_bucket and s3_prefix for storing model checkpoints and metadata. **Only 1 training instance is supported for now.** -# 4. Set the RLCOACH_PRESET as "deepracer" for this example. -# 5. Define the metrics definitions that you are interested in capturing in your logs. 
These can also be visualized in CloudWatch and SageMaker Notebooks. - -# In[ ]: - - -RLCOACH_PRESET = "deepracer" - -gpu_available = os.environ.get("GPU_AVAILABLE", False) -# 'local' for cpu, 'local_gpu' for nvidia gpu (and then you don't have to set default runtime to nvidia) -instance_type = "local_gpu" if gpu_available else "local" -image_name = "crr0004/sagemaker-rl-tensorflow:{}".format( - "nvidia" if gpu_available else "console") - -estimator = RLEstimator(entry_point="training_worker.py", - source_dir='src', - dependencies=["common/sagemaker_rl"], - toolkit=RLToolkit.COACH, - toolkit_version='0.11', - framework=RLFramework.TENSORFLOW, - sagemaker_session=sage_session, - #bypass sagemaker SDK validation of the role - role="aaa/", - train_instance_type=instance_type, - train_instance_count=1, - output_path=s3_output_path, - base_job_name=job_name_prefix, - image_name=image_name, - train_max_run=job_duration_in_seconds, # Maximum runtime in seconds - hyperparameters={"s3_bucket": s3_bucket, - "s3_prefix": s3_prefix, - "aws_region": aws_region, - "model_metadata_s3_key": "s3://{}/custom_files/model_metadata.json".format(s3_bucket), - "RLCOACH_PRESET": RLCOACH_PRESET, - "batch_size": 64, - "num_epochs": 10, - "stack_size" : 1, - "lr" : 0.00035, - "exploration_type" : "categorical", - "e_greedy_value" : 0.05, - "epsilon_steps" : 10000, - "beta_entropy" : 0.01, - "discount_factor" : 0.999, - "loss_type": "mean squared error", - "num_episodes_between_training" : 20, - "term_cond_max_episodes" : 100000, - "term_cond_avg_score" : 100000 - #"pretrained_s3_bucket": "{}".format(s3_bucket), - #"pretrained_s3_prefix": "rl-deepracer-pretrained" - # "loss_type": "mean squared error", - }, - metric_definitions = metric_definitions, - s3_client=s3Client - #subnets=default_subnets, # Required for VPC mode - #security_group_ids=default_security_groups, # Required for VPC mode - ) - -estimator.fit(job_name=job_name, wait=False) diff --git a/scripts/evaluation/prepare-config.py 
b/scripts/evaluation/prepare-config.py new file mode 100755 index 00000000..7c8c8c85 --- /dev/null +++ b/scripts/evaluation/prepare-config.py @@ -0,0 +1,171 @@ +#!/usr/bin/python3 + +import boto3 +from datetime import datetime +import sys +import os +import time +import json +import io +import yaml + +def str2bool(v): + return v.lower() in ("yes", "true", "t", "1") + +eval_time = datetime.now().strftime('%Y%m%d%H%M%S') + +config = {} +config['CAR_COLOR'] = [] +config['BODY_SHELL_TYPE'] = [] +config['RACER_NAME'] = [] +config['DISPLAY_NAME'] = [] +config['MODEL_S3_PREFIX'] = [] +config['MODEL_S3_BUCKET'] = [] +config['SIMTRACE_S3_PREFIX'] = [] +config['SIMTRACE_S3_BUCKET'] = [] +config['KINESIS_VIDEO_STREAM_NAME'] = [] +config['METRICS_S3_BUCKET'] = [] +config['METRICS_S3_OBJECT_KEY'] = [] +config['MP4_S3_BUCKET'] = [] +config['MP4_S3_OBJECT_PREFIX'] = [] + +# Basic configuration; including all buckets etc. +config['AWS_REGION'] = os.environ.get('DR_AWS_APP_REGION', 'us-east-1') +config['JOB_TYPE'] = 'EVALUATION' +config['KINESIS_VIDEO_STREAM_NAME'] = os.environ.get('DR_KINESIS_STREAM_NAME', '') +config['ROBOMAKER_SIMULATION_JOB_ACCOUNT_ID'] = os.environ.get('', 'Dummy') + +s3_container_endpoint_url = os.environ.get('DR_MINIO_URL', None) +if s3_container_endpoint_url is not None: + config['S3_ENDPOINT_URL'] = s3_container_endpoint_url + +config['MODEL_S3_PREFIX'].append(os.environ.get('DR_LOCAL_S3_MODEL_PREFIX', 'rl-deepracer-sagemaker')) +config['MODEL_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) +config['SIMTRACE_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) +config['SIMTRACE_S3_PREFIX'].append( + '{}/evaluation-{}'.format(os.environ.get('DR_LOCAL_S3_MODEL_PREFIX', 'rl-deepracer-sagemaker'), eval_time) +) + +# Metrics +config['METRICS_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) +metrics_prefix = os.environ.get('DR_LOCAL_S3_METRICS_PREFIX', None) +if metrics_prefix is not None: + 
config['METRICS_S3_OBJECT_KEY'].append('{}/evaluation/evaluation-{}.json'.format(metrics_prefix, eval_time)) +else: + config['METRICS_S3_OBJECT_KEY'].append('DeepRacer-Metrics/EvaluationMetrics-{}.json'.format(eval_time)) + +# MP4 configuration / save +save_mp4 = str2bool(os.environ.get("DR_EVAL_SAVE_MP4", "False")) +if save_mp4: + config['MP4_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) + config['MP4_S3_OBJECT_PREFIX'].append('{}/{}'.format(os.environ.get('DR_LOCAL_S3_MODEL_PREFIX', 'bucket'),'mp4')) + +# Checkpoint +config['EVAL_CHECKPOINT'] = os.environ.get('DR_EVAL_CHECKPOINT', 'last') + +# Car and training +body_shell_type = os.environ.get('DR_CAR_BODY_SHELL_TYPE', 'deepracer') +config['BODY_SHELL_TYPE'].append(body_shell_type) +config['CAR_COLOR'].append(os.environ.get('DR_CAR_COLOR', 'Red')) +config['DISPLAY_NAME'].append(os.environ.get('DR_DISPLAY_NAME', 'racer1')) +config['RACER_NAME'].append(os.environ.get('DR_RACER_NAME', 'racer1')) + +config['RACE_TYPE'] = os.environ.get('DR_RACE_TYPE', 'TIME_TRIAL') +config['WORLD_NAME'] = os.environ.get('DR_WORLD_NAME', 'LGSWide') +config['NUMBER_OF_TRIALS'] = os.environ.get('DR_EVAL_NUMBER_OF_TRIALS', '5') +config['ENABLE_DOMAIN_RANDOMIZATION'] = os.environ.get('DR_ENABLE_DOMAIN_RANDOMIZATION', 'false') +config['RESET_BEHIND_DIST'] = os.environ.get('DR_EVAL_RESET_BEHIND_DIST', '1.0') + +config['IS_CONTINUOUS'] = os.environ.get('DR_EVAL_IS_CONTINUOUS', 'True') +config['NUMBER_OF_RESETS'] = os.environ.get('DR_EVAL_MAX_RESETS', '0') + +config['OFF_TRACK_PENALTY'] = os.environ.get('DR_EVAL_OFF_TRACK_PENALTY', '5.0') +config['COLLISION_PENALTY'] = os.environ.get('DR_COLLISION_PENALTY', '5.0') + +config['CAMERA_MAIN_ENABLE'] = os.environ.get('DR_CAMERA_MAIN_ENABLE', 'True') +config['CAMERA_SUB_ENABLE'] = os.environ.get('DR_CAMERA_SUB_ENABLE', 'True') +config['REVERSE_DIR'] = os.environ.get('DR_EVAL_REVERSE_DIRECTION', False) +config['ENABLE_EXTRA_KVS_OVERLAY'] =
os.environ.get('DR_ENABLE_EXTRA_KVS_OVERLAY', 'False') + +# Object Avoidance +if config['RACE_TYPE'] == 'OBJECT_AVOIDANCE': + config['NUMBER_OF_OBSTACLES'] = os.environ.get('DR_OA_NUMBER_OF_OBSTACLES', '6') + config['MIN_DISTANCE_BETWEEN_OBSTACLES'] = os.environ.get('DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES', '2.0') + config['RANDOMIZE_OBSTACLE_LOCATIONS'] = os.environ.get('DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS', 'True') + config['IS_OBSTACLE_BOT_CAR'] = os.environ.get('DR_OA_IS_OBSTACLE_BOT_CAR', 'false') + config['OBSTACLE_TYPE'] = os.environ.get('DR_OA_OBSTACLE_TYPE', 'box_obstacle') + + object_position_str = os.environ.get('DR_OA_OBJECT_POSITIONS', "") + if object_position_str != "": + object_positions = [] + for o in object_position_str.split(";"): + object_positions.append(o) + config['OBJECT_POSITIONS'] = object_positions + config['NUMBER_OF_OBSTACLES'] = str(len(object_positions)) + +# Head to Bot +if config['RACE_TYPE'] == 'HEAD_TO_BOT': + config['IS_LANE_CHANGE'] = os.environ.get('DR_H2B_IS_LANE_CHANGE', 'False') + config['LOWER_LANE_CHANGE_TIME'] = os.environ.get('DR_H2B_LOWER_LANE_CHANGE_TIME', '3.0') + config['UPPER_LANE_CHANGE_TIME'] = os.environ.get('DR_H2B_UPPER_LANE_CHANGE_TIME', '5.0') + config['LANE_CHANGE_DISTANCE'] = os.environ.get('DR_H2B_LANE_CHANGE_DISTANCE', '1.0') + config['NUMBER_OF_BOT_CARS'] = os.environ.get('DR_H2B_NUMBER_OF_BOT_CARS', '0') + config['MIN_DISTANCE_BETWEEN_BOT_CARS'] = os.environ.get('DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS', '2.0') + config['RANDOMIZE_BOT_CAR_LOCATIONS'] = os.environ.get('DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS', 'False') + config['BOT_CAR_SPEED'] = os.environ.get('DR_H2B_BOT_CAR_SPEED', '0.2') + config['PENALTY_SECONDS'] = os.environ.get('DR_H2B_BOT_CAR_PENALTY', '2.0') + +# Head to Model +if config['RACE_TYPE'] == 'HEAD_TO_MODEL': + config['MODEL_S3_PREFIX'].append(os.environ.get('DR_EVAL_OPP_S3_MODEL_PREFIX', 'rl-deepracer-sagemaker')) + config['MODEL_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 
'bucket')) + config['SIMTRACE_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) + config['SIMTRACE_S3_PREFIX'].append(os.environ.get('DR_EVAL_OPP_S3_MODEL_PREFIX', 'rl-deepracer-sagemaker')) + + # Metrics + config['METRICS_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) + metrics_prefix = os.environ.get('DR_EVAL_OPP_S3_METRICS_PREFIX', '{}/{}'.format(os.environ.get('DR_EVAL_OPP_S3_MODEL_PREFIX', 'rl-deepracer-sagemaker'),'metrics')) + if metrics_prefix is not None: + config['METRICS_S3_OBJECT_KEY'].append('{}/EvaluationMetrics-{}.json'.format(metrics_prefix, str(round(time.time())))) + else: + config['METRICS_S3_OBJECT_KEY'].append('DeepRacer-Metrics/EvaluationMetrics-{}.json'.format(str(round(time.time())))) + + # MP4 configuration / save + save_mp4 = str2bool(os.environ.get("DR_EVAL_SAVE_MP4", "False")) + if save_mp4: + config['MP4_S3_BUCKET'].append(os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket')) + config['MP4_S3_OBJECT_PREFIX'].append('{}/{}'.format(os.environ.get('DR_EVAL_OPP_MODEL_PREFIX', 'bucket'),'mp4')) + + # Car and training + config['DISPLAY_NAME'].append(os.environ.get('DR_EVAL_OPP_DISPLAY_NAME', 'racer1')) + config['RACER_NAME'].append(os.environ.get('DR_EVAL_OPP_RACER_NAME', 'racer1')) + + body_shell_type = os.environ.get('DR_EVAL_OPP_CAR_BODY_SHELL_TYPE', 'deepracer') + config['BODY_SHELL_TYPE'].append(body_shell_type) + config['VIDEO_JOB_TYPE'] = 'EVALUATION' + config['CAR_COLOR'] = ['Purple', 'Orange'] + config['MODEL_NAME'] = config['DISPLAY_NAME'] + +# S3 Setup / write and upload file +s3_local_endpoint_url = os.environ.get('DR_LOCAL_S3_ENDPOINT_URL', None) +s3_region = config['AWS_REGION'] +s3_bucket = config['MODEL_S3_BUCKET'][0] +s3_prefix = config['MODEL_S3_PREFIX'][0] +s3_mode = os.environ.get('DR_LOCAL_S3_AUTH_MODE','profile') +if s3_mode == 'profile': + s3_profile = os.environ.get('DR_LOCAL_S3_PROFILE', 'default') +else: # mode is 'role' + s3_profile = None +s3_yaml_name =
os.environ.get('DR_LOCAL_S3_EVAL_PARAMS_FILE', 'eval_params.yaml') +yaml_key = os.path.normpath(os.path.join(s3_prefix, s3_yaml_name)) + +session = boto3.session.Session(profile_name=s3_profile) +s3_client = session.client('s3', region_name=s3_region, endpoint_url=s3_local_endpoint_url) + +yaml_key = os.path.normpath(os.path.join(s3_prefix, s3_yaml_name)) +local_yaml_path = os.path.abspath(os.path.join(os.environ.get('DR_DIR'),'tmp', 'eval-params-' + str(round(time.time())) + '.yaml')) + +with open(local_yaml_path, 'w') as yaml_file: + yaml.dump(config, yaml_file, default_flow_style=False, default_style='\'', explicit_start=True) + +s3_client.upload_file(Bucket=s3_bucket, Key=yaml_key, Filename=local_yaml_path) diff --git a/scripts/evaluation/start.sh b/scripts/evaluation/start.sh index d788af5c..21fd39e9 100755 --- a/scripts/evaluation/start.sh +++ b/scripts/evaluation/start.sh @@ -1,18 +1,121 @@ +#!/usr/bin/env bash + +source $DR_DIR/bin/scripts_wrapper.sh + +usage() { + echo "Usage: $0 [-q] [-c]" + echo " -q Quiet - does not start log tracing." + echo " -c Clone - copies model into new prefix before evaluating." + exit 1 +} + +trap ctrl_c INT + +function ctrl_c() { + echo "Requested to stop." + exit 1 +} + +while getopts ":qch" opt; do + case $opt in + q) + OPT_QUIET="QUIET" + ;; + c) + OPT_CLONE="CLONE" + ;; + h) + usage + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + usage + ;; + esac +done + +## Check if WSL2 +if grep -qi Microsoft /proc/version && grep -q "WSL2" /proc/version; then + IS_WSL2="yes" +fi + +# set evaluation specific environment variables +STACK_NAME="deepracer-eval-$DR_RUN_ID" +STACK_CONTAINERS=$(docker stack ps $STACK_NAME 2>/dev/null | wc -l) +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + if [[ "$STACK_CONTAINERS" -gt 1 ]]; then + echo "ERROR: Processes running in stack $STACK_NAME. Stop evaluation with dr-stop-evaluation." + exit 1 + fi +fi + +echo "Evaluation of model s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_MODEL_PREFIX starting."
+echo "Using image ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION}" +echo "" + +# clone if required +if [ -n "$OPT_CLONE" ]; then + echo "Cloning model into s3://$DR_LOCAL_S3_BUCKET/${DR_LOCAL_S3_MODEL_PREFIX}-E" + aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 sync s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_MODEL_PREFIX/model s3://$DR_LOCAL_S3_BUCKET/${DR_LOCAL_S3_MODEL_PREFIX}-E/model + aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 sync s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_MODEL_PREFIX/ip s3://$DR_LOCAL_S3_BUCKET/${DR_LOCAL_S3_MODEL_PREFIX}-E/ip + export DR_LOCAL_S3_MODEL_PREFIX=${DR_LOCAL_S3_MODEL_PREFIX}-E +fi + # set evaluation specific environment variables -export ROBOMAKER_COMMAND="./run.sh build evaluation.launch" -export METRICS_S3_OBJECT_KEY=custom_files/eval_metrics.json -export NUMBER_OF_TRIALS=5 +S3_PATH="s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_MODEL_PREFIX" + +export ROBOMAKER_COMMAND="/opt/ml/code/run.sh run evaluation.launch.py" +export DR_CURRENT_PARAMS_FILE=${DR_LOCAL_S3_EVAL_PARAMS_FILE} + +if [ ${DR_ROBOMAKER_MOUNT_LOGS,,} = "true" ]; then + COMPOSE_FILES="$DR_EVAL_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DR_DIR/docker/docker-compose-mount.yml" + export DR_MOUNT_DIR="$DR_DIR/data/logs/robomaker/$DR_LOCAL_S3_MODEL_PREFIX" + mkdir -p $DR_MOUNT_DIR +else + COMPOSE_FILES="$DR_EVAL_COMPOSE_FILE" +fi + +echo "Creating Robomaker configuration in $S3_PATH/$DR_CURRENT_PARAMS_FILE" +python3 $DR_DIR/scripts/evaluation/prepare-config.py + +# Check if we are using Host X -- ensure variables are populated +if [[ "${DR_HOST_X,,}" == "true" ]]; then + if [[ -n "$DR_DISPLAY" ]]; then + ROBO_DISPLAY=$DR_DISPLAY + else + ROBO_DISPLAY=$DISPLAY + fi + + if ! DISPLAY=$ROBO_DISPLAY timeout 1s xset q &>/dev/null; then + echo "No X Server running on display $ROBO_DISPLAY. Exiting" + exit 1 + fi -docker-compose -f ../../docker/docker-compose.yml up -d + if [[ -z "$XAUTHORITY" && "$IS_WSL2" != "yes" ]]; then + export XAUTHORITY=~/.Xauthority + if [[ ! -f "$XAUTHORITY" ]]; then + echo "No XAUTHORITY defined. 
.Xauthority does not exist. Stopping." + exit 1 + fi + fi +fi +# Check if we will use Docker Swarm or Docker Compose +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then -echo 'waiting for containers to start up...' + if [ "$DR_DOCKER_MAJOR_VERSION" -gt 24 ]; then + DETACH_FLAG="--detach=true" + fi -#sleep for 20 seconds to allow the containers to start -sleep 15 + DISPLAY=$ROBO_DISPLAY docker stack deploy $COMPOSE_FILES $DETACH_FLAG $STACK_NAME +else + DISPLAY=$ROBO_DISPLAY docker compose $COMPOSE_FILES -p $STACK_NAME up -d +fi -echo 'attempting to pull up sagemaker logs...' -gnome-terminal -x sh -c "!!; docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }')" +# Request to be quiet. Quitting here. +if [ -n "$OPT_QUIET" ]; then + exit 0 +fi -echo 'attempting to open vnc viewer...' -gnome-terminal -x sh -c "!!; vncviewer localhost:8080" +# Trigger requested log-file +dr-logs-robomaker -w 15 -e diff --git a/scripts/evaluation/stop.sh b/scripts/evaluation/stop.sh index bb1776fd..d4ea7a6b 100755 --- a/scripts/evaluation/stop.sh +++ b/scripts/evaluation/stop.sh @@ -1,6 +1,13 @@ #!/usr/bin/env bash -docker-compose -f ../../docker/docker-compose.yml down +STACK_NAME="deepracer-eval-$DR_RUN_ID" +RUN_NAME=${DR_LOCAL_S3_MODEL_PREFIX} -docker stop $(docker ps | awk ' /sagemaker/ { print $1 }') -docker rm $(docker ps -a | awk ' /sagemaker/ { print $1 }') \ No newline at end of file +# Check if we will use Docker Swarm or Docker Compose +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + docker stack rm $STACK_NAME +else + COMPOSE_FILES=$(echo ${DR_EVAL_COMPOSE_FILE} | cut -f1-2 -d\ ) + export DR_CURRENT_PARAMS_FILE="" + docker compose $COMPOSE_FILES -p $STACK_NAME down +fi diff --git a/scripts/log-analysis/start.sh b/scripts/log-analysis/start.sh index c0730070..e20210ba 100755 --- a/scripts/log-analysis/start.sh +++ b/scripts/log-analysis/start.sh @@ -1,7 +1,12 @@ #!/usr/bin/env bash -nvidia-docker run --rm -it -p "8888:8888" \ --v 
`pwd`/../../docker/volumes/log-analysis:/workspace/venv/data \ --v `pwd`/../../docker/volumes/.aws:/root/.aws \ --v `pwd`/../../docker/volumes/robo/checkpoint/log:/workspace/venv/logs \ - aschu/log-analysis +docker run --rm -d -p "8888:8888" \ +-v $DR_DIR/data/logs:/workspace/logs \ +-v $DR_DIR/docker/volumes/.aws:/home/ubuntu/.aws \ +-v $DR_DIR/data/analysis:/workspace/analysis \ +-v $DR_DIR/data/minio:/workspace/minio \ +--name deepracer-analysis \ +--network sagemaker-local \ + awsdeepracercommunity/deepracer-analysis:$DR_ANALYSIS_IMAGE + +docker logs -f deepracer-analysis \ No newline at end of file diff --git a/scripts/log-analysis/stop.sh b/scripts/log-analysis/stop.sh index e9c3cab3..5fd5ec58 100755 --- a/scripts/log-analysis/stop.sh +++ b/scripts/log-analysis/stop.sh @@ -1,4 +1,3 @@ #!/usr/bin/env bash -docker stop $(docker ps | awk ' /analysis/ { print $1 }') -docker rm $(docker ps -a | awk ' /analysis/ { print $1 }') +docker stop deepracer-analysis diff --git a/scripts/metrics/start.sh b/scripts/metrics/start.sh new file mode 100755 index 00000000..2baa0f00 --- /dev/null +++ b/scripts/metrics/start.sh @@ -0,0 +1,5 @@ +#!/bin/bash + +COMPOSE_FILES=./docker/docker-compose-metrics.yml + +docker compose -f $COMPOSE_FILES -p deepracer-metrics up -d \ No newline at end of file diff --git a/scripts/metrics/stop.sh b/scripts/metrics/stop.sh new file mode 100755 index 00000000..6f68de5d --- /dev/null +++ b/scripts/metrics/stop.sh @@ -0,0 +1,5 @@ +#!/bin/bash + +COMPOSE_FILES=./docker/docker-compose-metrics.yml + +docker compose -f $COMPOSE_FILES -p deepracer-metrics down \ No newline at end of file diff --git a/scripts/training/back-up-training-run.sh b/scripts/training/back-up-training-run.sh deleted file mode 100755 index f3e7c7dc..00000000 --- a/scripts/training/back-up-training-run.sh +++ /dev/null @@ -1,6 +0,0 @@ -#!/usr/bin/env bash - -BACKUP_LOC=/media/aschu/storage/deepracer-training/backup -FILENAME=$(date +%Y-%m-%d_%H-%M-%S) -tar -czvf ${FILENAME}.tar.gz 
../../docker/volumes/minio/bucket/rl-deepracer-sagemaker/* -mv ${FILENAME}.tar.gz $BACKUP_LOC \ No newline at end of file diff --git a/scripts/training/delete-last-run.sh b/scripts/training/delete-last-run.sh deleted file mode 100755 index b2f5883f..00000000 --- a/scripts/training/delete-last-run.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/usr/bin/env bash - -rm -rf ../../docker/volumes/minio/bucket/rl-deepracer-sagemaker -rm -rf ../../docker/volumes/robo/checkpoint/checkpoint -mkdir ../../docker/volumes/robo/checkpoint/checkpoint -rm -rf /robo/container/* -rm -rf ../../docker/volumes/robo/checkpoint/log/* diff --git a/scripts/training/increment.sh b/scripts/training/increment.sh new file mode 100755 index 00000000..36560373 --- /dev/null +++ b/scripts/training/increment.sh @@ -0,0 +1,94 @@ +#!/bin/bash + +usage() { + echo "Usage: $0 [-f] [-w] [-p <model>] [-d <delim>]" + echo "" + echo "Command will set the current model to be the pre-trained model and increment a numerical suffix." + echo "-p model Sets the to-be name to be <model> rather than auto-incrementing the previous model." + echo "-d delim Delimiter in model-name (e.g. '-' in 'test-model-1')" + echo "-f Force. Ask for no confirmations." + echo "-w Wipe the S3 prefix to ensure that two models are not mixed." + exit 1 +} + +trap ctrl_c INT + +function ctrl_c() { + echo "Requested to stop." + exit 1 +} + +OPT_DELIM='-' + +while getopts ":fwp:d:h" opt; do + case $opt in + + f) + OPT_FORCE="True" + ;; + p) + OPT_PREFIX="$OPTARG" + ;; + w) + OPT_WIPE="--delete" + ;; + d) + OPT_DELIM="$OPTARG" + ;; + h) + usage + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + usage + ;; + esac +done + +CONFIG_FILE=$DR_CONFIG +echo "Configuration file $CONFIG_FILE will be updated."
+ +## Read in data +CURRENT_RUN_MODEL=$(grep -e "^DR_LOCAL_S3_MODEL_PREFIX" ${CONFIG_FILE} | awk '{split($0,a,"="); print a[2] }') +CURRENT_RUN_MODEL_NUM=$(echo "${CURRENT_RUN_MODEL}" | + awk -v DELIM="${OPT_DELIM}" '{ n=split($0,a,DELIM); if (a[n] ~ /^[0-9]+$/) print a[n]; else print ""; }') +if [[ -n ${OPT_PREFIX} ]]; then + NEW_RUN_MODEL="${OPT_PREFIX}" +else + if [[ -z ${CURRENT_RUN_MODEL_NUM} ]]; then + NEW_RUN_MODEL="${CURRENT_RUN_MODEL}${OPT_DELIM}1" + else + NEW_RUN_MODEL_NUM=$(echo "${CURRENT_RUN_MODEL_NUM} + 1" | bc) + NEW_RUN_MODEL=$(echo $CURRENT_RUN_MODEL | sed "s/${CURRENT_RUN_MODEL_NUM}\$/${NEW_RUN_MODEL_NUM}/") + fi +fi + +if [[ -n "${NEW_RUN_MODEL}" ]]; then + echo "Incrementing model from ${CURRENT_RUN_MODEL} to ${NEW_RUN_MODEL}" + if [[ -z "${OPT_FORCE}" ]]; then + read -r -p "Are you sure? [y/N] " response + if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then + echo "Aborting." + exit 1 + fi + fi + sed -i.bak -re "s/(DR_LOCAL_S3_PRETRAINED_PREFIX=).*$/\1$CURRENT_RUN_MODEL/g; s/(DR_LOCAL_S3_PRETRAINED=).*$/\1True/g; s/(DR_LOCAL_S3_MODEL_PREFIX=).*$/\1$NEW_RUN_MODEL/g" "$CONFIG_FILE" && echo "Done." +else + echo "Error in determining new model. Aborting." + exit 1 +fi + +if [[ -n "${OPT_WIPE}" ]]; then + MODEL_DIR_S3=$(aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 ls s3://${DR_LOCAL_S3_BUCKET}/${NEW_RUN_MODEL}) + if [[ -n "${MODEL_DIR_S3}" ]]; then + echo "The new model's S3 prefix s3://${DR_LOCAL_S3_BUCKET}/${NEW_RUN_MODEL} exists. Will wipe." + fi + if [[ -z "${OPT_FORCE}" ]]; then + read -r -p "Are you sure? [y/N] " response + if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then + echo "Aborting."
+ exit 1 + fi + fi + aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 rm s3://${DR_LOCAL_S3_BUCKET}/${NEW_RUN_MODEL} --recursive +fi diff --git a/scripts/training/prepare-config.py b/scripts/training/prepare-config.py new file mode 100755 index 00000000..fe7ff448 --- /dev/null +++ b/scripts/training/prepare-config.py @@ -0,0 +1,232 @@ +#!/usr/bin/python3 + +from datetime import datetime +import boto3 +import sys +import os +import time +import json +import io +import yaml + +train_time = datetime.now().strftime('%Y%m%d%H%M%S') + +config = {} +config['AWS_REGION'] = os.environ.get('DR_AWS_APP_REGION', 'us-east-1') +config['JOB_TYPE'] = 'TRAINING' +config['KINESIS_VIDEO_STREAM_NAME'] = os.environ.get('DR_KINESIS_STREAM_NAME', '') +config['METRICS_S3_BUCKET'] = os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket') + +s3_container_endpoint_url = os.environ.get('DR_MINIO_URL', None) +if s3_container_endpoint_url is not None: + config['S3_ENDPOINT_URL'] = s3_container_endpoint_url + +metrics_prefix = os.environ.get('DR_LOCAL_S3_METRICS_PREFIX', None) +if metrics_prefix is not None: + config['METRICS_S3_OBJECT_KEY'] = '{}/TrainingMetrics.json'.format(metrics_prefix) +else: + config['METRICS_S3_OBJECT_KEY'] = 'DeepRacer-Metrics/TrainingMetrics-{}.json'.format(train_time) + +config['MODEL_METADATA_FILE_S3_KEY'] = os.environ.get('DR_LOCAL_S3_MODEL_METADATA_KEY', 'custom_files/model_metadata.json') +config['REWARD_FILE_S3_KEY'] = os.environ.get('DR_LOCAL_S3_REWARD_KEY', 'custom_files/reward_function.py') +config['ROBOMAKER_SIMULATION_JOB_ACCOUNT_ID'] = os.environ.get('', 'Dummy') +config['NUM_WORKERS'] = os.environ.get('DR_WORKERS', 1) +config['SAGEMAKER_SHARED_S3_BUCKET'] = os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket') +config['SAGEMAKER_SHARED_S3_PREFIX'] = os.environ.get('DR_LOCAL_S3_MODEL_PREFIX', 'rl-deepracer-sagemaker') +config['SIMTRACE_S3_BUCKET'] = os.environ.get('DR_LOCAL_S3_BUCKET', 'bucket') +config['SIMTRACE_S3_PREFIX'] = os.environ.get('DR_LOCAL_S3_MODEL_PREFIX', 
'rl-deepracer-sagemaker') +config['TRAINING_JOB_ARN'] = 'arn:Dummy' + +# Car and training +config['BODY_SHELL_TYPE'] = os.environ.get('DR_CAR_BODY_SHELL_TYPE', 'deepracer') +config['CAR_COLOR'] = os.environ.get('DR_CAR_COLOR', 'Red') +config['CAR_NAME'] = os.environ.get('DR_CAR_NAME', 'MyCar') +config['RACE_TYPE'] = os.environ.get('DR_RACE_TYPE', 'TIME_TRIAL') +config['WORLD_NAME'] = os.environ.get('DR_WORLD_NAME', 'LGSWide') +config['DISPLAY_NAME'] = os.environ.get('DR_DISPLAY_NAME', 'racer1') +config['RACER_NAME'] = os.environ.get('DR_RACER_NAME', 'racer1') + +config['REVERSE_DIR'] = os.environ.get('DR_TRAIN_REVERSE_DIRECTION', False) +config['ALTERNATE_DRIVING_DIRECTION'] = os.environ.get('DR_TRAIN_ALTERNATE_DRIVING_DIRECTION', os.environ.get('DR_ALTERNATE_DRIVING_DIRECTION', 'false')) +config['CHANGE_START_POSITION'] = os.environ.get('DR_TRAIN_CHANGE_START_POSITION', os.environ.get('DR_CHANGE_START_POSITION', 'true')) +config['ROUND_ROBIN_ADVANCE_DIST'] = os.environ.get('DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST', '0.05') +config['START_POSITION_OFFSET'] = os.environ.get('DR_TRAIN_START_POSITION_OFFSET', '0.00') +config['ENABLE_DOMAIN_RANDOMIZATION'] = os.environ.get('DR_ENABLE_DOMAIN_RANDOMIZATION', 'false') +config['MIN_EVAL_TRIALS'] = os.environ.get('DR_TRAIN_MIN_EVAL_TRIALS', '5') +config['CAMERA_MAIN_ENABLE'] = os.environ.get('DR_CAMERA_MAIN_ENABLE', 'True') +config['CAMERA_SUB_ENABLE'] = os.environ.get('DR_CAMERA_SUB_ENABLE', 'True') +config['BEST_MODEL_METRIC'] = os.environ.get('DR_TRAIN_BEST_MODEL_METRIC', 'progress') +config['ENABLE_EXTRA_KVS_OVERLAY'] = os.environ.get('DR_ENABLE_EXTRA_KVS_OVERLAY', 'False') + +# Object Avoidance +if config['RACE_TYPE'] == 'OBJECT_AVOIDANCE': + config['NUMBER_OF_OBSTACLES'] = os.environ.get('DR_OA_NUMBER_OF_OBSTACLES', '6') + config['MIN_DISTANCE_BETWEEN_OBSTACLES'] = os.environ.get('DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES', '2.0') + config['RANDOMIZE_OBSTACLE_LOCATIONS'] = os.environ.get('DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS', 
'True') + config['IS_OBSTACLE_BOT_CAR'] = os.environ.get('DR_OA_IS_OBSTACLE_BOT_CAR', 'false') + config['OBSTACLE_TYPE'] = os.environ.get('DR_OA_OBSTACLE_TYPE', 'box_obstacle') + + object_position_str = os.environ.get('DR_OA_OBJECT_POSITIONS', "") + if object_position_str != "": + object_positions = [] + for o in object_position_str.split(";"): + object_positions.append(o) + config['OBJECT_POSITIONS'] = object_positions + config['NUMBER_OF_OBSTACLES'] = str(len(object_positions)) + +# Head to Bot +if config['RACE_TYPE'] == 'HEAD_TO_BOT': + config['IS_LANE_CHANGE'] = os.environ.get('DR_H2B_IS_LANE_CHANGE', 'False') + config['LOWER_LANE_CHANGE_TIME'] = os.environ.get('DR_H2B_LOWER_LANE_CHANGE_TIME', '3.0') + config['UPPER_LANE_CHANGE_TIME'] = os.environ.get('DR_H2B_UPPER_LANE_CHANGE_TIME', '5.0') + config['LANE_CHANGE_DISTANCE'] = os.environ.get('DR_H2B_LANE_CHANGE_DISTANCE', '1.0') + config['NUMBER_OF_BOT_CARS'] = os.environ.get('DR_H2B_NUMBER_OF_BOT_CARS', '0') + config['MIN_DISTANCE_BETWEEN_BOT_CARS'] = os.environ.get('DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS', '2.0') + config['RANDOMIZE_BOT_CAR_LOCATIONS'] = os.environ.get('DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS', 'False') + config['BOT_CAR_SPEED'] = os.environ.get('DR_H2B_BOT_CAR_SPEED', '0.2') + config['PENALTY_SECONDS'] = os.environ.get('DR_H2B_BOT_CAR_PENALTY', '2.0') + +s3_local_endpoint_url = os.environ.get('DR_LOCAL_S3_ENDPOINT_URL', None) +s3_region = config['AWS_REGION'] +s3_bucket = config['SAGEMAKER_SHARED_S3_BUCKET'] +s3_prefix = config['SAGEMAKER_SHARED_S3_PREFIX'] +s3_mode = os.environ.get('DR_LOCAL_S3_AUTH_MODE','profile') +if s3_mode == 'profile': + s3_profile = os.environ.get('DR_LOCAL_S3_PROFILE', 'default') +else: # mode is 'role' + s3_profile = None +s3_yaml_name = os.environ.get('DR_LOCAL_S3_TRAINING_PARAMS_FILE', 'training_params.yaml') +yaml_key = os.path.normpath(os.path.join(s3_prefix, s3_yaml_name)) + +session = boto3.session.Session(profile_name=s3_profile) +s3_client = session.client('s3', 
region_name=s3_region, endpoint_url=s3_local_endpoint_url) + +yaml_key = os.path.normpath(os.path.join(s3_prefix, s3_yaml_name)) +local_yaml_path = os.path.abspath(os.path.join(os.environ.get('DR_DIR'),'tmp', 'training-params-' + train_time + '.yaml')) + +with open(local_yaml_path, 'w') as yaml_file: + yaml.dump(config, yaml_file, default_flow_style=False, default_style='\'', explicit_start=True) + +# Copy the reward function to the s3 prefix bucket for compatibility with DeepRacer console. +reward_function_key = os.path.normpath(os.path.join(s3_prefix, "reward_function.py")) +copy_source = { + 'Bucket': s3_bucket, + 'Key': config['REWARD_FILE_S3_KEY'] +} +s3_client.copy(copy_source, Bucket=s3_bucket, Key=reward_function_key) + +# Training with different configurations on each worker (aka Multi Config training) +config['MULTI_CONFIG'] = os.environ.get('DR_TRAIN_MULTI_CONFIG', 'False') +num_workers = int(config['NUM_WORKERS']) + +if config['MULTI_CONFIG'] == "True" and num_workers > 1: + + multi_config = {} + multi_config['multi_config'] = [None] * num_workers + + for i in range(1,num_workers+1,1): + if i == 1: + # copy training_params to training_params_1 + s3_yaml_name_list = s3_yaml_name.split('.') + s3_yaml_name_temp = s3_yaml_name_list[0] + "_%d.yaml" % i + + #upload additional training params files + yaml_key = os.path.normpath(os.path.join(s3_prefix, s3_yaml_name_temp)) + s3_client.upload_file(Bucket=s3_bucket, Key=yaml_key, Filename=local_yaml_path) + + # Store in multi_config array + multi_config['multi_config'][i - 1] = {'config_file': s3_yaml_name_temp, + 'world_name': config['WORLD_NAME']} + + else: # i >= 2 + #read in additional configuration file.
format of file must be worker-#.env + location = os.path.abspath(os.path.join(os.environ.get('DR_DIR'),'worker-{}.env'.format(i))) + with open(location, 'r') as fh: + vars_dict = dict( + tuple(line.split('=')) + for line in fh.read().splitlines() if not line.startswith('#') + ) + + # Reset parameters for the configuration of this worker number + os.environ.update(vars_dict) + + # Update car and training parameters + config.update({'WORLD_NAME': os.environ.get('DR_WORLD_NAME')}) + config.update({'RACE_TYPE': os.environ.get('DR_RACE_TYPE')}) + config.update({'CAR_COLOR': os.environ.get('DR_CAR_COLOR')}) + config.update({'BODY_SHELL_TYPE': os.environ.get('DR_CAR_BODY_SHELL_TYPE')}) + config.update({'ALTERNATE_DRIVING_DIRECTION': os.environ.get('DR_TRAIN_ALTERNATE_DRIVING_DIRECTION')}) + config.update({'CHANGE_START_POSITION': os.environ.get('DR_TRAIN_CHANGE_START_POSITION')}) + config.update({'ROUND_ROBIN_ADVANCE_DIST': os.environ.get('DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST')}) + config.update({'ENABLE_DOMAIN_RANDOMIZATION': os.environ.get('DR_ENABLE_DOMAIN_RANDOMIZATION')}) + config.update({'START_POSITION_OFFSET': os.environ.get('DR_TRAIN_START_POSITION_OFFSET', '0.00')}) + config.update({'REVERSE_DIR': os.environ.get('DR_TRAIN_REVERSE_DIRECTION', False)}) + config.update({'CAMERA_MAIN_ENABLE': os.environ.get('DR_CAMERA_MAIN_ENABLE', 'True')}) + config.update({'CAMERA_SUB_ENABLE': os.environ.get('DR_CAMERA_SUB_ENABLE', 'True')}) + config.update({'ENABLE_EXTRA_KVS_OVERLAY': os.environ.get('DR_ENABLE_EXTRA_KVS_OVERLAY', 'False')}) + + + # Update Object Avoidance parameters + if config['RACE_TYPE'] == 'OBJECT_AVOIDANCE': + config.update({'NUMBER_OF_OBSTACLES': os.environ.get('DR_OA_NUMBER_OF_OBSTACLES')}) + config.update({'MIN_DISTANCE_BETWEEN_OBSTACLES': os.environ.get('DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES')}) + config.update({'RANDOMIZE_OBSTACLE_LOCATIONS': os.environ.get('DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS')}) + config.update({'IS_OBSTACLE_BOT_CAR':
os.environ.get('DR_OA_IS_OBSTACLE_BOT_CAR')}) + config.update({'OBSTACLE_TYPE': os.environ.get('DR_OA_OBSTACLE_TYPE', 'box_obstacle')}) + + object_position_str = os.environ.get('DR_OA_OBJECT_POSITIONS', "") + if object_position_str != "": + object_positions = [] + for o in object_position_str.replace('"','').split(";"): + object_positions.append(o) + config.update({'OBJECT_POSITIONS': object_positions}) + config.update({'NUMBER_OF_OBSTACLES': str(len(object_positions))}) + else: + config.pop('OBJECT_POSITIONS',[]) + else: + config.pop('NUMBER_OF_OBSTACLES', None) + config.pop('MIN_DISTANCE_BETWEEN_OBSTACLES', None) + config.pop('RANDOMIZE_OBSTACLE_LOCATIONS', None) + config.pop('IS_OBSTACLE_BOT_CAR', None) + config.pop('OBJECT_POSITIONS',[]) + + # Update Head to Bot parameters + if config['RACE_TYPE'] == 'HEAD_TO_BOT': + config.update({'IS_LANE_CHANGE': os.environ.get('DR_H2B_IS_LANE_CHANGE')}) + config.update({'LOWER_LANE_CHANGE_TIME': os.environ.get('DR_H2B_LOWER_LANE_CHANGE_TIME')}) + config.update({'UPPER_LANE_CHANGE_TIME': os.environ.get('DR_H2B_UPPER_LANE_CHANGE_TIME')}) + config.update({'LANE_CHANGE_DISTANCE': os.environ.get('DR_H2B_LANE_CHANGE_DISTANCE')}) + config.update({'NUMBER_OF_BOT_CARS': os.environ.get('DR_H2B_NUMBER_OF_BOT_CARS')}) + config.update({'MIN_DISTANCE_BETWEEN_BOT_CARS': os.environ.get('DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS')}) + config.update({'RANDOMIZE_BOT_CAR_LOCATIONS': os.environ.get('DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS')}) + config.update({'BOT_CAR_SPEED': os.environ.get('DR_H2B_BOT_CAR_SPEED')}) + config.update({'PENALTY_SECONDS': os.environ.get('DR_H2B_BOT_CAR_PENALTY')}) + else: + config.pop('IS_LANE_CHANGE', None) + config.pop('LOWER_LANE_CHANGE_TIME', None) + config.pop('UPPER_LANE_CHANGE_TIME', None) + config.pop('LANE_CHANGE_DISTANCE', None) + config.pop('NUMBER_OF_BOT_CARS', None) + config.pop('MIN_DISTANCE_BETWEEN_BOT_CARS', None) + config.pop('RANDOMIZE_BOT_CAR_LOCATIONS', None) + config.pop('BOT_CAR_SPEED', None) + + #split 
string s3_yaml_name, insert the worker number, and add back on the .yaml extension + s3_yaml_name_list = s3_yaml_name.split('.') + s3_yaml_name_temp = s3_yaml_name_list[0] + "_%d.yaml" % i + + #upload additional training params files + yaml_key = os.path.normpath(os.path.join(s3_prefix, s3_yaml_name_temp)) + local_yaml_path = os.path.abspath(os.path.join(os.environ.get('DR_DIR'),'tmp', 'training-params-' + train_time + '-' + str(i) + '.yaml')) + with open(local_yaml_path, 'w') as yaml_file: + yaml.dump(config, yaml_file, default_flow_style=False, default_style='\'', explicit_start=True) + s3_client.upload_file(Bucket=s3_bucket, Key=yaml_key, Filename=local_yaml_path) + + # Store in multi_config array + multi_config['multi_config'][i - 1] = {'config_file': s3_yaml_name_temp, + 'world_name': config['WORLD_NAME']} + + print(json.dumps(multi_config)) + +else: + s3_client.upload_file(Bucket=s3_bucket, Key=yaml_key, Filename=local_yaml_path) diff --git a/scripts/training/set-last-run-to-pretrained.sh b/scripts/training/set-last-run-to-pretrained.sh deleted file mode 100755 index c4a13e9c..00000000 --- a/scripts/training/set-last-run-to-pretrained.sh +++ /dev/null @@ -1,4 +0,0 @@ -#!/usr/bin/env bash - -rm -rf ../../docker/volumes/minio/bucket/rl-deepracer-pretrained -mv ../../docker/volumes/minio/bucket/rl-deepracer-sagemaker ../../docker/volumes/minio/bucket/rl-deepracer-pretrained \ No newline at end of file diff --git a/scripts/training/start.sh b/scripts/training/start.sh index 52e599da..ef6ab894 100755 --- a/scripts/training/start.sh +++ b/scripts/training/start.sh @@ -1,15 +1,234 @@ #!/usr/bin/env bash -export ROBOMAKER_COMMAND="./run.sh build distributed_training.launch" +source $DR_DIR/bin/scripts_wrapper.sh -docker-compose -f ../../docker/docker-compose.yml up -d -echo 'waiting for containers to start up...' +usage() { + echo "Usage: $0 [-w] [-q | -s | -r [n] | -a ] [-v]" + echo " -w Wipes the target AWS DeepRacer model structure before upload." 
+ echo " -q Do not output / follow a log when starting."
+ echo " -a Follow all Sagemaker and Robomaker logs."
+ echo " -s Follow Sagemaker logs (default)."
+ echo " -v Updates the viewer webpage."
+ echo " -r [n] Follow Robomaker logs for worker n (default worker 0 / replica 1)."
+ exit 1
+}
-#sleep for 20 seconds to allow the containers to start
-sleep 15
+trap ctrl_c INT
-echo 'attempting to pull up sagemaker logs...'
-gnome-terminal -x sh -c "!!; docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }')"
+function ctrl_c() {
+ echo "Requested to stop."
+ exit 1
+}
-echo 'attempting to open vnc viewer...'
-gnome-terminal -x sh -c "!!; vncviewer localhost:8080"
\ No newline at end of file
+OPT_DISPLAY="SAGEMAKER"
+
+while getopts ":whqsavr:" opt; do
+ case $opt in
+ w)
+ OPT_WIPE="WIPE"
+ ;;
+ q)
+ OPT_QUIET="QUIET"
+ ;;
+ s)
+ OPT_DISPLAY="SAGEMAKER"
+ ;;
+ a)
+ OPT_DISPLAY="ALL"
+ ;;
+ r) # Check if value is in numeric format.
+ OPT_DISPLAY="ROBOMAKER"
+ if [[ $OPTARG =~ ^[0-9]+$ ]]; then
+ OPT_ROBOMAKER=$OPTARG
+ else
+ OPT_ROBOMAKER=0
+ ((OPTIND--))
+ fi
+ ;;
+ v)
+ OPT_VIEWER="VIEWER"
+ ;;
+ h)
+ usage
+ ;;
+ \?)
+ echo "Invalid option -$OPTARG" >&2
+ usage
+ ;;
+ esac
+done
+
+## Check if WSL2
+if grep -qi Microsoft /proc/version && grep -q "WSL2" /proc/version; then
+ IS_WSL2="yes"
+fi
+
+# Ensure Sagemaker's folder is there
+if [ ! -d /tmp/sagemaker ]; then
+ sudo mkdir -p /tmp/sagemaker
+ sudo chmod -R g+w /tmp/sagemaker
+fi
+
+# Set training-specific environment variables
+STACK_NAME="deepracer-$DR_RUN_ID"
+STACK_CONTAINERS=$(docker stack ps $STACK_NAME 2>/dev/null | wc -l)
+if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then
+ if [[ "$STACK_CONTAINERS" -gt 1 ]]; then
+ echo "ERROR: Processes running in stack $STACK_NAME. Stop training with dr-stop-training."
+ exit 1
+ fi
+fi
+
+# Check if metadata-files are available
+WORK_DIR=${DR_DIR}/tmp/start/
+mkdir -p ${WORK_DIR}
+rm -f ${WORK_DIR}/*
+
+REWARD_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_REWARD_KEY} ${WORK_DIR} --no-progress 2>/dev/null | awk '/reward/ {print $4}' | xargs readlink -f 2>/dev/null)
+METADATA_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_MODEL_METADATA_KEY} ${WORK_DIR} --no-progress 2>/dev/null | awk '/model_metadata.json$/ {print $4}' | xargs readlink -f 2>/dev/null)
+HYPERPARAM_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_HYPERPARAMETERS_KEY} ${WORK_DIR} --no-progress 2>/dev/null | awk '/hyperparameters.json$/ {print $4}' | xargs readlink -f 2>/dev/null)
+
+if [ -n "$METADATA_FILE" ] && [ -n "$REWARD_FILE" ] && [ -n "$HYPERPARAM_FILE" ]; then
+ echo "Training of model s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_MODEL_PREFIX starting."
+ echo "Using configuration files:"
+ echo " s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_REWARD_KEY}"
+ echo " s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_MODEL_METADATA_KEY}"
+ echo " s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_HYPERPARAMETERS_KEY}"
+ echo "Using image ${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION}"
+ echo ""
+else
+ echo "Training aborted. Configuration files were not found."
+ echo "Manually check that the following files exist:"
+ echo " s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_REWARD_KEY}"
+ echo " s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_MODEL_METADATA_KEY}"
+ echo " s3://${DR_LOCAL_S3_BUCKET}/${DR_LOCAL_S3_HYPERPARAMETERS_KEY}"
+ echo "You might have to run dr-upload-custom-files."
+ exit 1
+fi
+
+# Check if model path exists.
+S3_PATH="s3://$DR_LOCAL_S3_BUCKET/$DR_LOCAL_S3_MODEL_PREFIX"
+
+S3_FILES=$(aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 ls ${S3_PATH} | wc -l)
+if [[ "$S3_FILES" -gt 0 ]]; then
+ if [[ -z $OPT_WIPE ]]; then
+ echo "Selected path $S3_PATH exists.
Delete it, or use -w option. Exiting." + exit 1 + else + echo "Wiping path $S3_PATH." + aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 rm --recursive ${S3_PATH} + fi +fi + +# Base compose file +if [ ${DR_ROBOMAKER_MOUNT_LOGS,,} = "true" ]; then + COMPOSE_FILES="$DR_TRAIN_COMPOSE_FILE $DR_DOCKER_FILE_SEP $DR_DIR/docker/docker-compose-mount.yml" + export DR_MOUNT_DIR="$DR_DIR/data/logs/robomaker/$DR_LOCAL_S3_MODEL_PREFIX" + mkdir -p $DR_MOUNT_DIR +else + COMPOSE_FILES="$DR_TRAIN_COMPOSE_FILE" +fi + +export DR_CURRENT_PARAMS_FILE=${DR_LOCAL_S3_TRAINING_PARAMS_FILE} + +WORKER_CONFIG=$(python3 $DR_DIR/scripts/training/prepare-config.py) + +if [ "$DR_WORKERS" -gt 1 ]; then + echo "Starting $DR_WORKERS workers" + + if [[ "${DR_DOCKER_STYLE,,}" != "swarm" ]]; then + mkdir -p $DR_DIR/tmp/comms.$DR_RUN_ID + rm -rf $DR_DIR/tmp/comms.$DR_RUN_ID/* + COMPOSE_FILES="$COMPOSE_FILES $DR_DOCKER_FILE_SEP $DR_DIR/docker/docker-compose-robomaker-multi.yml" + fi + + if [ "$DR_TRAIN_MULTI_CONFIG" == "True" ]; then + export MULTI_CONFIG=$WORKER_CONFIG + echo "Multi-config training, creating multiple Robomaker configurations in $S3_PATH" + else + echo "Creating Robomaker configuration in $S3_PATH/$DR_LOCAL_S3_TRAINING_PARAMS_FILE" + fi + export ROBOMAKER_COMMAND="/opt/ml/code/run.sh multi distributed_training.launch.py" + +else + export ROBOMAKER_COMMAND="/opt/ml/code/run.sh run distributed_training.launch.py" + echo "Creating Robomaker configuration in $S3_PATH/$DR_LOCAL_S3_TRAINING_PARAMS_FILE" +fi + +# Check if we are using Host X -- ensure variables are populated +if [[ "${DR_HOST_X,,}" == "true" ]]; then + if [[ -n "$DR_DISPLAY" ]]; then + ROBO_DISPLAY=$DR_DISPLAY + else + ROBO_DISPLAY=$DISPLAY + fi + + if ! DISPLAY=$ROBO_DISPLAY timeout 1s xset q &>/dev/null; then + echo "No X Server running on display $ROBO_DISPLAY. Exiting" + exit 1 + fi + + if [[ -z "$XAUTHORITY" && "$IS_WSL2" != "yes" ]]; then + export XAUTHORITY=~/.Xauthority + if [[ ! 
-f "$XAUTHORITY" ]]; then + echo "No XAUTHORITY defined. .Xauthority does not exist. Stopping." + exit 1 + fi + fi + +fi + +# Check if we will use Docker Swarm or Docker Compose +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + ROBOMAKER_NODES=$(docker node ls --format '{{.ID}}' | xargs docker inspect | jq '.[] | select (.Spec.Labels.Robomaker == "true") | .ID' | wc -l) + if [[ "$ROBOMAKER_NODES" -eq 0 ]]; then + echo "ERROR: No Swarm Nodes labelled for placement of Robomaker. Please add Robomaker node." + echo " Example: docker node update --label-add Robomaker=true $(docker node inspect self | jq .[0].ID -r)" + exit 1 + fi + + SAGEMAKER_NODES=$(docker node ls --format '{{.ID}}' | xargs docker inspect | jq '.[] | select (.Spec.Labels.Sagemaker == "true") | .ID' | wc -l) + if [[ "$SAGEMAKER_NODES" -eq 0 ]]; then + echo "ERROR: No Swarm Nodes labelled for placement of Sagemaker. Please add Sagemaker node." + echo " Example: docker node update --label-add Sagemaker=true $(docker node inspect self | jq .[0].ID -r)" + exit 1 + fi + + if [ "$DR_DOCKER_MAJOR_VERSION" -gt 24 ]; then + DETACH_FLAG="--detach=true" + fi + + DISPLAY=$ROBO_DISPLAY docker stack deploy $COMPOSE_FILES $DETACH_FLAG $STACK_NAME + +else + DISPLAY=$ROBO_DISPLAY docker compose $COMPOSE_FILES -p $STACK_NAME up -d --scale robomaker=$DR_WORKERS +fi + +# Viewer +if [ -n "$OPT_VIEWER" ]; then + ( + sleep 5 + dr-update-viewer + ) +fi + +# Request to be quiet. Quitting here. 
+if [ -n "$OPT_QUIET" ]; then + exit 0 +fi + +# Trigger requested log-file +if [[ "${OPT_DISPLAY,,}" == "all" && -n "${DISPLAY}" && "${DR_HOST_X,,}" == "true" ]]; then + dr-logs-sagemaker -w 15 + if [ "${DR_WORKERS}" -gt 1 ]; then + for i in $(seq 1 ${DR_WORKERS}); do + dr-logs-robomaker -w 15 -n $i + done + else + dr-logs-robomaker -w 15 + fi +elif [[ "${OPT_DISPLAY,,}" == "robomaker" ]]; then + dr-logs-robomaker -w 15 -n $OPT_ROBOMAKER +elif [[ "${OPT_DISPLAY,,}" == "sagemaker" ]]; then + dr-logs-sagemaker -w 15 +fi diff --git a/scripts/training/stop.sh b/scripts/training/stop.sh index bb1776fd..a02591c5 100755 --- a/scripts/training/stop.sh +++ b/scripts/training/stop.sh @@ -1,6 +1,40 @@ #!/usr/bin/env bash +STACK_NAME="deepracer-$DR_RUN_ID" +RUN_NAME=${DR_LOCAL_S3_MODEL_PREFIX} -docker-compose -f ../../docker/docker-compose.yml down +SAGEMAKER_CONTAINERS=$(docker ps | awk ' /simapp/ { print $1 } ' | xargs) -docker stop $(docker ps | awk ' /sagemaker/ { print $1 }') -docker rm $(docker ps -a | awk ' /sagemaker/ { print $1 }') \ No newline at end of file +if [[ -n "$SAGEMAKER_CONTAINERS" ]]; then + for CONTAINER in $SAGEMAKER_CONTAINERS; do + CONTAINER_NAME=$(docker ps --format '{{.Names}}' --filter id=$CONTAINER) + CONTAINER_PREFIX=$(echo $CONTAINER_NAME | perl -n -e'/(.*)-(algo-(.)-(.*))/; print $1') + COMPOSE_SERVICE_NAME=$(echo $CONTAINER_NAME | perl -n -e'/(.*)-(algo-(.)-(.*))/; print $2') + + if [[ -n "$COMPOSE_SERVICE_NAME" ]]; then + COMPOSE_FILES=$(sudo find /tmp/sagemaker -name docker-compose.yaml -exec grep -l "$COMPOSE_SERVICE_NAME" {} +) + for COMPOSE_FILE in $COMPOSE_FILES; do + if sudo grep -q "RUN_ID=${DR_RUN_ID}" $COMPOSE_FILE && sudo grep -q "${RUN_NAME}" $COMPOSE_FILE; then + echo Found Sagemaker as $CONTAINER_NAME + + # Check if Docker version is greater than 24 + if [ "$DR_DOCKER_MAJOR_VERSION" -gt 24 ]; then + # Remove version tag from docker-compose.yaml + sudo sed -i '/^version:/d' $COMPOSE_FILE + fi + + sudo docker compose -f 
$COMPOSE_FILE stop $COMPOSE_SERVICE_NAME + docker container rm $CONTAINER -v >/dev/null + fi + done + fi + done +fi + +# Check if we will use Docker Swarm or Docker Compose +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + docker stack rm $STACK_NAME +else + COMPOSE_FILES=$(echo ${DR_TRAIN_COMPOSE_FILE} | cut -f1-2 -d\ ) + export DR_CURRENT_PARAMS_FILE="" + docker compose $COMPOSE_FILES -p $STACK_NAME down +fi diff --git a/scripts/training/upload-snapshot.sh b/scripts/training/upload-snapshot.sh deleted file mode 100755 index 99a767e9..00000000 --- a/scripts/training/upload-snapshot.sh +++ /dev/null @@ -1,98 +0,0 @@ -#!/usr/bin/env bash - -S3_BUCKET={replace with your own S3 bucket name} - -S3_PREFIX={replace with your own S3 prefix} - -MODEL_DIR=$(pwd)/../../docker/volumes/minio/bucket/rl-deepracer-sagemaker/model/ - -display_usage() { - echo -e "\nUsage:\n./upload-snapshot.sh -c checkpoint \n" -} - -# check whether user had supplied -h or --help . If yes display usage -if [[ ( $# == "--help") || $# == "-h" ]]; then -display_usage -exit 0 -fi - -while getopts ":c:" opt; do -case $opt in -c) CHECKPOINT="$OPTARG" -;; -\?) echo "Invalid option -$OPTARG" >&2 -;; -esac -done - -# echo 'checkpoint recieved: ' ${CHECKPOINT} - -if [ -z "$CHECKPOINT" ]; then - echo "Checkpoint not supplied, checking for latest checkpoint" - CHECKPOINT_FILE=$MODEL_DIR"checkpoint" - - if [ ! -f ${CHECKPOINT_FILE} ]; then - echo "Checkpoint file not found!" - return 1 - else - echo "found checkpoint index file "$CHECKPOINT_FILE - fi; - - FIRST_LINE=$(head -n 1 $CHECKPOINT_FILE) - CHECKPOINT=`echo $FIRST_LINE | sed "s/[model_checkpoint_path: [^ ]*//"` - CHECKPOINT=`echo $CHECKPOINT | sed 's/[_][^ ]*//'` - CHECKPOINT=`echo $CHECKPOINT | sed 's/"//g'` - echo "latest checkpoint = "$CHECKPOINT -else - echo "Checkpoint supplied: ["${CHECKPOINT}"]" -fi - -mkdir -p checkpoint -MODEL_FILE=$MODEL_DIR"model_"$CHECKPOINT".pb" -METADATA_FILE=$MODEL_DIR"model_metadata.json" - - -if test ! 
-f "$MODEL_FILE"; then
- echo "$MODEL_FILE doesn't exist"
- return 1
-else
- cp $MODEL_FILE checkpoint/
-fi
-
-if test ! -f "$METADATA_FILE"; then
- echo "$METADATA_FILE doesn't exist"
- return 1
-else
- cp $METADATA_FILE checkpoint/
-fi
-
-
-for i in $( find $MODEL_DIR -type f -name $CHECKPOINT"*" ); do
- cp $i checkpoint/
-done
-
-ls ${MODEL_DIR}${CHECKPOINT}_Step-*.ckpt.index | xargs -n 1 basename | sed 's/[.][^ ]*//'
-
-CONTENT=$(ls ${MODEL_DIR}${CHECKPOINT}_Step-*.ckpt.index | xargs -n 1 basename | sed 's/[.][^ ]*//')
-echo ${CONTENT}
-
-echo 'model_checkpoint_path: "'${CONTENT}'.ckpt"' > checkpoint/checkpoint
-
-# # upload files to s3
-for filename in checkpoint/*; do
- aws s3 cp $filename s3://$S3_BUCKET/$S3_PREFIX/model/
-done
-
-tar -czvf ${CHECKPOINT}-checkpoint.tar.gz checkpoint/*
-
-rm -rf checkpoint
-echo 'done uploading model!'
-
-
-
-
-
-
-
-
-
diff --git a/scripts/upload/download-model.sh b/scripts/upload/download-model.sh
new file mode 100755
index 00000000..e9cb898b
--- /dev/null
+++ b/scripts/upload/download-model.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+
+usage() {
+ echo "Usage: $0 [-f] [-w] [-d] [-c] -s <source-url> -t <target-prefix>"
+ echo " -f Force download. No confirmation question."
+ echo " -w Wipes the target prefix in local storage before download."
+ echo " -d Dry-Run mode. Does not perform any write or delete operations on target."
+ echo " -c Copy config files into custom_files."
+ echo " -s source-url Downloads model from specified S3 URL (s3://bucket/prefix)."
+ echo " -t target-prefix Downloads model into specified prefix in local storage."
+ exit 1
+}
+
+trap ctrl_c INT
+
+function ctrl_c() {
+ echo "Requested to stop."
+ exit 1
+}
+
+while getopts "s:t:fwcdh" opt; do
+ case $opt in
+ f)
+ OPT_FORCE="True"
+ ;;
+ c)
+ OPT_CONFIG="Config"
+ ;;
+ d)
+ OPT_DRYRUN="--dryrun"
+ ;;
+ w)
+ OPT_WIPE="--delete"
+ ;;
+ t)
+ OPT_TARGET="$OPTARG"
+ ;;
+ s)
+ OPT_SOURCE="$OPTARG"
+ ;;
+ h)
+ usage
+ ;;
+ \?)
+ echo "Invalid option -$OPTARG" >&2 + usage + ;; + esac +done + +if [[ -n "${OPT_DRYRUN}" ]]; then + echo "*** DRYRUN MODE ***" +fi + +SOURCE_S3_URL="${OPT_SOURCE}" + +if [[ -z "${SOURCE_S3_URL}" ]]; then + echo "No source URL to download model from." + exit 1 +fi + +TARGET_S3_BUCKET=${DR_LOCAL_S3_BUCKET} +TARGET_S3_PREFIX=${OPT_TARGET} +if [[ -z "${TARGET_S3_PREFIX}" ]]; then + echo "No target prefix defined. Exiting." + exit 1 +fi + +SOURCE_REWARD_FILE_S3_KEY="${SOURCE_S3_URL}/reward_function.py" +SOURCE_HYPERPARAM_FILE_S3_KEY="${SOURCE_S3_URL}/ip/hyperparameters.json" +SOURCE_METADATA_S3_KEY="${SOURCE_S3_URL}/model/model_metadata.json" + +WORK_DIR=${DR_DIR}/tmp/download +mkdir -p ${WORK_DIR} && rm -rf ${WORK_DIR} && mkdir -p ${WORK_DIR}/config ${WORK_DIR}/full + +# Check if metadata-files are available +REWARD_FILE=$(aws ${DR_UPLOAD_PROFILE} s3 cp "${SOURCE_REWARD_FILE_S3_KEY}" ${WORK_DIR}/config/ --no-progress | awk '/reward/ {print $4}' | xargs readlink -f 2>/dev/null) +METADATA_FILE=$(aws ${DR_UPLOAD_PROFILE} s3 cp "${SOURCE_METADATA_S3_KEY}" ${WORK_DIR}/config/ --no-progress | awk '/model_metadata.json$/ {print $4}' | xargs readlink -f 2>/dev/null) +HYPERPARAM_FILE=$(aws ${DR_UPLOAD_PROFILE} s3 cp "${SOURCE_HYPERPARAM_FILE_S3_KEY}" ${WORK_DIR}/config/ --no-progress | awk '/hyperparameters.json$/ {print $4}' | xargs readlink -f 2>/dev/null) + +if [ -n "$METADATA_FILE" ] && [ -n "$REWARD_FILE" ] && [ -n "$HYPERPARAM_FILE" ]; then + echo "All meta-data files found. Source model ${SOURCE_S3_URL} valid." +else + echo "Meta-data files are not found. Source model ${SOURCE_S3_URL} not valid. Exiting." + exit 1 +fi + +# Upload files +if [[ -z "${OPT_FORCE}" ]]; then + echo "Ready to download model ${SOURCE_S3_URL} to local ${TARGET_S3_PREFIX}" + read -r -p "Are you sure? [y/N] " response + if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then + echo "Aborting." 
+ exit 1
+ fi
+fi
+
+cd ${WORK_DIR}
+aws ${DR_UPLOAD_PROFILE} s3 sync "${SOURCE_S3_URL}" ${WORK_DIR}/full/ ${OPT_DRYRUN}
+aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 sync ${WORK_DIR}/full/ s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/ ${OPT_DRYRUN} ${OPT_WIPE}
+
+if [[ -n "${OPT_CONFIG}" ]]; then
+ echo "Copy configuration to custom_files"
+ cp ${WORK_DIR}/config/* ${DR_DIR}/custom_files/
+fi
+
+echo "Done."
diff --git a/scripts/upload/increment.sh b/scripts/upload/increment.sh
new file mode 100755
index 00000000..d39bdec3
--- /dev/null
+++ b/scripts/upload/increment.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+
+usage() {
+ echo "Usage: $0 [-f] [-w] [-p <model>] [-d <delim>]"
+ echo ""
+ echo "Command will increment a numerical suffix on the current upload model."
+ echo "-p model Sets the to-be name to be <model> rather than auto-incrementing the previous model."
+ echo "-d delim Delimiter in model-name (e.g. '-' in 'test-model-1')"
+ echo "-f Force. Ask for no confirmations."
+ echo "-w Wipe the S3 prefix to ensure that two models are not mixed."
+ exit 1
+}
+
+trap ctrl_c INT
+
+function ctrl_c() {
+ echo "Requested to stop."
+ exit 1
+}
+
+OPT_DELIM='-'
+
+while getopts ":fwp:d:" opt; do
+ case $opt in
+
+ f)
+ OPT_FORCE="True"
+ ;;
+ p)
+ OPT_PREFIX="$OPTARG"
+ ;;
+ w)
+ OPT_WIPE="--delete"
+ ;;
+ d)
+ OPT_DELIM="$OPTARG"
+ ;;
+ h)
+ usage
+ ;;
+ \?)
+ echo "Invalid option -$OPTARG" >&2
+ usage
+ ;;
+ esac
+done
+
+CONFIG_FILE=$DR_CONFIG
+echo "Configuration file $CONFIG_FILE will be updated."
+ +## Read in data +CURRENT_UPLOAD_MODEL=$(grep -e "^DR_UPLOAD_S3_PREFIX" ${CONFIG_FILE} | awk '{split($0,a,"="); print a[2] }') +CURRENT_UPLOAD_MODEL_NUM=$(echo "${CURRENT_UPLOAD_MODEL}" | + awk -v DELIM="${OPT_DELIM}" '{ n=split($0,a,DELIM); if (a[n] ~ /[0-9]*/) print a[n]; else print ""; }') +if [[ -z ${CURRENT_UPLOAD_MODEL_NUM} ]]; then + NEW_UPLOAD_MODEL="${CURRENT_UPLOAD_MODEL}${OPT_DELIM}1" +else + NEW_UPLOAD_MODEL_NUM=$(echo "${CURRENT_UPLOAD_MODEL_NUM} + 1" | bc) + NEW_UPLOAD_MODEL=$(echo $CURRENT_UPLOAD_MODEL | sed "s/${CURRENT_UPLOAD_MODEL_NUM}\$/${NEW_UPLOAD_MODEL_NUM}/") +fi + +if [[ -n "${NEW_UPLOAD_MODEL}" ]]; then + echo "Incrementing model from ${CURRENT_UPLOAD_MODEL} to ${NEW_UPLOAD_MODEL}" + if [[ -z "${OPT_FORCE}" ]]; then + read -r -p "Are you sure? [y/N] " response + if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then + echo "Aborting." + exit 1 + fi + fi + sed -i.bak -re "s/(DR_UPLOAD_S3_PREFIX=).*$/\1$NEW_UPLOAD_MODEL/g" "$CONFIG_FILE" && echo "Done." +else + echo "Error in determining new model. Aborting." + exit 1 +fi + +export DR_UPLOAD_S3_PREFIX=$(eval echo "${NEW_UPLOAD_MODEL}") + +if [[ -n "${OPT_WIPE}" ]]; then + MODEL_DIR_S3=$(aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 ls s3://${DR_LOCAL_S3_BUCKET}/${NEW_UPLOAD_MODEL}) + if [[ -n "${MODEL_DIR_S3}" ]]; then + echo "The new model's S3 prefix s3://${DR_LOCAL_S3_BUCKET}/${NEW_UPLOAD_MODEL} exists. Will wipe." + fi + if [[ -z "${OPT_FORCE}" ]]; then + read -r -p "Are you sure? [y/N] " response + if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then + echo "Aborting." 
+ exit 1 + fi + fi + aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 rm s3://${DR_LOCAL_S3_BUCKET}/${NEW_UPLOAD_MODEL} --recursive +fi diff --git a/scripts/upload/prepare-config.py b/scripts/upload/prepare-config.py new file mode 100755 index 00000000..a7520a39 --- /dev/null +++ b/scripts/upload/prepare-config.py @@ -0,0 +1,67 @@ +#!/usr/bin/python3 + +import boto3 +import sys +import os +import time +import json +import io +import yaml + +config = {} +config['AWS_REGION'] = os.environ.get('DR_AWS_APP_REGION', 'us-east-1') +config['JOB_TYPE'] = 'TRAINING' +config['METRICS_S3_BUCKET'] = os.environ.get('TARGET_S3_BUCKET', 'bucket') +config['METRICS_S3_OBJECT_KEY'] = "{}/TrainingMetrics.json".format(os.environ.get('TARGET_S3_PREFIX', 'bucket')) +config['MODEL_METADATA_FILE_S3_KEY'] = "{}/model/model_metadata.json".format(os.environ.get('TARGET_S3_PREFIX', 'bucket')) +config['REWARD_FILE_S3_KEY'] = "{}/reward_function.py".format(os.environ.get('TARGET_S3_PREFIX', 'bucket')) +config['SAGEMAKER_SHARED_S3_BUCKET'] = os.environ.get('TARGET_S3_BUCKET', 'bucket') +config['SAGEMAKER_SHARED_S3_PREFIX'] = os.environ.get('TARGET_S3_PREFIX', 'rl-deepracer-sagemaker') + +# Car and training +config['BODY_SHELL_TYPE'] = os.environ.get('DR_CAR_BODY_SHELL_TYPE', 'deepracer') +if config['BODY_SHELL_TYPE'] == 'deepracer': + config['CAR_COLOR'] = os.environ.get('DR_CAR_COLOR', 'Red') +config['CAR_NAME'] = os.environ.get('DR_CAR_NAME', 'MyCar') +config['RACE_TYPE'] = os.environ.get('DR_RACE_TYPE', 'TIME_TRIAL') +config['WORLD_NAME'] = os.environ.get('DR_WORLD_NAME', 'LGSWide') +config['DISPLAY_NAME'] = os.environ.get('DR_DISPLAY_NAME', 'racer1') +config['RACER_NAME'] = os.environ.get('DR_RACER_NAME', 'racer1') + +config['ALTERNATE_DRIVING_DIRECTION'] = os.environ.get('DR_TRAIN_ALTERNATE_DRIVING_DIRECTION', os.environ.get('DR_ALTERNATE_DRIVING_DIRECTION', 'false')) +config['CHANGE_START_POSITION'] = os.environ.get('DR_TRAIN_CHANGE_START_POSITION', os.environ.get('DR_CHANGE_START_POSITION', 
'true')) +config['ROUND_ROBIN_ADVANCE_DIST'] = os.environ.get('DR_TRAIN_ROUND_ROBIN_ADVANCE_DIST', '0.05') +config['START_POSITION_OFFSET'] = os.environ.get('DR_TRAIN_START_POSITION_OFFSET', '0.00') +config['ENABLE_DOMAIN_RANDOMIZATION'] = os.environ.get('DR_ENABLE_DOMAIN_RANDOMIZATION', 'false') +config['MIN_EVAL_TRIALS'] = os.environ.get('DR_TRAIN_MIN_EVAL_TRIALS', '5') + +# Object Avoidance +if config['RACE_TYPE'] == 'OBJECT_AVOIDANCE': + config['NUMBER_OF_OBSTACLES'] = os.environ.get('DR_OA_NUMBER_OF_OBSTACLES', '6') + config['MIN_DISTANCE_BETWEEN_OBSTACLES'] = os.environ.get('DR_OA_MIN_DISTANCE_BETWEEN_OBSTACLES', '2.0') + config['RANDOMIZE_OBSTACLE_LOCATIONS'] = os.environ.get('DR_OA_RANDOMIZE_OBSTACLE_LOCATIONS', 'True') + config['IS_OBSTACLE_BOT_CAR'] = os.environ.get('DR_OA_IS_OBSTACLE_BOT_CAR', 'false') + + object_position_str = os.environ.get('DR_OA_OBJECT_POSITIONS', "") + if object_position_str != "": + object_positions = [] + for o in object_position_str.split(";"): + object_positions.append(o) + config['OBJECT_POSITIONS'] = object_positions + config['NUMBER_OF_OBSTACLES'] = str(len(object_positions)) + +# Head to Bot +if config['RACE_TYPE'] == 'HEAD_TO_BOT': + config['IS_LANE_CHANGE'] = os.environ.get('DR_H2B_IS_LANE_CHANGE', 'False') + config['LOWER_LANE_CHANGE_TIME'] = os.environ.get('DR_H2B_LOWER_LANE_CHANGE_TIME', '3.0') + config['UPPER_LANE_CHANGE_TIME'] = os.environ.get('DR_H2B_UPPER_LANE_CHANGE_TIME', '5.0') + config['LANE_CHANGE_DISTANCE'] = os.environ.get('DR_H2B_LANE_CHANGE_DISTANCE', '1.0') + config['NUMBER_OF_BOT_CARS'] = os.environ.get('DR_H2B_NUMBER_OF_BOT_CARS', '0') + config['MIN_DISTANCE_BETWEEN_BOT_CARS'] = os.environ.get('DR_H2B_MIN_DISTANCE_BETWEEN_BOT_CARS', '2.0') + config['RANDOMIZE_BOT_CAR_LOCATIONS'] = os.environ.get('DR_H2B_RANDOMIZE_BOT_CAR_LOCATIONS', 'False') + config['BOT_CAR_SPEED'] = os.environ.get('DR_H2B_BOT_CAR_SPEED', '0.2') + +local_yaml_path = 
os.path.abspath(os.path.join(os.environ.get('WORK_DIR'),'training_params.yaml'))
+print(local_yaml_path)
+with open(local_yaml_path, 'w') as yaml_file:
+ yaml.dump(config, yaml_file, default_flow_style=False, default_style='\'', explicit_start=True)
\ No newline at end of file
diff --git a/scripts/upload/upload-car.sh b/scripts/upload/upload-car.sh
new file mode 100755
index 00000000..6b1815b9
--- /dev/null
+++ b/scripts/upload/upload-car.sh
@@ -0,0 +1,78 @@
+#!/bin/bash
+
+usage() {
+ echo "Usage: $0 [-L] [-f]"
+ echo " -f Force. Do not ask for confirmation."
+ echo " -L Upload model to the local S3 bucket."
+ exit 1
+}
+
+trap ctrl_c INT
+
+function ctrl_c() {
+ echo "Requested to stop."
+ exit 1
+}
+
+while getopts ":Lf" opt; do
+ case $opt in
+ L)
+ OPT_LOCAL="Local"
+ ;;
+ f)
+ OPT_FORCE="force"
+ ;;
+ h)
+ usage
+ ;;
+ \?)
+ echo "Invalid option -$OPTARG" >&2
+ usage
+ ;;
+ esac
+done
+
+# This script creates the tar.gz file necessary to operate inside a deepracer physical car
+# The file is created directly from within the sagemaker container, using the most recent checkpoint
+
+# Find name of sagemaker container
+SAGEMAKER_CONTAINERS=$(docker ps | awk ' /algo/ { print $1 } ' | xargs)
+if [[ -n $SAGEMAKER_CONTAINERS ]]; then
+ for CONTAINER in $SAGEMAKER_CONTAINERS; do
+ CONTAINER_NAME=$(docker ps --format '{{.Names}}' --filter id=$CONTAINER)
+ CONTAINER_PREFIX=$(echo $CONTAINER_NAME | perl -n -e'/(.*)_(algo(.*))_./; print $1')
+ echo "Found Sagemaker container: $CONTAINER_NAME"
+ done
+fi
+
+#create tmp directory if it doesn't already exist
+mkdir -p $DR_DIR/tmp/car_upload
+cd $DR_DIR/tmp/car_upload
+#ensure directory is empty
+rm -r $DR_DIR/tmp/car_upload/*
+#The files we want are located inside the sagemaker container at /opt/ml/model.
Copy them to the tmp directory
+docker cp $CONTAINER_NAME:/opt/ml/model $DR_DIR/tmp/car_upload
+cd $DR_DIR/tmp/car_upload/model
+#create a tar.gz file containing all of these files
+tar -czvf carfile.tar.gz *
+
+# Upload files
+if [[ -z "${OPT_FORCE}" ]]; then
+ if [[ -n "${OPT_LOCAL}" ]]; then
+ echo "Ready to upload car model to local s3://${DR_LOCAL_S3_BUCKET}/${DR_UPLOAD_S3_PREFIX}."
+ else
+ echo "Ready to upload car model to remote s3://${DR_UPLOAD_S3_BUCKET}/${DR_UPLOAD_S3_PREFIX}."
+ fi
+ read -r -p "Are you sure? [y/N] " response
+ if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then
+ echo "Aborting."
+ exit 1
+ fi
+fi
+
+#upload to s3
+if [[ -n "${OPT_LOCAL}" ]]; then
+ aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 cp carfile.tar.gz s3://${DR_LOCAL_S3_BUCKET}/${DR_UPLOAD_S3_PREFIX}/carfile.tar.gz
+else
+ aws ${DR_UPLOAD_PROFILE} s3 cp carfile.tar.gz s3://${DR_UPLOAD_S3_BUCKET}/${DR_UPLOAD_S3_PREFIX}/carfile.tar.gz
+fi
diff --git a/scripts/upload/upload-model.sh b/scripts/upload/upload-model.sh
new file mode 100755
index 00000000..9f9d1e05
--- /dev/null
+++ b/scripts/upload/upload-model.sh
@@ -0,0 +1,199 @@
+#!/bin/bash
+
+usage() {
+ echo "Usage: $0 [-f] [-w] [-d] [-b] [-1] [-L] [-c <checkpoint>] [-p <prefix>]"
+ echo " -f Force upload. No confirmation question."
+ echo " -w Wipes the target AWS DeepRacer model structure before upload."
+ echo " -d Dry-Run mode. Does not perform any write or delete operations on target."
+ echo " -b Uploads best checkpoint. Default is last checkpoint."
+ echo " -c checkpoint Uploads the specified checkpoint number."
+ echo " -p model Uploads model from specified S3 prefix."
+ echo " -1 Increment upload name with 1 (dr-increment-upload-model)"
+ echo " -L Upload model to the local S3 bucket"
+ exit 1
+}
+
+trap ctrl_c INT
+
+function ctrl_c() {
+ echo "Requested to stop."
+ exit 1 +} + +while getopts ":fwdhbp:c:1L" opt; do + case $opt in + b) + OPT_CHECKPOINT="Best" + ;; + c) + OPT_CHECKPOINT_NUM="$OPTARG" + ;; + f) + OPT_FORCE="-f" + ;; + d) + OPT_DRYRUN="--dryrun" + ;; + p) + OPT_PREFIX="$OPTARG" + ;; + w) + OPT_WIPE="--delete" + ;; + L) + OPT_LOCAL="Local" + ;; + 1) + OPT_INCREMENT="Yes" + ;; + h) + usage + ;; + \?) + echo "Invalid option -$OPTARG" >&2 + usage + ;; + esac +done + +if [[ -n "${OPT_DRYRUN}" ]]; then + echo "*** DRYRUN MODE ***" +fi + +if [[ -n "${OPT_INCREMENT}" ]]; then + source $DR_DIR/scripts/upload/increment.sh ${OPT_FORCE} +fi + +SOURCE_S3_BUCKET=${DR_LOCAL_S3_BUCKET} +if [[ -n "${OPT_PREFIX}" ]]; then + SOURCE_S3_MODEL_PREFIX=${OPT_PREFIX} + SOURCE_S3_REWARD=${OPT_PREFIX}/reward_function.py + SOURCE_S3_METRICS=${OPT_PREFIX}/metrics + TARGET_S3_PREFIX=${OPT_PREFIX} +else + SOURCE_S3_MODEL_PREFIX=${DR_LOCAL_S3_MODEL_PREFIX} + SOURCE_S3_REWARD=${DR_LOCAL_S3_REWARD_KEY} + SOURCE_S3_METRICS=${DR_LOCAL_S3_METRICS_PREFIX} + TARGET_S3_PREFIX=${DR_UPLOAD_S3_PREFIX} +fi + +if [[ -z "${OPT_LOCAL}" ]]; then + TARGET_S3_BUCKET=${DR_UPLOAD_S3_BUCKET} + UPLOAD_PROFILE=${DR_UPLOAD_PROFILE} +else + if [[ "${TARGET_S3_PREFIX}" = "${SOURCE_S3_MODEL_PREFIX}" ]]; then + echo "Target equals source. Exiting." + exit 1 + fi + + TARGET_S3_BUCKET=${DR_LOCAL_S3_BUCKET} + UPLOAD_PROFILE=${DR_LOCAL_PROFILE_ENDPOINT_URL} +fi + +if [[ -z "${TARGET_S3_BUCKET}" ]]; then + echo "No upload bucket defined. Exiting." + exit 1 +fi + +if [[ -z "${TARGET_S3_PREFIX}" ]]; then + echo "No upload prefix defined. Exiting." + exit 1 +fi + +export WORK_DIR=${DR_DIR}/tmp/upload/ +rm -rf ${WORK_DIR} && mkdir -p ${WORK_DIR}model ${WORK_DIR}ip + +# Upload information on model. 
+TARGET_PARAMS_FILE_S3_KEY="s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/training_params.yaml" +TARGET_REWARD_FILE_S3_KEY="s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/reward_function.py" +TARGET_HYPERPARAM_FILE_S3_KEY="s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/ip/hyperparameters.json" +TARGET_METRICS_FILE_S3_KEY="s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/metrics/" + +# Check if metadata-files are available +REWARD_IN_ROOT=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 ls s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/reward_function.py 2>/dev/null | wc -l) +if [ "$REWARD_IN_ROOT" -ne 0 ]; then + REWARD_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/reward_function.py ${WORK_DIR} --no-progress | awk '/reward/ {print $4}' | xargs readlink -f 2>/dev/null) +else + echo "Looking for Reward Function in s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_REWARD}" + REWARD_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_REWARD} ${WORK_DIR} --no-progress | awk '/reward/ {print $4}' | xargs readlink -f 2>/dev/null) +fi + +METADATA_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/model/model_metadata.json ${WORK_DIR} --no-progress | awk '/model_metadata.json$/ {print $4}' | xargs readlink -f 2>/dev/null) +HYPERPARAM_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 cp s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/ip/hyperparameters.json ${WORK_DIR} --no-progress | awk '/hyperparameters.json$/ {print $4}' | xargs readlink -f 2>/dev/null) +METRICS_FILE=$(aws $DR_LOCAL_PROFILE_ENDPOINT_URL s3 sync s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_METRICS} ${WORK_DIR}/metrics --no-progress | awk '/metric/ {print $4}' | xargs readlink -f 2>/dev/null) + +if [ -n "$METADATA_FILE" ] && [ -n "$REWARD_FILE" ] && [ -n "$HYPERPARAM_FILE" ] && [ -n "$METRICS_FILE" ]; then + echo "All meta-data files found. Looking for checkpoint." +else + echo "Meta-data files are not found. 
Exiting." + exit 1 +fi + +# Download checkpoint file +echo "Looking for model to upload from s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/" +CHECKPOINT_INDEX=$(aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 cp s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/model/deepracer_checkpoints.json ${WORK_DIR}model/ --no-progress | awk '{print $4}' | xargs readlink -f 2>/dev/null) + +if [ -z "$CHECKPOINT_INDEX" ]; then + echo "No checkpoint file available at s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/model. Exiting." + exit 1 +fi + +if [ -n "$OPT_CHECKPOINT_NUM" ]; then + echo "Checking for checkpoint $OPT_CHECKPOINT_NUM" + export OPT_CHECKPOINT_NUM + CHECKPOINT_FILE=$(aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 ls s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/model/ | perl -ne'print "$1\n" if /.*\s($ENV{OPT_CHECKPOINT_NUM}_Step-[0-9]{1,7}\.ckpt)\.index/') + CHECKPOINT=$(echo $CHECKPOINT_FILE | cut -f1 -d_) + TIMESTAMP=$(date +%s) + CHECKPOINT_JSON_PART=$(jq -n '{ checkpoint: { name: $name, time_stamp: $timestamp | tonumber, avg_comp_pct: 50.0 } }' --arg name $CHECKPOINT_FILE --arg timestamp $TIMESTAMP) + CHECKPOINT_JSON=$(echo $CHECKPOINT_JSON_PART | jq '. | {last_checkpoint: .checkpoint, best_checkpoint: .checkpoint}') +elif [ -z "$OPT_CHECKPOINT" ]; then + echo "Checking for latest tested checkpoint" + CHECKPOINT_FILE=$(jq -r .last_checkpoint.name <$CHECKPOINT_INDEX) + CHECKPOINT=$(echo $CHECKPOINT_FILE | cut -f1 -d_) + CHECKPOINT_JSON=$(jq '. | {last_checkpoint: .last_checkpoint, best_checkpoint: .last_checkpoint}' <$CHECKPOINT_INDEX) + echo "Latest checkpoint = $CHECKPOINT" +else + echo "Checking for best checkpoint" + CHECKPOINT_FILE=$(jq -r .best_checkpoint.name <$CHECKPOINT_INDEX) + CHECKPOINT=$(echo $CHECKPOINT_FILE | cut -f1 -d_) + CHECKPOINT_JSON=$(jq '. 
| {last_checkpoint: .best_checkpoint, best_checkpoint: .best_checkpoint}' <$CHECKPOINT_INDEX) + echo "Best checkpoint: $CHECKPOINT" +fi + +# Find checkpoint & model files - download +if [ -n "$CHECKPOINT" ]; then + CHECKPOINT_MODEL_FILES=$(aws ${DR_LOCAL_PROFILE_ENDPOINT_URL} s3 sync s3://${SOURCE_S3_BUCKET}/${SOURCE_S3_MODEL_PREFIX}/model/ ${WORK_DIR}model/ --exclude "*" --include "${CHECKPOINT}*" --include "model_${CHECKPOINT}.pb" --include "deepracer_checkpoints.json" --no-progress | awk '{print $4}' | xargs readlink -f 2>/dev/null) + CHECKPOINT_MODEL_FILE_COUNT=$(echo $CHECKPOINT_MODEL_FILES | wc -l) + if [ "$CHECKPOINT_MODEL_FILE_COUNT" -eq 0 ]; then + echo "No model files found. Files possibly deleted. Try again." + exit 1 + fi + cp ${METADATA_FILE} ${WORK_DIR}model/ + # echo "model_checkpoint_path: \"${CHECKPOINT_FILE}\"" | tee ${WORK_DIR}model/checkpoint + echo ${CHECKPOINT_FILE} | tee ${WORK_DIR}model/.coach_checkpoint >/dev/null +else + echo "Checkpoint not found. Exiting." + exit 1 +fi + +# Create Training Params Yaml. +PARAMS_FILE=$(python3 $DR_DIR/scripts/upload/prepare-config.py) + +# Upload files +if [[ -z "${OPT_FORCE}" ]]; then + echo "Ready to upload model ${SOURCE_S3_MODEL_PREFIX} to s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/" + read -r -p "Are you sure? [y/N] " response + if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then + echo "Aborting." 
+ exit 1 + fi +fi + +# echo "" > ${WORK_DIR}model/.ready +cd ${WORK_DIR} +echo ${CHECKPOINT_JSON} >${WORK_DIR}model/deepracer_checkpoints.json +aws ${UPLOAD_PROFILE} s3 sync ${WORK_DIR}model/ s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/model/ ${OPT_DRYRUN} ${OPT_WIPE} +aws ${UPLOAD_PROFILE} s3 cp ${REWARD_FILE} ${TARGET_REWARD_FILE_S3_KEY} ${OPT_DRYRUN} +aws ${UPLOAD_PROFILE} s3 sync ${WORK_DIR}/metrics/ ${TARGET_METRICS_FILE_S3_KEY} ${OPT_DRYRUN} +aws ${UPLOAD_PROFILE} s3 cp ${PARAMS_FILE} ${TARGET_PARAMS_FILE_S3_KEY} ${OPT_DRYRUN} +aws ${UPLOAD_PROFILE} s3 cp ${HYPERPARAM_FILE} ${TARGET_HYPERPARAM_FILE_S3_KEY} ${OPT_DRYRUN} +aws ${UPLOAD_PROFILE} s3 cp ${METADATA_FILE} s3://${TARGET_S3_BUCKET}/${TARGET_S3_PREFIX}/ ${OPT_DRYRUN} diff --git a/scripts/viewer/index.template.html b/scripts/viewer/index.template.html new file mode 100755 index 00000000..1ed79dd5 --- /dev/null +++ b/scripts/viewer/index.template.html @@ -0,0 +1,456 @@ + + + + + DR-$DR_RUN_ID - $DR_LOCAL_S3_MODEL_PREFIX + + + + +
+ + + + + \ No newline at end of file diff --git a/scripts/viewer/start.sh b/scripts/viewer/start.sh new file mode 100755 index 00000000..065cbc7a --- /dev/null +++ b/scripts/viewer/start.sh @@ -0,0 +1,138 @@ +#!/usr/bin/env bash + +usage() { + echo "Usage: $0 [-t topic] [-w width] [-h height] [-q quality] [-b browser-command] [-p port]" + echo " -w Width of individual stream." + echo " -h Height of individual stream." + echo " -q Quality of the stream image." + echo " -t Topic to follow (default: /racecar/deepracer/kvs_stream)" + echo " -b Browser command (default: firefox --new-tab)" + echo " -p The port to use." + exit 1 +} + +trap ctrl_c INT + +function ctrl_c() { + echo "Requested to stop." + exit 1 +} + +# Stream definition +TOPIC="/racecar/deepracer/kvs_stream" +WIDTH=480 +HEIGHT=360 +QUALITY=75 +BROWSER=${BROWSER:-"firefox --new-tab"} +PORT=$DR_WEBVIEWER_PORT + +while getopts ":w:h:q:t:b:p:" opt; do + case $opt in + w) + WIDTH="$OPTARG" + ;; + h) + HEIGHT="$OPTARG" + ;; + q) + QUALITY="$OPTARG" + ;; + t) + TOPIC="$OPTARG" + ;; + b) + BROWSER="$OPTARG" + ;; + p) + PORT="$OPTARG" + ;; + \?)
+ echo "Invalid option -$OPTARG" >&2 + usage + ;; + esac +done + +DR_WEBVIEWER_PORT=$PORT + +export DR_VIEWER_HTML=$DR_DIR/tmp/streams-$DR_RUN_ID.html +export DR_NGINX_CONF=$DR_DIR/tmp/streams-$DR_RUN_ID.conf + +cat <<EOF >$DR_NGINX_CONF +server { + listen 80; + location / { + root /usr/share/nginx/html; + index index.html index.htm; + } +EOF + +if [[ "${DR_DOCKER_STYLE,,}" != "swarm" ]]; then + ROBOMAKER_CONTAINERS=$(docker ps --format "{{.ID}} {{.Names}}" --filter name="deepracer-${DR_RUN_ID}" | grep robomaker | cut -f1 -d\ ) +else + ROBOMAKER_SERVICE_REPLICAS=$(docker service ps deepracer-${DR_RUN_ID}_robomaker | awk '/robomaker/ { print $1 }') + for c in $ROBOMAKER_SERVICE_REPLICAS; do + ROBOMAKER_CONTAINER_IP=$(docker inspect $c | jq -r '.[].NetworksAttachments[] | select (.Network.Spec.Name == "sagemaker-local") | .Addresses[0] ' | cut -f1 -d/) + ROBOMAKER_CONTAINERS="${ROBOMAKER_CONTAINERS} ${ROBOMAKER_CONTAINER_IP}" + done +fi + +if [ -z "$ROBOMAKER_CONTAINERS" ]; then + echo "No running robomakers. Exiting."
+ exit +fi + +# Expose the dimensions to the HTML template +export QUALITY +export WIDTH +export HEIGHT +# Create .js array of robomakers to pass to the HTML template +export ROBOMAKER_CONTAINERS_HTML="" +for c in $ROBOMAKER_CONTAINERS; do + ROBOMAKER_CONTAINERS_HTML+="'$c'," +done +SCRIPT_PATH="${BASH_SOURCE:-$0}" +ABS_SCRIPT_PATH="$(realpath "${SCRIPT_PATH}")" +ABS_DIRECTORY="$(dirname "${ABS_SCRIPT_PATH}")" +INDEX_HTML_TEMPLATE="${ABS_DIRECTORY}/index.template.html" +# Replace all variables in the HTML template and create the viewer html file +envsubst <"${INDEX_HTML_TEMPLATE}" >$DR_VIEWER_HTML + +# Add proxy paths in the NGINX file +for c in $ROBOMAKER_CONTAINERS; do + echo " location /$c { proxy_pass http://$c:8080; rewrite /$c/(.*) /\$1 break; }" >>$DR_NGINX_CONF +done +echo "}" >>$DR_NGINX_CONF + +# Check if we will use Docker Swarm or Docker Compose +STACK_NAME="deepracer-$DR_RUN_ID-viewer" +COMPOSE_FILES=$DR_DIR/docker/docker-compose-webviewer.yml + +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + if [ "$DR_DOCKER_MAJOR_VERSION" -gt 24 ]; then + DETACH_FLAG="--detach=true" + fi + + COMPOSE_FILES="$COMPOSE_FILES -c $DR_DIR/docker/docker-compose-webviewer-swarm.yml" + docker stack deploy -c $COMPOSE_FILES $DETACH_FLAG $STACK_NAME +else + docker compose -f $COMPOSE_FILES -p $STACK_NAME up -d +fi + +# Start the browser if a local X server is in use and a display is defined. +if [[ -n "${DISPLAY}" && "${DR_HOST_X,,}" == "true" ]]; then + echo "Starting browser '$BROWSER'."
+ if [ "${DR_DOCKER_STYLE,,}" == "swarm" ]; then + sleep 5 + fi + echo Launching $BROWSER "http://127.0.0.1:${DR_WEBVIEWER_PORT}" + $BROWSER "http://127.0.0.1:${DR_WEBVIEWER_PORT}" & +fi + +CURRENT_CONTAINER_HASH=$(docker ps | grep dr_viewer | head -c 12) + +IP_ADDRESSES="$(hostname -I)" +echo "The viewer will be available on the following hosts after initialization:" +for ip in $IP_ADDRESSES; do + echo "http://${ip}:${PORT}" +done diff --git a/scripts/viewer/stop.sh b/scripts/viewer/stop.sh new file mode 100755 index 00000000..a615f452 --- /dev/null +++ b/scripts/viewer/stop.sh @@ -0,0 +1,14 @@ +#!/usr/bin/env bash + +STACK_NAME="deepracer-$DR_RUN_ID-viewer" +COMPOSE_FILES=$DR_DIR/docker/docker-compose-webviewer.yml + +# Check if we will use Docker Swarm or Docker Compose +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + docker stack rm $STACK_NAME +else + export DR_VIEWER_HTML=$DR_DIR/tmp/streams-$DR_RUN_ID.html + export DR_NGINX_CONF=$DR_DIR/tmp/streams-$DR_RUN_ID.conf + + docker compose -f $COMPOSE_FILES -p $STACK_NAME down +fi diff --git a/utils/Dockerfile.gpu-detect b/utils/Dockerfile.gpu-detect new file mode 100644 index 00000000..ae9ae334 --- /dev/null +++ b/utils/Dockerfile.gpu-detect @@ -0,0 +1,4 @@ +FROM nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 +RUN apt-get update && apt-get install -y --no-install-recommends wget python3 +RUN wget https://gist.githubusercontent.com/f0k/63a664160d016a491b2cbea15913d549/raw/f25b6b38932cfa489150966ee899e5cc899bf4a6/cuda_check.py +CMD ["python3","cuda_check.py"] \ No newline at end of file diff --git a/utils/cuda-check-tf.py b/utils/cuda-check-tf.py new file mode 100644 index 00000000..b3360ca8 --- /dev/null +++ b/utils/cuda-check-tf.py @@ -0,0 +1,10 @@ +from tensorflow.python.client import device_lib +import tensorflow as tf + +def get_available_gpus(): + local_device_protos = device_lib.list_local_devices() + return [x.name for x in local_device_protos if x.device_type == 'GPU'] + +gpu_options =
tf.GPUOptions(per_process_gpu_memory_fraction=0.05) +sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) +print(get_available_gpus()) diff --git a/utils/cuda-check.sh b/utils/cuda-check.sh new file mode 100755 index 00000000..d9390757 --- /dev/null +++ b/utils/cuda-check.sh @@ -0,0 +1,5 @@ +#!/usr/bin/env bash + +CONTAINER_ID=$(docker create --rm -ti -e CUDA_VISIBLE_DEVICES --name cuda-check awsdeepracercommunity/deepracer-robomaker:$DR_ROBOMAKER_IMAGE "python3 cuda-check-tf.py") +docker cp $DR_DIR/utils/cuda-check-tf.py $CONTAINER_ID:/opt/install/ +docker start -a $CONTAINER_ID diff --git a/utils/download-car-model.py b/utils/download-car-model.py new file mode 100755 index 00000000..2aeb3ba4 --- /dev/null +++ b/utils/download-car-model.py @@ -0,0 +1,130 @@ +#!/usr/bin/env python3 +""" +This script checks for model files in an S3 bucket, downloads, and renames them based on a specified pattern. + +Environment Variables: +- DR_LOCAL_S3_BUCKET: Name of the S3 bucket. +- DR_LOCAL_S3_PROFILE: AWS profile name for boto3 session. +- DR_REMOTE_MINIO_URL: (Optional) MinIO server URL. + +Usage: + python download-car-model.py --pattern +""" + +import boto3 +import os +import fnmatch +import argparse + +# Load environment variables +bucket_name = os.getenv('DR_LOCAL_S3_BUCKET') +profile_name = os.getenv('DR_LOCAL_S3_PROFILE') +minio_url = os.getenv('DR_REMOTE_MINIO_URL') + +# Set up boto3 session with the specified profile +session = boto3.Session(profile_name=profile_name) +endpoint_url = minio_url if minio_url else None +s3 = session.client('s3', endpoint_url=endpoint_url) + +def check_model_file(prefix): + """ + Check if a model.tar.gz file exists in the specified prefix. + + Args: + prefix (str): The prefix to check within the S3 bucket. + + Returns: + bool: True if the model file is found, False otherwise. 
+ """ + try: + response = s3.list_objects_v2(Bucket=bucket_name, Prefix=f"{prefix}output/") + for obj in response.get('Contents', []): + if obj['Key'].endswith('model.tar.gz'): + print(f"Found model.tar.gz in {prefix}output/") + return f"{obj['Key']}" + + response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix) + for obj in response.get('Contents', []): + if obj['Key'].endswith('carfile.tar.gz'): + print(f"Found carfile.tar.gz in {prefix}") + return f"{obj['Key']}" + + print(f"No model.tar.gz found in {prefix}output/ and no carfile.tar.gz found in {prefix}") + return None + except Exception as e: + print(f"Error checking {prefix}: {e}") + return None + +def list_matching_prefixes(bucket_name, prefix_pattern): + """ + List all prefixes in the specified S3 bucket that match the given pattern. + + Args: + bucket_name (str): The name of the S3 bucket. + prefix_pattern (str): The pattern to match prefixes against. + + Returns: + list: A list of matching prefixes. + """ + try: + response = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/') + prefixes = [prefix['Prefix'] for prefix in response.get('CommonPrefixes', [])] + matching_prefixes = fnmatch.filter(prefixes, prefix_pattern) + return matching_prefixes + except Exception as e: + print(f"Error listing prefixes: {e}") + return [] + +def download_and_rename_model_file(prefix, file_key, output_folder="."): + """ + Download and rename the model.tar.gz file from the specified file key. + + Args: + prefix (str): The prefix of the model file. + file_key (str): The S3 key of the model file to download. + output_folder (str): The folder where the downloaded file should be placed. Defaults to the current directory. + + Returns: + bool: True if the model file is downloaded and renamed, False otherwise. 
+ """ + try: + if not os.path.exists(output_folder): + os.makedirs(output_folder) + local_filename = os.path.join(output_folder, f"{prefix.rstrip('/')}.tar.gz") + s3.download_file(bucket_name, file_key, local_filename) + print(f"Downloaded and renamed {file_key} to {local_filename}") + return True + except Exception as e: + print(f"Error downloading {file_key}: {e}") + return False + +def validate_s3_connection(): + """ + Validate the S3 connection using the provided bucket name and profile name. + + Raises: + ValueError: If bucket name or profile name is not defined. + ConnectionError: If unable to connect to the S3 bucket. + """ + if not bucket_name or not profile_name: + raise ValueError("Bucket name and profile name must be defined in environment variables.") + + try: + s3.head_bucket(Bucket=bucket_name) + print(f"Successfully connected to bucket: {bucket_name}") + except Exception as e: + raise ConnectionError(f"Unable to connect to the bucket: {e}") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='Check and download model files from S3.') + parser.add_argument('--pattern', type=str, required=True, help='Pattern for prefixes to check') + parser.add_argument('--output_folder', type=str, default='.', help='Folder to store downloaded files') + args = parser.parse_args() + + validate_s3_connection() + + matching_prefixes = list_matching_prefixes(bucket_name, args.pattern) + for prefix in matching_prefixes: + model_file_path = check_model_file(prefix) + if model_file_path: + download_and_rename_model_file(prefix, model_file_path, args.output_folder) \ No newline at end of file diff --git a/utils/evaluate.sh b/utils/evaluate.sh new file mode 100755 index 00000000..9c06bcbd --- /dev/null +++ b/utils/evaluate.sh @@ -0,0 +1,54 @@ +#!/bin/bash + +# This script evaluates DeepRacer models by managing the evaluation process. +# It requires one argument: the path to the environment configuration file. 
+# The script sources environment variables from the specified file, then: +# 1. Validates the existence of the environment file. +# 2. Sources the activate.sh script to set up necessary environment variables. +# 3. Prints the evaluation configuration, including Run ID, Model Name, and Track. +# 4. Executes the evaluation process by stopping any ongoing evaluation and starting a new one. + +# To run this script every 3 minutes using crontab, follow these steps: +# 1. Open the crontab editor by executing `crontab -e` in your terminal. +# 2. Add the following line to schedule the script: +# `*/3 * * * * /utils/evaluate.sh run.env >> /evaluate.log 2>&1` +# 3. Save and close the editor. The script is now scheduled to run every 3 minutes. + +if [ "$#" -ne 1 ]; then + echo "Usage: $0 <environment-file>" + exit 1 +fi + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)" +DR_DIR="$(dirname $SCRIPT_DIR)" +ENV_FILE="$1" + +if [[ -f "$DR_DIR/$ENV_FILE" ]]; then + source $DR_DIR/bin/activate.sh $DR_DIR/$ENV_FILE +else + echo "File $ENV_FILE does not exist." + exit 1 +fi + +printf "\n##################################################\n" +printf "### %-15s %-15s\n" "Configuration:" "$ENV_FILE" +printf "### %-15s %-15s\n" "Run ID:" "$DR_RUN_ID" +printf "### %-15s %-15s\n" "Model Name:" "$DR_LOCAL_S3_MODEL_PREFIX" +printf "### %-15s %-15s\n" "Track:" "$DR_WORLD_NAME" +printf "### %-15s %-15s\n" "Start:" "$(date)" +printf "##################################################\n\n" + +dr-stop-evaluation + +# Check if Docker style is set to swarm and wait for all containers to stop +if [ "$DR_DOCKER_STYLE" == "swarm" ]; then + STACK_NAME="deepracer-eval-$DR_RUN_ID" + STACK_CONTAINERS=$(docker stack ps $STACK_NAME 2>/dev/null | wc -l) + while [[ "$STACK_CONTAINERS" -gt 1 ]]; do + echo "Waiting for all containers in the stack to stop..."
+ sleep 5 + STACK_CONTAINERS=$(docker stack ps $STACK_NAME 2>/dev/null | wc -l) + done +fi + +dr-start-evaluation -q diff --git a/utils/sample-createspot.sh b/utils/sample-createspot.sh new file mode 100644 index 00000000..564174e2 --- /dev/null +++ b/utils/sample-createspot.sh @@ -0,0 +1,141 @@ +#!/usr/bin/env bash + +## This is sample code that will generally show you how to launch a spot instance on AWS and leverage the +## automation built into deepracer-for-cloud to automatically start training +## Changes required to work: +## Input location where your training will take place -- S3_LOCATION +## Input security group, iam role, and key-name + +## First you need to tell the script where in s3 your training will take place +## can be either a bucket at the root level, or a bucket/prefix. don't include the s3:// + +S3_LOCATION=<#########> + +## extract bucket location +BUCKET=${S3_LOCATION%%/*} + +## extract prefix location +if [[ "$S3_LOCATION" == *"/"* ]] +then + PREFIX=${S3_LOCATION#*/} +else + PREFIX="" +fi + +## Fill these out with your custom information if you want to upload and submit to leaderboard. not required to run +DR_UPLOAD_S3_PREFIX=######## + +## set the instance type you want to launch +INSTANCE_TYPE=c5.2xlarge + +## if you want to modify additional variables from the default, add them here, then add them to the section further below called "replace static parameters". I've only done the world name for now +WORLD_NAME=FS_June2020 + +## modify this if you want additional robomaker workers +DR_WORKERS=1 + +## select which images you want to use. these will be used later for a docker pull +DR_SAGEMAKER_IMAGE=cpu-avx-mkl +DR_ROBOMAKER_IMAGE=cpu-avx2 + +## check the s3 location for existing training folders +## automatically determine the latest training run (highest number), and set model parameters accordingly +## this script assumes the format rl-deepracer-1, rl-deepracer-2, etc.
you will need to modify if your schema differs + +LAST_TRAINING=$(aws s3 ls $S3_LOCATION/rl-deepracer | sort -t - -k 3 -g | tail -n 1 | awk '{print $2}') +## drop trailing slash +LAST_TRAINING=$(echo $LAST_TRAINING | sed 's:/*$::') + +CONFIG_FILE="./run.env" +OLD_SYSTEMENV="./system.env" + +## incorporate logic from increment.sh, slightly modified to use last training +OPT_DELIM='-' +## Read in data +CURRENT_RUN_MODEL=$(aws s3 ls $S3_LOCATION/rl-deepracer | sort -t - -k 3 -g | tail -n 1 | awk '{print $2}') +## drop trailing slash +CURRENT_RUN_MODEL=$(echo $LAST_TRAINING | sed 's:/*$::') +## get number at the end +CURRENT_RUN_MODEL_NUM=$(echo "${CURRENT_RUN_MODEL}" | \ + awk -v DELIM="${OPT_DELIM}" '{ n=split($0,a,DELIM); if (a[n] ~ /[0-9]*/) print a[n]; else print ""; }') + +if [ -z $LAST_TRAINING ] +then + echo No prior training found + if [[ $PREFIX == "" ]] + then + NEW_RUN_MODEL=rl-deepracer-1 + else + NEW_RUN_MODEL="$PREFIX/rl-deepracer-1" + fi + PRETRAINED=False + CURRENT_RUN_MODEL=$NEW_RUN_MODEL +else + + NEW_RUN_MODEL_NUM=$(echo "${CURRENT_RUN_MODEL_NUM} + 1" | bc ) + PRETRAINED=True + + if [[ $PREFIX == "" ]] + then + NEW_RUN_MODEL=$(echo $CURRENT_RUN_MODEL | sed "s/${CURRENT_RUN_MODEL_NUM}\$/${NEW_RUN_MODEL_NUM}/") + else + NEW_RUN_MODEL=$(echo $CURRENT_RUN_MODEL | sed "s/${CURRENT_RUN_MODEL_NUM}\$/${NEW_RUN_MODEL_NUM}/") + NEW_RUN_MODEL="$PREFIX/$NEW_RUN_MODEL" + CURRENT_RUN_MODEL="$PREFIX/$CURRENT_RUN_MODEL" + fi + echo Last training was $CURRENT_RUN_MODEL so next training is $NEW_RUN_MODEL +fi + +if [[ $PREFIX == "" ]] +then + CUSTOM_FILES_PREFIX="custom_files" +else + CUSTOM_FILES_PREFIX="$PREFIX/custom_files" +fi + +## Replace dynamic parameters in run.env (still local to your directory) +sed -i.bak -re "s:(DR_LOCAL_S3_PRETRAINED_PREFIX=).*$:\1$CURRENT_RUN_MODEL:g; s:(DR_LOCAL_S3_PRETRAINED=).*$:\1$PRETRAINED:g; s:(DR_LOCAL_S3_MODEL_PREFIX=).*$:\1$NEW_RUN_MODEL:g; s:(DR_LOCAL_S3_CUSTOM_FILES_PREFIX=).*$:\1$CUSTOM_FILES_PREFIX:g" "$CONFIG_FILE" +sed 
-i.bak -re "s/(DR_LOCAL_S3_BUCKET=).*$/\1$BUCKET/g" "$CONFIG_FILE" + +## Replace static parameters in run.env (still local to your directory) +sed -i.bak -re "s/(DR_UPLOAD_S3_PREFIX=).*$/\1$DR_UPLOAD_S3_PREFIX/g" "$CONFIG_FILE" +sed -i.bak -re "s/(DR_WORLD_NAME=).*$/\1$WORLD_NAME/g" "$CONFIG_FILE" + +## Replace static parameters in the system.env file, including sagemaker and robomaker images (still local to your directory) and the number of DR_WORKERS +sed -i.bak -re "s/(DR_UPLOAD_S3_BUCKET=).*$/\1$DR_UPLOAD_S3_BUCKET/g; s/(DR_SAGEMAKER_IMAGE=).*$/\1$DR_SAGEMAKER_IMAGE/g; s/(DR_ROBOMAKER_IMAGE=).*$/\1$DR_ROBOMAKER_IMAGE/g; s/(DR_WORKERS=).*$/\1$DR_WORKERS/g" "$OLD_SYSTEMENV" + +## upload the new run.env and system.env files into your S3 bucket (same s3 location identified earlier) +## files are loaded into the node-config folder/prefix. You can also upload other files to node-config, and they +## will sync to the EC2 instance as part of the autorun script later. If you add other files, make sure they are +## in node-config in the same directory structure as DRfC; example: s3location/node-config/scripts/training/.start.sh +RUNENV_LOCATION=$S3_LOCATION/node-config/run.env +SYSENV_LOCATION=$S3_LOCATION/node-config/system.env + +aws s3 cp ./run.env s3://$RUNENV_LOCATION +aws s3 cp ./system.env s3://$SYSENV_LOCATION + +## upload a custom autorun script to S3.
there is a default autorun script in the repo that will be used unless a custom one is specified here instead +#aws s3 cp ./autorun.sh s3://$S3_LOCATION/autorun.sh + +## upload custom files -- if you don't want this, comment these lines out +aws s3 cp ./model_metadata.json s3://$S3_LOCATION/custom_files/model_metadata.json +aws s3 cp ./reward_function.py s3://$S3_LOCATION/custom_files/reward_function.py +aws s3 cp ./hyperparameters.json s3://$S3_LOCATION/custom_files/hyperparameters.json + +## launch an EC2 instance +## update with your own settings, including key-name, security-group, and iam-instance-profile at a minimum +## user data includes a command to create a .txt file which simply contains the name of the s3 location +## this filename will be used as fundamental input to the autorun.sh script run later on that instance +## you need to ensure you have proper IAM permissions to launch this instance + +aws ec2 run-instances \ + --image-id ami-085925f297f89fce1 \ + --count 1 \ + --instance-type $INSTANCE_TYPE \ + --key-name <####keyname####> \ + --security-group-ids sg-<####sgid####> \ + --block-device-mappings 'DeviceName=/dev/sda1,Ebs={DeleteOnTermination=true,VolumeSize=40}' \ + --iam-instance-profile Arn=arn:aws:iam::<####acct_num####>:instance-profile/<####role_name####> \ + --instance-market-options MarketType=spot \ + --user-data "#!/bin/bash + su -c 'git clone https://github.com/aws-deepracer-community/deepracer-for-cloud.git && echo "$S3_LOCATION/node-config" > /home/ubuntu/deepracer-for-cloud/autorun.s3url && /home/ubuntu/deepracer-for-cloud/bin/prepare.sh' - ubuntu" diff --git a/utils/setup-xorg.sh b/utils/setup-xorg.sh new file mode 100755 index 00000000..c00a2171 --- /dev/null +++ b/utils/setup-xorg.sh @@ -0,0 +1,35 @@ +#!/bin/bash + +set -e + +# Script to install basic X-Windows on a headless instance (e.g. in EC2) + +# Script shall run as user, not root. Sudo will be used when needed. +if [[ $EUID == 0 ]]; then + echo "ERROR: Do not run as root / via sudo."
+ exit 1 +fi + +# Deepracer environment variables must be set. +if [ -z "$DR_DIR" ]; then + echo "ERROR: DR_DIR not set. Run 'source bin/activate.sh' before setup-xorg.sh." + exit 1 +fi + +# Install additional packages +sudo apt-get install xinit xserver-xorg-legacy x11-xserver-utils x11-utils \ + menu mesa-utils xterm mwm x11vnc pkg-config screen -y --no-install-recommends + +# Configure +sudo sed -i -e "s/console/anybody/" /etc/X11/Xwrapper.config +BUS_ID=$(nvidia-xconfig --query-gpu-info | grep "PCI BusID" | cut -f2- -d: | sed -e 's/^[[:space:]]*//' | head -1) +sudo nvidia-xconfig --busid=$BUS_ID -o $DR_DIR/tmp/xorg.conf + +touch ~/.Xauthority + +sudo tee -a $DR_DIR/tmp/xorg.conf <&2 + usage + ;; + esac +done + +FILE=$DR_DIR/tmp/streams-$DR_RUN_ID.html + +# Check if we will use Docker Swarm or Docker Compose +if [[ "${DR_DOCKER_STYLE,,}" == "swarm" ]]; then + echo "This script does not support swarm mode. Use 'dr-start-viewer'." + exit +fi + +echo "DR-$DR_RUN_ID - $DR_LOCAL_S3_MODEL_PREFIX - $TOPIC

DR-$DR_RUN_ID - $DR_LOCAL_S3_MODEL_PREFIX - $TOPIC

" >$FILE + +ROBOMAKER_CONTAINERS=$(docker ps --format "{{.ID}}" --filter name=deepracer-$DR_RUN_ID --filter "ancestor=${DR_SIMAPP_SOURCE}:${DR_SIMAPP_VERSION}") +if [ -z "$ROBOMAKER_CONTAINERS" ]; then + echo "No running robomakers. Exiting." + exit +fi + +for c in $ROBOMAKER_CONTAINERS; do + C_PORT=$(docker inspect $c | jq -r '.[0].NetworkSettings.Ports["8080/tcp"][0].HostPort') + C_URL="http://localhost:${C_PORT}/stream?topic=${TOPIC}&quality=${QUALITY}&width=${WIDTH}&height=${HEIGHT}" + C_IMG="" + echo $C_IMG >>$FILE +done + +echo "" >>$FILE +echo "Starting browser '$BROWSER'." +$BROWSER $(readlink -f $FILE) & diff --git a/utils/start-xorg.sh b/utils/start-xorg.sh new file mode 100755 index 00000000..49b5686c --- /dev/null +++ b/utils/start-xorg.sh @@ -0,0 +1,47 @@ +#!/bin/bash + +set -e + +# Script shall run as user, not root. Sudo will be used when needed. +if [[ $EUID == 0 ]]; then + echo "ERROR: Do not run as root / via sudo." + exit 1 +fi + +# X must not be running when we try to start it. +if timeout 1s xset -display $DR_DISPLAY q &>/dev/null; then + echo "ERROR: X Server already running on display $DR_DISPLAY." + exit 1 +fi + +# Deepracer environment variables must be set. +if [ -z "$DR_DIR" ]; then + echo "ERROR: DR_DIR not set. Run 'source bin/activate.sh' before start-xorg.sh." + exit 1 +fi + +if [ -z "$DR_DISPLAY" ]; then + echo "ERROR: DR_DISPLAY not set. Ensure the variable is configured in system.env." + exit 1 +fi + +# Start inside a sudo-screen to prevent it from stopping when disconnecting terminal. +sudo screen -d -S DeepracerXorg -m bash -c "xinit /usr/bin/mwm -display $DR_DISPLAY -- /usr/lib/xorg/Xorg $DR_DISPLAY -config $DR_DIR/tmp/xorg.conf > $DR_DIR/tmp/xorg.log 2>&1" + +# Screen detaches; let it have some time to start X. +sleep 1 + +if [[ "${DR_GUI_ENABLE,,}" == "true" ]]; then + x11vnc -bg -forever -no6 -nopw -rfbport 5901 -rfbportv6 -1 -loop -display WAIT$DR_DISPLAY & + sleep 1 +fi + +# Create xauth mit-magic-cookie. 
+xauth generate $DR_DISPLAY + +# Check if X started successfully. If not, print error message and exit. +if timeout 1s xset -display $DR_DISPLAY q &>/dev/null; then + echo "X Server started on display $DR_DISPLAY" +else + echo "Server failed to start on display $DR_DISPLAY" +fi diff --git a/utils/timed-stop.sh b/utils/timed-stop.sh new file mode 100755 index 00000000..f53c93fb --- /dev/null +++ b/utils/timed-stop.sh @@ -0,0 +1,8 @@ +#!/usr/bin/env bash + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)" +DR_DIR="$(dirname $SCRIPT_DIR)" +ENV_FILE="$1" + +source $DR_DIR/bin/activate.sh $DR_DIR/$1 +dr-stop-training \ No newline at end of file diff --git a/utils/upload-rotate.sh b/utils/upload-rotate.sh new file mode 100755 index 00000000..5c706081 --- /dev/null +++ b/utils/upload-rotate.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# This script uploads the latest DeepRacer model and activates the necessary environment. +# It processes command line options to customize the environment file path, enable local upload, and specify an evaluation environment file. +# After processing the options, it activates the environment, uploads the model, and updates the evaluation environment file with the new model prefix if specified. +# +# Usage: +# ./upload-rotate.sh [-e ] [-L] [-E ] [-c ] [-v] +# +# Options: +# -c Specify the path to the counter file. This is optional. +# -e Specify the path to the environment configuration file. Defaults to 'run.env' in the script's directory. +# -L Enable local upload. This option does not require a value. +# -v Add more verbose logging, capturing iteration and entropy numbers. +# -E Specify the path to the evaluation environment configuration file. This is optional. +# -C Upload the car file. This option does not require a value. +# +# Example: +# ./upload-rotate.sh -e custom.env -L -E eval.env +# +# To run this script manually, navigate to its directory and execute it with desired options. 
+# Ensure you have the necessary permissions to execute the script. + +# Navigate to the script directory +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +DR_DIR="$(dirname "$SCRIPT_DIR")" + +# Default environment file path +ENV_FILE="$DR_DIR/run.env" +LOCAL_UPLOAD="" +EVAL_ENV_FILE="" + +# Process command line options +while getopts "e:LE:vc:C" opt; do + case $opt in + c) COUNTER_FILE="$OPTARG" ;; + e) ENV_FILE="$OPTARG" ;; + L) LOCAL_UPLOAD="-L" ;; + E) EVAL_ENV_FILE="$OPTARG" ;; + v) VERBOSE_LOGGING="true" ;; + C) CAR_FILE="-C" ;; + *) echo "Invalid option: -$OPTARG" >&2; exit 1 ;; + esac +done + +# If a counter file is specified, increment the counter +if [ -n "$COUNTER_FILE" ]; then + if [ -f "$COUNTER_FILE" ]; then + COUNTER=$(cat "$COUNTER_FILE") + COUNTER=$((COUNTER + 1)) + echo "$COUNTER" > "$COUNTER_FILE" + export UPLOAD_COUNTER=$COUNTER + else + echo "Error: Counter file '$COUNTER_FILE' not found." >&2 + exit 1 + fi +fi + +# Activate the environment +if [ -f "$ENV_FILE" ]; then + source "$DR_DIR/bin/activate.sh" "$ENV_FILE" +else + if [ -f "$DR_DIR/$ENV_FILE" ]; then + source "$DR_DIR/bin/activate.sh" "$DR_DIR/$ENV_FILE" + else + echo "Error: Environment file '$ENV_FILE' not found." >&2 + exit 1 + fi +fi + +# Execute the upload command +if [ -n "$COUNTER_FILE" ]; then + dr-upload-model $LOCAL_UPLOAD -f +else + dr-upload-model $LOCAL_UPLOAD -1 -f +fi +dr-update + +# If the car file option is specified, upload the car file +if [ -n "$CAR_FILE" ]; then + dr-upload-car-zip $LOCAL_UPLOAD -f +fi + +# If an evaluation environment file is specified then alter the model prefix to enable evaluation +if [ -n "$EVAL_ENV_FILE" ]; then + if [ ! -f "$EVAL_ENV_FILE" ]; then + if [ -f "$DR_DIR/$EVAL_ENV_FILE" ]; then + EVAL_ENV_FILE="$DR_DIR/$EVAL_ENV_FILE" + else + echo "Error: Evaluation environment file '$EVAL_ENV_FILE' not found." 
>&2 + exit 1 + fi + fi + MODEL_PREFIX=$(echo $DR_UPLOAD_S3_PREFIX) + echo "Updating evaluation environment file $EVAL_ENV_FILE to use $MODEL_PREFIX" + sed -i "s/DR_LOCAL_S3_MODEL_PREFIX=.*/DR_LOCAL_S3_MODEL_PREFIX=$MODEL_PREFIX/" $EVAL_ENV_FILE +fi + +printf "\n############################################################\n" +printf "### %-15s %-15s\n" "Configuration:" "$ENV_FILE" +printf "### %-15s %-15s\n" "Model Name:" "$DR_LOCAL_S3_MODEL_PREFIX" +printf "### %-15s %-15s\n" "Uploaded Model:" "$DR_UPLOAD_S3_PREFIX" + +# If verbose logging is enabled, retrieve the entropy and iteration numbers. +if [ -n "$VERBOSE_LOGGING" ]; then + CONTAINER_ID=$(docker ps -f "name=deepracer-${DR_RUN_ID}_rl_coach" --format "{{.ID}}") + if [ -n "$CONTAINER_ID" ]; then + LAST_ITERATION=$(docker logs --since 20m "$CONTAINER_ID" 2>/dev/null | awk '{if (match($0, /Best checkpoint number: ([0-9]+), Last checkpoint number: ([0-9]+)/, arr)) {print arr[2]}}' | tail -n 1) + printf "### %-15s %-15s\n" "Last iteration:" "$LAST_ITERATION" + + ENTROPY=$(docker logs --since 20m "$CONTAINER_ID" 2>/dev/null | awk '{if (match($0, /Entropy=([0-9.]+)/, arr)) {print arr[1]}}' | tail -n 1) + printf "### %-15s %-15s\n" "Entropy:" "$ENTROPY" + fi +fi + +printf "### %-15s %-15s\n" "Completed at:" "$(date)" +printf "############################################################\n\n"
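A side note on the checkpoint-rotation logic in the upload script at the top of this diff (an illustrative sketch, not part of the change itself): when a specific checkpoint number is requested, the script synthesizes a `deepracer_checkpoints.json` whose `last_checkpoint` and `best_checkpoint` both point at the same checkpoint file. The checkpoint name below is hypothetical.

```shell
#!/usr/bin/env bash
# Standalone sketch of the jq calls used in the upload script above.
# "17_Step-40000.ckpt" is a made-up checkpoint file name.
CHECKPOINT_FILE="17_Step-40000.ckpt"
TIMESTAMP=$(date +%s)

# Wrap the name and timestamp into a checkpoint object with a fixed completion percentage...
CHECKPOINT_JSON_PART=$(jq -n '{ checkpoint: { name: $name, time_stamp: $timestamp | tonumber, avg_comp_pct: 50.0 } }' \
  --arg name "$CHECKPOINT_FILE" --arg timestamp "$TIMESTAMP")

# ...then publish it as both the last and the best checkpoint.
CHECKPOINT_JSON=$(echo "$CHECKPOINT_JSON_PART" | jq '{last_checkpoint: .checkpoint, best_checkpoint: .checkpoint}')

echo "$CHECKPOINT_JSON"
```

Writing this JSON next to the synced model files is what lets the target bucket treat the chosen checkpoint as both the newest and the best one.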
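Similarly, the run-name rotation in `sample-createspot.sh` can be sketched in isolation (the run name below is hypothetical, not taken from the diff): split the model prefix on the delimiter, take the trailing number, increment it, and substitute it back.

```shell
#!/usr/bin/env bash
# Standalone sketch of the increment logic in sample-createspot.sh.
# "rl-deepracer-7" is a hypothetical previous training run.
OPT_DELIM='-'
CURRENT_RUN_MODEL="rl-deepracer-7"

# Take the last delimiter-separated field as the run number...
CURRENT_RUN_MODEL_NUM=$(echo "${CURRENT_RUN_MODEL}" | \
  awk -v DELIM="${OPT_DELIM}" '{ n=split($0,a,DELIM); print a[n]; }')

# ...increment it...
NEW_RUN_MODEL_NUM=$((CURRENT_RUN_MODEL_NUM + 1))

# ...and swap it back in at the end of the name.
NEW_RUN_MODEL=$(echo "$CURRENT_RUN_MODEL" | sed "s/${CURRENT_RUN_MODEL_NUM}\$/${NEW_RUN_MODEL_NUM}/")

echo "$NEW_RUN_MODEL"   # rl-deepracer-8
```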