diff --git a/baselines/diabetic_retinopathy_detection/README.md b/baselines/diabetic_retinopathy_detection/README.md
index b042b0611..21998e73f 100644
--- a/baselines/diabetic_retinopathy_detection/README.md
+++ b/baselines/diabetic_retinopathy_detection/README.md
@@ -27,91 +27,55 @@ Bayesian deep learning seeks to equip deep neural networks with the ability to p
 Set up and activate the Python environment by executing
 
 ```
-conda create -n ub python=3.8
+conda create -n ub python=3.10
 conda activate ub
 python3 -m pip install -e .[models,jax,tensorflow,torch,retinopathy]  # In uncertainty-baselines root directory
-pip install "git+https://github.com/google-research/robustness_metrics.git#egg=robustness_metrics"
-pip install 'git+https://github.com/google/edward2.git'
 ```
 
 ## Data Installation
 
-Because the data is distributed through Kaggle, we need to take a manual route to downloading.
+The instructions below download and preprocess the data needed to train and evaluate models on the Country and Severity Shifts.
 
-1. Download from Kaggle: https://www.kaggle.com/c/diabetic-retinopathy-detection
+The EyePACS dataset (used in Country and Severity Shift) and the APTOS 2019 dataset (used in Country Shift) are distributed through Kaggle, which requires us to manually download the data and place it in the correct directory.
 
-2. Extract everything to ``$DATA_DIR/downloads/manual``; your directory should look like
+1. Download the raw datasets from Kaggle:
+   * [EyePACS](https://www.kaggle.com/c/diabetic-retinopathy-detection)
+   * [APTOS 2019](https://www.kaggle.com/c/aptos2019-blindness-detection)
 
-``sample/  sampleSubmission.csv  test/  train/  trainLabels.csv``
+2. Extract the EyePACS dataset to ``$DATA_DIR/ub_diabetic_retinopathy_detection/manual``. The directory structure should look like this:
 
-3. Confirm successful download of files
+    ``sample/  sampleSubmission.csv  test/  train/  trainLabels.csv``
 
-You should have 35,126 training images and 53,576 test images, which should be located in manual/train and manual/test.
+3. Extract the APTOS dataset to ``$DATA_DIR/aptos/manual``. The directory structure should look like this:
 
-You may check this with the command 
+    ``sample_submission.csv  test.csv  test_images/  train.csv  train_images/``
 
-`ls -1 | wc -l`
+4. Confirm that the files were extracted successfully. The following commands should print these file counts:
+    ```
+    $ ls -1 $DATA_DIR/ub_diabetic_retinopathy_detection/manual/train | wc -l
+    35126
+    $ ls -1 $DATA_DIR/ub_diabetic_retinopathy_detection/manual/test | wc -l
+    53576
+    $ ls -1 $DATA_DIR/aptos/manual/train_images | wc -l
+    3662
+    $ ls -1 $DATA_DIR/aptos/manual/test_images | wc -l
+    1928
+    ```
 
-4. Manual loading -- this is not contained in standard execution of diabetic-retinopathy model execution (yet)
+5. Shuffle and package the data into TF dataset objects. The following commands take a while, so consider running them in a `tmux` or `screen` session in case the connection drops.
+    ```
+    conda activate ub
+    
+    # --prepare_mode=all prepares the EyePACS, Country Shift (APTOS 2019), and Severity Shift datasets.
+    python baselines/diabetic_retinopathy_detection/prepare_retina_data.py --data_dir=$DATA_DIR --prepare_mode=all
+    ```
 
-I suggest doing this loading in a `screen` session, in case it fails -- it takes a while. 
-
-I suggest doing this in an ipython shell!
-
-``$ ipython``
-
-**Train loading:**
-
-First, we initialize a DiabeticRetinopathyDetectionDataset object.
-
-```
-import uncertainty_baselines as ub
-
-data_dir = $DATA_DIR
-
-dataset_train_builder = ub.datasets.get(
-    "ub_diabetic_retinopathy_detection",
-    split='train',
-    data_dir=data_dir, download_data=True)
-```
-
-We then need to shuffle and package our data into TF objects:
-
-```
-dataset_train_builder._dataset_builder.download_and_prepare(download_dir=f'{data_dir}/downloads/')
-```
-
-Rinse and repeat for test data:
-
-```
-dataset_test_builder = ub.datasets.get(
-    "ub_diabetic_retinopathy_detection",
-    split='test',
-    data_dir=data_dir, download_data=True)
-dataset_test_builder._dataset_builder.download_and_prepare(download_dir=f'{data_dir}/downloads/')
-```
-
-**Install / Download for Severity and Country Shifts**
-
-Severity Shift depends on precisely the same data as the original Diabetic Retinopathy dataset, so we do not need to go back to step 1.
-
-We can package the Severity Shift splits into TF objects by substituting "ub_diabetic_retinopathy_detection" with "diabetic_retinopathy_severity_shift_mild", and using the following arguments for `split`:
- ```
-train
-in_domain_validation
-ood_validation
-in_domain_test
-ood_test
-```
-
-On the other hand, to download the (much smaller) APTOS dataset, we do need to repeat steps from step 1, downloading from https://www.kaggle.com/c/aptos2019-blindness-detection. Note that APTOS only includes "validation" and "test" splits.
-
-**Additional Splits for Exploration**
+### Additional Splits for Exploration
 
 There are several additional splits available for experimenting with other partitions of the severity levels into binary classification, and with other preprocessing configurations. 
 See the following files for details on available splits:
 *  [uncertainty_baselines/datasets/diabetic_retinopathy_detection.py](uncertainty_baselines/datasets/diabetic_retinopathy_detection.py): standard EyePACS dataset
-*  [uncertainty_baselines/datasets/diabetic_retinopathy_severity_shift_mild.py](uncertainty_baselines/datasets/diabetic_retinopathy_severity_shift_mild.py): Severity Shift with the binary decision threshold between no and mild DR, and {moderate, severe, proliferative} DR as out-of-distribution
+*  [uncertainty_baselines/datasets/diabetic_retinopathy_severity_shift_mild.py](uncertainty_baselines/datasets/diabetic_retinopathy_severity_shift_mild.py): Severity Shift with the binary decision threshold between no and mild DR, and {moderate, severe, proliferative} DR as out-of-distribution (used in the paper)
 *  [uncertainty_baselines/datasets/diabetic_retinopathy_severity_shift_moderate.py](uncertainty_baselines/datasets/diabetic_retinopathy_severity_shift_moderate.py): Severity Shift with the binary decision threshold between mild and moderate DR, and {severe, proliferative} DR as out-of-distribution
 *  [uncertainty_baselines/datasets/aptos.py](uncertainty_baselines/datasets/aptos.py): APTOS distributionally shifted evaluation dataset, partitioned into "validation" and "test" splits
 
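Note on step 4 of the README above: the file-count check can also be scripted. The following is a minimal sketch, assuming the directory layout described in the README. The names `EXPECTED_COUNTS`, `count_entries`, and `check_counts` are hypothetical and not part of this patch; the counts are copied from the README.

```python
from pathlib import Path

# Expected file counts, copied from step 4 of the README.
EXPECTED_COUNTS = {
    "ub_diabetic_retinopathy_detection/manual/train": 35126,
    "ub_diabetic_retinopathy_detection/manual/test": 53576,
    "aptos/manual/train_images": 3662,
    "aptos/manual/test_images": 1928,
}


def count_entries(directory: Path) -> int:
    """Counts non-hidden entries, mirroring `ls -1 | wc -l`."""
    return sum(1 for p in directory.iterdir() if not p.name.startswith("."))


def check_counts(data_dir, expected=None) -> dict:
    """Returns {relative_path: (found, expected)} for every mismatch.

    A missing directory is reported with a found count of 0.
    """
    expected = EXPECTED_COUNTS if expected is None else expected
    mismatches = {}
    for rel_path, want in expected.items():
        directory = Path(data_dir) / rel_path
        found = count_entries(directory) if directory.is_dir() else 0
        if found != want:
            mismatches[rel_path] = (found, want)
    return mismatches
```

Calling `check_counts(os.environ["DATA_DIR"])` (after `import os`) returns an empty dict when all four directories hold the expected number of files.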
diff --git a/baselines/diabetic_retinopathy_detection/prepare_retina_data.py b/baselines/diabetic_retinopathy_detection/prepare_retina_data.py
new file mode 100644
index 000000000..0b7dbad80
--- /dev/null
+++ b/baselines/diabetic_retinopathy_detection/prepare_retina_data.py
@@ -0,0 +1,67 @@
+"""
+Loads and packages the data for the RETINA Benchmark.
+"""
+
+from absl import app
+from absl import flags
+from absl import logging
+
+import uncertainty_baselines as ub
+
+flags.DEFINE_string(
+  'data_dir', None,
+  'Path to data folder, which contains subfolders called '
+  '`ub_diabetic_retinopathy_detection` and `aptos`, containing the raw data for '
+  'EyePACS and APTOS 2019 respectively. See README.md for further information.')
+flags.DEFINE_string(
+  'prepare_mode', 'all', 'Determine which dataset(s) to prepare.')
+flags.register_validator(
+  'prepare_mode',
+  lambda value: value in ['all', 'eyepacs', 'aptos', 'severity'],
+  message='--prepare_mode must be one of [all, eyepacs, aptos, severity].')
+FLAGS = flags.FLAGS
+
+# Supported datasets.
+_UB_DIABETIC_RETINOPATHY_DETECTION = "ub_diabetic_retinopathy_detection"
+_APTOS = "aptos"
+_DIABETIC_RETINOPATHY_SEVERITY_SHIFT_MILD = (
+  "diabetic_retinopathy_severity_shift_mild")
+
+# Splits for each dataset.
+_SPLITS = {
+  _UB_DIABETIC_RETINOPATHY_DETECTION: ['train', 'test'],
+  _APTOS: ['validation', 'test'],
+  _DIABETIC_RETINOPATHY_SEVERITY_SHIFT_MILD: [
+    'train', 'in_domain_validation',
+    'ood_validation', 'in_domain_test', 'ood_test']
+}
+
+_DATASET_NAMES_BY_MODE = {
+  'all': list(_SPLITS.keys()),
+  'eyepacs': [_UB_DIABETIC_RETINOPATHY_DETECTION],
+  'aptos': [_APTOS],
+  'severity': [_DIABETIC_RETINOPATHY_SEVERITY_SHIFT_MILD]
+}
+
+def _download_and_prepare_dataset(
+    dataset_name: str,
+    split: str,
+    data_dir: str
+) -> None:
+  builder = ub.datasets.get(dataset_name=dataset_name, split=split,
+                            data_dir=data_dir, download_data=True)
+  builder._dataset_builder.download_and_prepare(
+    download_dir=f'{data_dir}/{dataset_name}/')
+
+def main(argv):
+  del argv  # unused arg
+  data_dir = FLAGS.data_dir
+  dataset_names = _DATASET_NAMES_BY_MODE[FLAGS.prepare_mode]
+  for dataset_name in dataset_names:
+    for split in _SPLITS[dataset_name]:
+      _download_and_prepare_dataset(
+        dataset_name=dataset_name, split=split, data_dir=data_dir)
+      logging.info('Finished packaging `%s` %s data.', dataset_name, split)
+
+if __name__ == '__main__':
+  app.run(main)
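Note on `prepare_retina_data.py` above: the sketch below illustrates how a `--prepare_mode` value expands into the (dataset, split) pairs that `main` iterates over. It copies the two dictionaries from the patch and runs without `uncertainty_baselines`. The helper `expand_jobs` is hypothetical and exists only for illustration.

```python
from typing import Dict, List, Tuple

# Split names per dataset, copied from prepare_retina_data.py in this patch.
SPLITS: Dict[str, List[str]] = {
    "ub_diabetic_retinopathy_detection": ["train", "test"],
    "aptos": ["validation", "test"],
    "diabetic_retinopathy_severity_shift_mild": [
        "train", "in_domain_validation",
        "ood_validation", "in_domain_test", "ood_test"],
}

# Datasets selected by each --prepare_mode value, as in the patch.
DATASET_NAMES_BY_MODE: Dict[str, List[str]] = {
    "all": list(SPLITS),
    "eyepacs": ["ub_diabetic_retinopathy_detection"],
    "aptos": ["aptos"],
    "severity": ["diabetic_retinopathy_severity_shift_mild"],
}


def expand_jobs(prepare_mode: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: lists (dataset_name, split) pairs in loop order."""
    if prepare_mode not in DATASET_NAMES_BY_MODE:
        raise ValueError(
            f"prepare_mode must be one of {sorted(DATASET_NAMES_BY_MODE)}")
    return [(name, split)
            for name in DATASET_NAMES_BY_MODE[prepare_mode]
            for split in SPLITS[name]]
```

For example, `expand_jobs("aptos")` yields the `validation` and `test` splits of `aptos`, while `expand_jobs("all")` covers every split of all three datasets, in dictionary order.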
diff --git a/baselines/diabetic_retinopathy_detection/utils/eval_utils.py b/baselines/diabetic_retinopathy_detection/utils/eval_utils.py
index eac8ff763..e686b8915 100644
--- a/baselines/diabetic_retinopathy_detection/utils/eval_utils.py
+++ b/baselines/diabetic_retinopathy_detection/utils/eval_utils.py
@@ -542,8 +542,9 @@ def eval_model_numpy(datasets,
       np_input=np_input)
 
   if distribution_shift == 'aptos':
-    # TODO(nband): generalize
-    aptos_metadata_path = 'gs://ub-data/aptos/metadata.csv'
+    aptos_metadata_path = (
+      'gs://gresearch/reliable-deep-learning/data/baselines/'
+      'diabetic_retinopathy_detection/aptos_metadata.csv')
     eval_results['ood_test_balanced'] = compute_rebalanced_aptos_dataset(
         aptos_dataset=eval_results['ood_test'],
         aptos_metadata_path=aptos_metadata_path,
diff --git a/baselines/diabetic_retinopathy_detection/utils/metric_utils.py b/baselines/diabetic_retinopathy_detection/utils/metric_utils.py
index 60699628c..7368682d7 100644
--- a/baselines/diabetic_retinopathy_detection/utils/metric_utils.py
+++ b/baselines/diabetic_retinopathy_detection/utils/metric_utils.py
@@ -207,10 +207,10 @@ def log_epoch_metrics(metrics, eval_results, use_tpu, dataset_splits):
     train_columns = ['Train Loss (NLL+L2)', 'Accuracy', 'AUPRC', 'AUROC']
     train_metrics = ['loss', 'accuracy', 'auprc', 'auroc']
     train_values = [
-        metrics['train/loss'].result(),
-        metrics['train/accuracy'].result() * 100,
-        metrics['train/auprc'].result() * 100,
-        metrics['train/auroc'].result() * 100
+        metrics['train/loss'].result().numpy(),
+        metrics['train/accuracy'].result().numpy() * 100,
+        metrics['train/auprc'].result().numpy() * 100,
+        metrics['train/auroc'].result().numpy() * 100
     ]
     if not use_tpu:
       train_columns.append('ECE')