diff --git a/.gitignore b/.gitignore
index 86f36e6..5c2169d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,7 @@
 .venv/
 __pycache__/
 pipeline_config.yml
+reports/
 config/pipeline_config.yml
 config/local_pipeline_config.yml
 config/docker_pipeline_config.yml
diff --git a/README.md b/README.md
index 4244ca3..6fda121 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,10 @@
-## Eval Coordinator
+## Autoeval Coordinator
 ### Description
-The coordinator is the entry point to the evaluation pipeline. It takes a gpkg containing either a polygon or multipolygon geometry and then uses that to run and monitor batch jobs for each step along the evaluation pipeline for all the polygons submitted by the user.
+This repository contains an evaluation pipeline that works with HashiCorp Nomad to evaluate HAND-generated flood inundation map (FIM) extents against benchmark FIMs. It takes a gpkg containing either a polygon or multipolygon geometry and then uses that to run and monitor batch jobs for each step along a FIM evaluation pipeline, which evaluates flood scenarios for benchmark sources that intersect the AOI submitted by the user.
-The current evaluation pipeline is primarily designed to generate HAND FIM extents or depths and then evaluate these against relevant benchmark sources.
+The repository also contains a `tools/` directory that assists the user in running batches of evaluation pipelines, evaluating the results of a batch, and working with the Nomad API.
+
+While the current evaluation pipeline is primarily designed to generate HAND FIM extents or depths and then evaluate these against relevant benchmark sources, more pipelines may be added in the future to allow for evaluations of more types of FIMs or for different types of FIM evaluations.
 ### Getting Started Locally
 1. Create `.env` file
@@ -50,9 +52,36 @@ This version can also be adapted to dispatch jobs to non-local Nomad servers.
 - The HAND version argument allows the user to specify a specific version of HAND to generate extents for. This argument is required.
- **Benchmark Source**
  - This is a string that selects which source will be used to evaluate HAND against. For example, 'ripple-mip' will select FEMA MIP data produced by ripple. This argument is required.
+- **hand_index_path**
+  - This argument provides the location of the HAND index used to spatially query a given set of HAND outputs. The NGWPC hand-index repo contains more information about generating a HAND index for use in an evaluation.
- **Date Range**
  - Certain benchmark sources contain flood scenarios that have a time component. For example, high water mark data is associated with the flood event for a given survey. This argument allows for filtering a benchmark source to only return benchmark data within a certain date range.

### Inputs
- **AOI**
  - This input is a geopackage that must contain either a polygon or multipolygon geometry. For every polygon the coordinator will generate a HAND extent and find benchmark data that lies within the polygon for the source selected by the user. The coordinator will then run all the rest of the jobs described in this repository to generate an evaluation for that polygon.
+
+### Outputs
+- **output_path**
+  This is the directory where the outputs of a pipeline will be written. The outputs written to this directory follow this format (here `<test_case>` is synonymous with the AOI identifier):
+
+  - `<test_case>/`: the unique identifier, or test case, for the category of benchmark data used to generate metrics. For the PI7 ripple eval data this corresponds to a STAC item id in a given benchmark STAC collection. This could also be an ID for an AOI that returns multiple STAC items from the Benchmark STAC when used as a query AOI. In the example output the `<test_case>` is the STAC item id "11090202-ble" from the "ble-collection" benchmark STAC collection.
+    - `<test_case>__agg_metrics.csv`: aggregated metrics for the test case.
+    - `<test_case>__logs.txt`: test case logs generated by the pipeline.
+    - `<test_case>__results.json`: A file containing metadata and references to written output file locations.
+    - catchment_data_indices/: This directory contains files that point to catchment HAND data for each HAND catchment that will be inundated for comparison against the benchmark scenarios being evaluated.
+      - catchment-`<catchment_id>`.parquet: Files in this directory are parquet files that contain the UUID assigned to that catchment in the HAND index.
+    - `<benchmark_collection>/`: This directory name is a shortened reference to the benchmark STAC collection that the benchmark data for this evaluation was queried from. It is possible for a single AOI or test case to be evaluated against multiple benchmark collections, so in some cases there could be multiple directories of this type, each containing evaluation results for one benchmark collection.
+      - `<scenario>/`: Test case scenario, e.g., "ble-100yr".
+        - `<scenario>-<test_case>__agreement.tif`: The agreement raster for this scenario.
+        - `<scenario>-<test_case>__benchmark_mosiac.tif`: The mosaiced benchmark raster used as the benchmark raster.
+        - `<scenario>-<test_case>__flowfile.csv`: The merged flowfile used for this scenario.
+        - `<scenario>-<test_case>__inundate_mosiac.tif`: The mosaiced HAND extent used as the candidate raster for this scenario.
+        - `<scenario>-<test_case>__metrics.csv`: A single-row CSV containing the metrics for this scenario. These CSVs are aggregated together along with additional metadata to create the test case's agg_metrics.csv file.
+        - catchment_extents/
+          - `<catchment_id>__<scenario>.tif`: The HAND extents for a single HAND catchment. These are merged together to form the inundate mosaic for the scenario.
+
+
+### Running a batch of pipelines
+
+The above instructions are for running a single test evaluation pipeline using a local Nomad cluster. If you know which HAND outputs you want to evaluate and where their HAND index is located, and you have access to the FIM Benchmark STAC, this should be sufficient to run single pipelines.
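After a single pipeline finishes, a quick way to confirm it produced its top-level outputs is a check like the following. This is a hypothetical helper, not part of the repo; it assumes the per-test-case file names described in the Outputs section are prefixed with the test case id.

```python
from pathlib import Path


def missing_outputs(output_path: str, test_case: str) -> list:
    """Return the expected top-level output files for a test case that are absent.

    Assumes the layout described in the Outputs section:
    <output_path>/<test_case>/<test_case>__agg_metrics.csv, __logs.txt, __results.json.
    """
    root = Path(output_path) / test_case
    expected = [
        root / f"{test_case}__agg_metrics.csv",
        root / f"{test_case}__logs.txt",
        root / f"{test_case}__results.json",
    ]
    return [p for p in expected if not p.exists()]
```

For outputs written to S3 the same idea applies, but you would list keys with your S3 client of choice instead of using `Path.exists()`.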
This repository also contains functionality for running batches of dozens to thousands of pipelines using either a local Nomad cluster running within the Parallel Works environment or a Nomad cluster deployed to the NGWPC AWS Test account. For more information on running batches please refer to `docs/batch-run-guide-ParallelWorks.md` and `docs/batch-run-guide-AWS-Test.md`.
diff --git a/docs/batch-run-guide-AWS-Test.md b/docs/batch-run-guide-AWS-Test.md
new file mode 100644
index 0000000..3137db1
--- /dev/null
+++ b/docs/batch-run-guide-AWS-Test.md
@@ -0,0 +1,101 @@
+This document contains instructions for running a batch of autoeval pipelines in the AWS test account.
+
+The test account instructions assume that you are interacting with a Nomad API that is set up in a way similar to the [nomad-runner](https://github.com/NGWPC/nomad-runner) deployment. This deployment uses a single Nomad server that sends jobs to a group of EC2 instances in an Auto Scaling group (ASG).
+
+# Running a pipeline batch in the AWS test account
+
+## Size the cluster
+
+The ripple batch runs were executed using a c5.9xlarge server and between 10-40 r5a.xlarge clients. The more clients you are communicating with, and the larger their instance size, the larger your server instance needs to be. At these instance sizes, and for the pipeline batch jobs being run, the maximum number of clients that the server could effectively communicate with was ~40. Currently the pipeline code is not designed to work with a Nomad API that autoscales clients, because autoscaling causes jobs to be cancelled or lost and then rescheduled with a new dispatched job id. A recovery mechanism for this event is planned but has not yet been implemented. Because of this, the number of clients needs to be fixed at the start of a batch run by setting the "desired capacity" of the AWS autoscaling group.
For this approach to work, the autoscaler job also needs to be turned off before the desired capacity is set. A good rule of thumb is to set the number of clients/desired capacity to half the number of pipelines that you want to run. If that number is higher than the maximum number of clients supported by the Nomad API, then use the maximum number of clients.
+
+Eval pipeline batches are not designed to be run on a Nomad cluster being used by other workloads. In the future, once the pipeline has implemented more robust job tracking, and with a more robust Nomad API, it could be possible for a batch to be run alongside other workloads.
+
+## Export your NOMAD_TOKEN and NOMAD_ADDR
+
+These environment variables are used by the batch submission code to determine which Nomad API to send requests to and to authenticate to that API. They can be set with:
+
+```
+export NOMAD_ADDR="http://localhost:4646"
+export NOMAD_TOKEN="token"
+```
+
+These variables will be read from your environment when you start the autoeval container.
+
+## Refresh your S3 credentials
+
+The batch run script needs access to the S3 bucket that the pipelines will output data to, because it uploads the AOIs used by the pipelines to S3. You should refresh your credentials in the environment that you will start your autoeval container from.
+
+## Start the autoeval container
+
+Start the autoeval container by running the following from the repo root:
+
+```
+docker compose -f docker-compose-dev.yml up -d
+docker compose -f docker-compose-dev.yml exec autoeval-dev bash
+```
+
+You should execute the batch code from this container's shell.
+
+## Configure Nomad Job definitions
+
+Most of the environment variables in ./job_defs/test/ should already be configured, but if you are using a different NOMAD_ADDR from the one used by NGWPC then you should set that as well.
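For orientation, the memory setting referred to below lives in the Nomad `resources` block of each task in the job definition; the values here are hypothetical and should be sized per your data:

```hcl
resources {
  cpu    = 2000 # MHz
  memory = 8192 # MiB; choose per docs/job_sizing_guide.md
}
```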
Depending on the data being evaluated you might also want to adjust the job memory requirements in the "resources" block of the job definition. Please refer to docs/job_sizing_guide.md in this repo for guidance on how much memory to allocate to each autoeval job based on the resolution of the data being evaluated.
+
+## Start the Nomad memory monitor script
+
+Open another autoeval-dev container shell, separate from the one you will use to run the batch of pipelines, using another `docker compose exec autoeval-dev bash` command, and then from the local repo root start the script that monitors the Nomad server's memory usage with `tools/nomad_memory_monitor.sh`. This terminal also needs to have valid NOMAD_ADDR and NOMAD_TOKEN environment variables. The memory monitor script will create a log file at `nomad_memory_usage.log` and will run the command `nomad system gc` whenever the Nomad server's active memory allocation exceeds the value of `MEMORY_THRESHOLD_GIB` hardcoded at the top of the script. The `MEMORY_THRESHOLD_GIB` value should be set to about 25-30% of your Nomad server's max memory.
+
+Memory monitoring is necessary because after running jobs Nomad keeps old allocations and evaluations in memory for a configurable amount of time. If memory use gets too high, the server slows down and becomes unresponsive. The memory can be cleared on a set schedule by configuring the server (for example, every 15 minutes), but it was observed that the API could lose jobs during garbage collection events. So, to minimize the number of garbage collection events while also ensuring that the server stays responsive, a dynamic approach was taken that monitors the memory usage of the server from the client side and clears the memory only when it exceeds the threshold.
+
+To run this script you need the Nomad CLI installed.
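The monitor's core decision is just a threshold check on the server's active allocation. A minimal sketch of that logic (hypothetical; the real `tools/nomad_memory_monitor.sh` is a shell script that reads allocation stats from the Nomad API and shells out to `nomad system gc`):

```python
# Hypothetical threshold value; set to ~25-30% of the Nomad server's max memory.
MEMORY_THRESHOLD_GIB = 8


def should_run_gc(allocated_bytes: int, threshold_gib: float = MEMORY_THRESHOLD_GIB) -> bool:
    """True when the server's active memory allocation exceeds the threshold.

    When this holds, the monitor script triggers `nomad system gc` and logs
    the usage to nomad_memory_usage.log.
    """
    return allocated_bytes / 1024**3 > threshold_gib
```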
Instructions for installing the CLI [can be found here](https://developer.hashicorp.com/nomad/tutorials/get-started/gs-install).
+
+## Submit a batch of pipelines
+
+A batch of pipelines can be submitted using this command, after it has been modified for the specifics of your batch:
+
+```
+ python tools/submit_stac_batch.py --batch_name fim100_huc12_3m_2025-08-21-15 --output_root s3://fimc-data/autoeval/batches/fim100_huc12_3m_non_calibrated/ --hand_index_path s3://fimc-data/autoeval/hand_output_indices/fim100_huc12_3m_index/ --benchmark_sources "ripple-fim-collection" --item_list /home/dylan.lee/autoeval-coordinator/inputs/ripple-fim-collection-3m-run4.txt --wait_seconds 10 --stop_threshold 30 --resume_threshold 15
+```
+
+The arguments are:
+
+* --batch_name: The name of the batch that will be included in the Nomad job definitions. It is usually timestamped to the hour to make it possible to query different batches in CloudWatch.
+* --output_root: The directory that the batch outputs will be written to.
+* --hand_index_path: The directory that contains the HAND index that will be used to assemble the necessary HAND outputs to run each pipeline.
+* --benchmark_sources: The list of benchmark STAC collections that you want evaluated.
+* --item_list: A list of the specific STAC item IDs that will be evaluated. A pipeline job will be submitted for each item on this list.
+* --wait_seconds: The number of seconds to wait between submitting pipeline jobs to the Nomad API. This should be 10 or more seconds to avoid hammering the Nomad API.
+* --stop_threshold: The maximum number of pipelines that should be running on Nomad at once. Tests revealed that this shouldn't be more than 30 pipelines in parallel or else the server will be overwhelmed. Once this threshold is reached, `submit_stac_batch.py` will pause pipeline job submission and wait until the resume_threshold is reached.
+* --resume_threshold: The threshold at which pipelines will start being submitted again by `submit_stac_batch.py`. The resume threshold should be 10-15 pipeline jobs below the stop_threshold. Having two thresholds introduces a pause that ensures each pipeline job is able to submit jobs at the inundate, mosaic, and agreement stages without getting crowded out by other, newer jobs. This pause is necessary because of how Nomad's job scheduling works: without it, the Nomad scheduler would tend to preferentially place jobs with lower resource requirements, which leaves pipelines hung up for unreasonable lengths of time and increases the risk of pipeline failure.
+
+## Evaluate the batch outcome
+
+After a batch has run, the script tools/cloudwatch_reports.py should be run from the autoeval-dev shell that you ran tools/submit_stac_batch.py from.
+
+The Test account that stores the CloudWatch logs currently needs different credentials from those used by the S3 bucket. You can update the credentials by exporting new `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` variables that work for the Test account CloudWatch service and then run the script using a modified form of:
+
+```
+./tools/cloudwatch_reports.py inputs/ripple-fim-collection-item-list.txt fim100_huc12_10m_2025-08-20-09 reports/ripple-10m-run3
+```
+
+The first argument is the list of STAC item IDs that were submitted by submit_stac_batch.py. The second argument is the batch name used by submit_stac_batch.py. The third argument is where the report files will be written.
+
+The file unique_fail_aoi_names.txt in the results directory has the list of failed pipeline AOIs. Usually these pipelines failed because of Nomad API, S3, or credential errors and will succeed if the failed AOIs are resubmitted.
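The --stop_threshold/--resume_threshold behaviour described above amounts to a simple hysteresis loop: stop submitting at the high-water mark, and only resume once the running count has drained to the low-water mark. A minimal sketch of that logic (hypothetical; not the repo's actual implementation):

```python
class SubmissionGate:
    """Hysteresis gate mirroring --stop_threshold/--resume_threshold (sketch only)."""

    def __init__(self, stop_threshold: int = 30, resume_threshold: int = 15):
        self.stop_threshold = stop_threshold
        self.resume_threshold = resume_threshold
        self.paused = False

    def may_submit(self, running: int) -> bool:
        # Pause once the running-pipeline count reaches the stop threshold...
        if not self.paused and running >= self.stop_threshold:
            self.paused = True
        # ...and resume only after it drains back down to the resume threshold.
        elif self.paused and running <= self.resume_threshold:
            self.paused = False
        return not self.paused
```

The gap between the two thresholds is what gives in-flight pipelines room to schedule their inundate, mosaic, and agreement jobs before new pipelines compete for placement.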
To resubmit, all you have to do is copy the contents of unique_fail_aoi_names.txt to a batch item list file and then start another batch using that item list as an input to submit_stac_batch.py.
+
+Refer to docs/intepreting-reports.md for more information on using a batch's reports to inspect its outcome.
+
+## Shutdown the memory monitor and purge jobs
+
+Kill the nomad_memory_monitor.sh script and then from that container run:
+
+```
+nomad system gc
+python tools/purge_dispatch_jobs.py
+```
+
+This will clear all the jobs associated with the batch from the Nomad server's memory. This step ensures that the Nomad server stays responsive and makes it easier to use the Nomad UI to monitor the progress of the next batch that will be run.
+
+If this is the last batch you will run, you can now kill the instance of the autoeval-dev shell that was being used to monitor memory.
+
+## Set the ASG to 1 client and turn the autoscaler job back on
+
+After you have successfully run your batches, you should set the desired capacity of the ASG back to 1 client to save on costs. The autoscaler job should also be turned back on in case the next user has a workload for which it is useful.
diff --git a/docs/batch-run-guide-ParallelWorks.md b/docs/batch-run-guide-ParallelWorks.md
new file mode 100644
index 0000000..59e3bbf
--- /dev/null
+++ b/docs/batch-run-guide-ParallelWorks.md
@@ -0,0 +1,157 @@
+This document contains instructions for running a batch of autoeval pipelines in Parallel Works on a single instance using a local Nomad cluster.
+
+# Running a pipeline batch in Parallel Works
+
+## Start the `fimsinglenode` cluster and attach a desktop
+
+Eval pipeline batches are not designed to be run on a Parallel Works cluster being used by other workloads. If another user is running a large workload on `fimsinglenode`, it is recommended to create a clone of the cluster on which to execute the batch.
+
+## Start a terminal and navigate to the repo root
+
+Unless otherwise noted, all the commands below should be run from the root of the Parallel Works clone of the `autoeval-coordinator` repository. Currently this clone is located at `/efs/demonstrations/pi7/autoeval-coordinator` when using the `fimsinglenode` cluster.
+
+## Export your NOMAD_ADDR and AWS credentials
+
+These environment variables are used by the batch submission code to determine which Nomad API to send requests to and to authenticate to the S3 bucket that outputs will be written to.
+
+The NOMAD_ADDR can be set with:
+
+```
+export NOMAD_ADDR="http://localhost:4646"
+```
+
+The domain is localhost since we are using a local Nomad cluster.
+
+The AWS credentials for the NGWPC Data account also need to be exported. They can be obtained from the NGWPC AWS Access Portal. When you sign into that portal, if you have a role with access to the Data account, the appropriate credentials can be copied from the popup that appears when you click "Access Keys". The credentials should look something like:
+
+```
+export AWS_ACCESS_KEY_ID=""
+export AWS_SECRET_ACCESS_KEY="