sync: merge dcagents-leaderboard DB changes + add unified eval listener #18

richardzhuang0412 wants to merge 2 commits into main
Conversation
Incorporate upstream changes from the dcagents-leaderboard repo into the unified_db database module while preserving all OT-specific features.

Schema changes (complete_schema.sql):
- Add duplicate_of column to models table (FK self-ref, ON DELETE RESTRICT)
- Add duplicate_of column to benchmarks table (same pattern)
- Add self-reference prevention constraints and indexes for both

Models (models.py):
- Add duplicate_of field + serializer to ModelModel and BenchmarkModel
- Add duplicate_of to clean_model_metadata and clean_benchmark_metadata
- Make SandboxJobModel.n_trials/config/n_rep_eval Optional for Pending jobs
- Add submitted_at and slurm_job_id fields to SandboxJobModel

Config (config.py):
- Add ClientOptions with 30s postgrest/storage timeouts

Utils (utils.py):
- Add duplicate_of parameter + validation to register_hf_model, register_local_model, and register_benchmark
- Add duplicate_of to model_data dicts in registration functions
- Add inf/nan sanitization retry logic to create_model
- Add duplicate_of FK checks to delete_model_by_id and delete_benchmark_by_id (prevent deleting canonical entries)
- Fix cardData null-safety in HF model description extraction
- Add pending job status system: JOB_STATUS constants, get_job_by_model_benchmark, get_latest_job_for_model_benchmark, create_job_entry_pending, update_job_status_to_started, create_job_entry_started

Exports (__init__.py):
- Add calculate_standard_error and all pending job status functions

Dependencies (requirements.txt):
- Relax supabase pin from ==2.22.3 to >=2.0.0,<3.0.0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add unified_eval_listener.py and unified_eval_harbor.sbatch that consolidate all individual eval listeners (aider, bfcl, swebench, v2, tb2, dev) and sbatch scripts into parameterized versions.

unified_eval_listener.py:
- Preset-based configuration (--preset bfcl/swebench/v2/tb2/aider/dev)
- Priority file hot-reload for model filtering
- Configurable sbatch parameters (n-concurrent, daytona-threshold, etc.)
- Pending job status tracking with DB integration (Pending → Started → Finished)
- Stale job detection and SLURM job cancellation for pending jobs
- HuggingFace model existence validation (optional, per-preset)
- Dry-run and single-iteration modes
- CLI args > ENV vars > Preset defaults priority chain

unified_eval_harbor.sbatch:
- Parameterized via environment variables (replaces hardcoded per-benchmark scripts)
- Supports Pending → Started DB job status transitions
- Configurable VLLM GPU memory, thinking blocks, agent parser
- DaytonaError threshold checking before result upload
- Backward compatible: works both with the listener (RUN_TAG provided) and standalone

Both scripts use the pending job status functions added in the previous DB merge commit (create_job_entry_pending, update_job_status_to_started, create_job_entry_started, get_latest_job_for_model_benchmark).

Note: Hardcoded TACC paths currently point to the dc-agent-shared deployment; update to OpenThoughts-Agent-shared paths when deploying to the OT cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello @richardzhuang0412, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request refactors and extends the evaluation and database interaction systems. It introduces deduplication support for models and benchmarks, establishes a pending job status system for tracking evaluation runs, and unifies the evaluation listener and SLURM submission scripts into configurable, centralized components. These changes aim to improve data integrity, streamline the evaluation workflow, and reduce redundancy across evaluation benchmarks.

Highlights
Changelog
Activity
Code Review
This pull request introduces significant new functionality by merging database changes for deduplication and pending job statuses, and adding a unified evaluation listener and sbatch script. A critical command injection vulnerability was identified in the unified_eval_harbor.sbatch script due to the unsafe use of eval on a command string built from unvalidated database-derived arguments, which could allow an attacker to execute arbitrary commands on the compute cluster. Additionally, the review focuses on improving maintainability by reducing code duplication, ensuring consistency between the Python models and the database schema, and addressing critical hardcoded paths in new scripts that must be resolved before deployment.
```bash
HARBOR_CMD="harbor jobs start"
HARBOR_CMD="$HARBOR_CMD -p \"$DATASET_PATH\""
HARBOR_CMD="$HARBOR_CMD --n-concurrent $N_CONCURRENT"
HARBOR_CMD="$HARBOR_CMD --agent \"$AGENT_NAME\""
HARBOR_CMD="$HARBOR_CMD --model \"hosted_vllm/$MODEL\""
HARBOR_CMD="$HARBOR_CMD --env \"daytona\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"api_base=http://localhost:8000/v1\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"key=fake_key\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"max_tokens=16384\""
HARBOR_CMD="$HARBOR_CMD --n-attempts $N_ATTEMPTS"
HARBOR_CMD="$HARBOR_CMD --export-traces"
HARBOR_CMD="$HARBOR_CMD --job-name \"$RUN_TAG\""
HARBOR_CMD="$HARBOR_CMD --config \"dcagent_eval_config.yaml\""

# Add parser arg if specified (e.g., for swebench)
if [ -n "$AGENT_PARSER" ]; then
    HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"parser=$AGENT_PARSER\""
fi

# Add extra_body for thinking control (only if enable_thinking is true)
if [ "$ENABLE_THINKING" = "true" ]; then
    HARBOR_CMD="$HARBOR_CMD --agent-kwarg 'extra_body={\"chat_template_kwargs\":{\"enable_thinking\":true}}'"
    echo "Thinking blocks enabled via extra_body"
fi

echo "Running: $HARBOR_CMD"

# Run sandbox evaluation
set +e
eval $HARBOR_CMD
```
The script uses eval to execute a command string (HARBOR_CMD) that is constructed using unvalidated positional arguments and environment variables. Specifically, the MODEL ($1) and RUN_TAG ($4) arguments, which are derived from the database in unified_eval_listener.py, can contain shell metacharacters (e.g., ;, ", `) that allow an attacker to break out of the intended command and execute arbitrary shell commands on the TACC cluster. This is a critical command injection vulnerability.
To remediate this, avoid using eval. Instead, build the command using a bash array and execute it directly. This ensures that arguments are passed as distinct strings and not interpreted by the shell.
```bash
HARBOR_CMD=("harbor" "jobs" "start")
HARBOR_CMD+=("-p" "$DATASET_PATH")
HARBOR_CMD+=("--n-concurrent" "$N_CONCURRENT")
HARBOR_CMD+=("--agent" "$AGENT_NAME")
HARBOR_CMD+=("--model" "hosted_vllm/$MODEL")
HARBOR_CMD+=("--env" "daytona")
HARBOR_CMD+=("--agent-kwarg" "api_base=http://localhost:8000/v1")
HARBOR_CMD+=("--agent-kwarg" "key=fake_key")
HARBOR_CMD+=("--agent-kwarg" "max_tokens=16384")
HARBOR_CMD+=("--n-attempts" "$N_ATTEMPTS")
HARBOR_CMD+=("--export-traces")
HARBOR_CMD+=("--job-name" "$RUN_TAG")
HARBOR_CMD+=("--config" "dcagent_eval_config.yaml")
if [ -n "$AGENT_PARSER" ]; then
    HARBOR_CMD+=("--agent-kwarg" "parser=$AGENT_PARSER")
fi
if [ "$ENABLE_THINKING" = "true" ]; then
    HARBOR_CMD+=("--agent-kwarg" 'extra_body={"chat_template_kwargs":{"enable_thinking":true}}')
fi
"${HARBOR_CMD[@]}"
```
```bash
python - <<'PY'
import os, sys, json
sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard")
```
This script contains a hardcoded path: /scratch/08134/negin/dc-agent-shared/dcagents-leaderboard. This makes the script non-portable and tightly coupled to a specific user's environment on TACC. This path should be made configurable, for example, through an environment variable that can be set in the execution environment or sourced from a config file. This issue is also present on lines 97, 100, 101, 105, and 233.
```python
from typing import Dict, List, Optional, Set, Tuple

# Add leaderboard utilities to path
sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard")
```
The script has a hardcoded path: sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard"). This makes the script non-portable and will cause it to fail on any machine where this specific path does not exist. This path should be configured via an environment variable or by installing the dcagents-leaderboard package in a way that it's available on the PYTHONPATH.
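One hedged way to do this, sketched below: resolve the checkout location from an environment variable, keeping the current deployment path only as a fallback default. The variable name `LEADERBOARD_ROOT` is an assumption, not something the PR defines.

```python
import os
import sys

# Resolve the leaderboard checkout from an environment variable rather than a
# hardcoded TACC path. LEADERBOARD_ROOT is a hypothetical variable name; the
# current deployment path is kept only as a fallback default so existing jobs
# keep working until the environment is updated.
LEADERBOARD_ROOT = os.environ.get(
    "LEADERBOARD_ROOT",
    "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard",
)
if LEADERBOARD_ROOT not in sys.path:
    sys.path.insert(0, LEADERBOARD_ROOT)
```

The same variable could then be exported once in the sbatch script and inherited by every heredoc, instead of repeating the literal path in each block.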
```python
"config": config or {},
"n_trials": 0,
"n_rep_eval": 0,
```
The SandboxJobModel in models.py defines n_trials and n_rep_eval as Optional, but the database schema in complete_schema.sql appears to define them as INTEGER NOT NULL. Here in create_job_entry_pending, you are setting them to 0, which works around the NOT NULL constraint but is semantically incorrect for a pending job where these values are unknown. The database schema should be updated to allow NULL for these fields to align with the Pydantic model and correctly represent the state of a pending job.
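A minimal sketch of the intended alignment, using a plain dataclass as a stand-in for the project's Pydantic `SandboxJobModel` (field names come from the PR description; the real model and the corresponding NULL-able schema columns may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingSandboxJob:
    """Simplified stand-in for SandboxJobModel: a job in the Pending state
    has no trial counts yet, so these fields are Optional and default to
    None rather than a placeholder 0. The DB columns would need to allow
    NULL for this to round-trip correctly."""
    model_id: str
    benchmark_id: str
    status: str = "Pending"
    n_trials: Optional[int] = None
    n_rep_eval: Optional[int] = None
    config: Optional[dict] = None

# A freshly created pending job carries no counts until the run starts.
job = PendingSandboxJob(model_id="m1", benchmark_id="b1")
```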
```bash
python - <<'PY' 2>&1 | tee -a "$UPLOAD_LOG"
import os, sys
from unified_db.utils import upload_eval_results
import re
import hashlib


def sanitize_hf_repo_id(repo_id: str, max_length: int = 96) -> str:
    """
    Sanitize a Hugging Face repo_id to comply with naming rules.
    Keeps org prefix (e.g. 'mlfoundations-dev/') and cleans up the rest.
    No extra '-' before hash suffix.
    """
    def collapse(s: str) -> str:
        prev = None
        while s != prev:
            prev = s
            s = s.replace("--", "-").replace("..", ".")
        return s

    org, name = repo_id.split("/", 1) if "/" in repo_id else (None, repo_id)

    name = re.sub(r"[^A-Za-z0-9._-]", "-", name)
    name = collapse(name).strip("-.")

    if not name:
        name = "repo"

    limit = max_length - (len(org) + 1 if org else 0)
    if len(name) > limit:
        digest = hashlib.sha1(name.encode()).hexdigest()[:8]
        keep = max(1, limit - len(digest))
        base = name[:keep].rstrip("-.")
        if not base:
            base = "r"
        name = f"{base}{digest}"  # no '-' before hash
        name = collapse(name).strip("-.")

    # final cleanup
    name = collapse(name).strip("-.")
    if name[0] in "-.":
        name = "r" + name[1:]
    if name[-1] in "-.":
        name = name[:-1] + "0"

    return f"{org}/{name}" if org else name


run_dir = os.environ["RUN_DIR"]
run_tag = os.environ["RUN_TAG"]
username = os.environ.get("UPLOAD_USERNAME", "negin")
error_mode = os.environ.get("UPLOAD_MODE", "skip_on_error")
hf_repo_id = sanitize_hf_repo_id(f"DCAgent2/{run_tag}")
hf_token = os.environ["HF_TOKEN"]
print(f"[uploader] upload_eval_results(path={run_dir!r}, username={username!r}, error_mode={error_mode!r}, hf_repo_id={hf_repo_id!r})")
upload_eval_results(run_dir, username=username, error_mode=error_mode, hf_token=hf_token, hf_repo_id=hf_repo_id, register_benchmark=True)
print("[uploader] done.")
PY
```
This script includes a very large and complex inline Python script for uploading results. Embedding significant logic within a bash script heredoc makes the code difficult to read, maintain, test, and debug. It also duplicates logic (like sanitize_hf_repo_id) that might be useful elsewhere. It would be much better to extract this Python code into a separate, standalone script that can be called from this sbatch script. This applies to the other Python blocks in this file as well (lines 175-223, 231-245, 334-362, 384-398).
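One possible shape for such an extraction is sketched below: a standalone uploader whose CLI defaults still read the environment variables the sbatch script already exports, so the heredoc collapses to a single `python upload_results.py` call. The filename, the flag names, and `build_parser` are assumptions; `unified_db.utils.upload_eval_results` is the function this PR adds.

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # Environment variables stay as defaults so the sbatch script can keep
    # exporting RUN_DIR / RUN_TAG / UPLOAD_USERNAME / UPLOAD_MODE as today,
    # while tests and manual runs can override them via flags.
    p = argparse.ArgumentParser(description="Upload Harbor eval results to the leaderboard DB")
    p.add_argument("--run-dir", default=os.environ.get("RUN_DIR"))
    p.add_argument("--run-tag", default=os.environ.get("RUN_TAG"))
    p.add_argument("--username", default=os.environ.get("UPLOAD_USERNAME", "negin"))
    p.add_argument("--error-mode", default=os.environ.get("UPLOAD_MODE", "skip_on_error"))
    return p

def main() -> None:
    args = build_parser().parse_args()
    # Imported lazily so the parser itself can be unit-tested on machines
    # without the unified_db package installed.
    from unified_db.utils import upload_eval_results
    upload_eval_results(args.run_dir, username=args.username, error_mode=args.error_mode)
```

This also makes the `sanitize_hf_repo_id` helper importable from one place rather than re-pasted into each heredoc.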
```python
except ValueError as e:
    # Check if the error is specifically the JSON 'inf' compliant error
    if "Out of range float values" in str(e):
        logger.warning("Detected non-JSON compliant floats (inf/nan). Applying patch...")

        import math

        def sanitize(obj):
            if isinstance(obj, dict):
                return {k: sanitize(v) for k, v in obj.items()}
            elif isinstance(obj, list):
                return [sanitize(v) for v in obj]
            elif isinstance(obj, float):
                if math.isinf(obj):
                    logger.info("Replacing 'inf' float with string 'inf'")
                    return "inf"
                if math.isnan(obj):
                    logger.info("Replacing 'nan' float with string 'nan'")
                    return "nan"
            return obj

        # Sanitize and retry
        model_data = sanitize(model_data)
        client = get_supabase_client(use_admin=True)
        response = client.table('models').insert(model_data).execute()
        if not response.data:
            raise ValueError("Failed to create model")

        return clean_model_metadata(response.data[0])
    else:
        # Re-raise if it's a different ValueError
        raise e
```
The sanitize function and import math are defined within the except ValueError block. It's better practice to define helper functions and imports at the module level for clarity, reusability, and to avoid re-definition on every exception. Additionally, this sanitization logic for inf/nan values is only applied to create_model. Other functions that handle JSONB data, such as update_model or functions creating/updating sandbox_jobs with config, metrics, or stats fields, could also benefit from this sanitization to prevent similar errors.
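A module-level version might look like the following sketch (the name `sanitize_json_floats` is hypothetical), which any JSONB-writing helper could call on its payload before insert/update:

```python
import math
from typing import Any

def sanitize_json_floats(obj: Any) -> Any:
    """Recursively replace inf/nan floats with the strings 'inf'/'nan' so
    payloads are JSON-compliant. Defined at module level so create_model,
    update_model, and the sandbox_jobs helpers can all reuse it instead of
    redefining it inside an except block."""
    if isinstance(obj, dict):
        return {k: sanitize_json_floats(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_json_floats(v) for v in obj]
    if isinstance(obj, float):
        if math.isinf(obj):
            return "inf"
        if math.isnan(obj):
            return "nan"
    return obj

cleaned = sanitize_json_floats({"metrics": {"loss": float("inf")},
                                "scores": [float("nan"), 0.5]})
```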
```python
# Validate duplicate_of if provided
if duplicate_of:
    client = get_supabase_client()
    target = client.table('models').select('id, duplicate_of').eq('id', duplicate_of).execute()
    if not target.data:
        return {"success": False, "error": f"duplicate_of target model {duplicate_of} not found"}
    if target.data[0].get('duplicate_of'):
        return {"success": False, "error": f"duplicate_of target {duplicate_of} is itself a duplicate. Only point to canonical entries."}
```
This validation logic for the duplicate_of field is repeated in register_hf_model, register_local_model (lines 834-841), and register_benchmark (lines 1453-1460). To improve maintainability and reduce code duplication, consider extracting this logic into a reusable helper function. The function could take the table name and the ID to validate as arguments.
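A sketch of such a helper (the name and exact error messages are assumptions; the query chain mirrors the supabase-py style calls shown in the diff):

```python
from typing import Optional

def validate_duplicate_of(client, table: str, duplicate_of: str) -> Optional[dict]:
    """Shared duplicate_of validation for the models and benchmarks tables.

    Returns an error dict matching the registration functions' error shape
    when the target row is missing or is itself a duplicate; returns None
    when the target is a valid canonical entry.
    """
    target = client.table(table).select("id, duplicate_of").eq("id", duplicate_of).execute()
    if not target.data:
        return {"success": False,
                "error": f"duplicate_of target {duplicate_of} not found in {table}"}
    if target.data[0].get("duplicate_of"):
        return {"success": False,
                "error": f"duplicate_of target {duplicate_of} is itself a duplicate. "
                         "Only point to canonical entries."}
    return None
```

Each registration function would then reduce its duplicated block to `err = validate_duplicate_of(client, 'models', duplicate_of)` followed by `if err: return err`.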
```bash
trap cleanup EXIT

# Wait for VLLM server to start with healthcheck
RETRY_INTERVAL=100
```
The RETRY_INTERVAL for the VLLM server healthcheck is set to 100 seconds. This seems excessively long for a retry interval and might be a typo. A shorter interval, such as 10 or 15 seconds, would allow for faster job startup or failure detection.
Suggested change:

```bash
RETRY_INTERVAL=10
```
@richardzhuang0412, can you please update this to the new v3 listener, address the Gemini notes, and merge the latest from main? Would be great to get this merged soon!
Summary
Syncs DB changes from mlfoundations/dcagents-leaderboard: adds `duplicate_of` deduplication support to the models/benchmarks tables, a pending job status system (Pending → Started → Finished), Supabase client timeout configuration, and flexible dependency versioning.

Files changed (8 files, +2190/-15)
Database (`database/unified_db/`):
- `complete_schema.sql` — `duplicate_of` columns + self-reference constraints on models/benchmarks
- `models.py` — `duplicate_of` fields, Optional SandboxJob fields, `submitted_at`/`slurm_job_id`
- `config.py` — `ClientOptions` with postgrest/storage timeouts
- `utils.py` — duplicate_of validation, inf/nan sanitization, pending job status functions
- `__init__.py` — new exports for pending job functions
- `requirements.txt` — `supabase>=2.0.0,<3.0.0` (flexible)

Eval (`eval/tacc/`):
- `unified_eval_listener.py` — preset-based listener with priority filtering, stale job detection, dry-run mode
- `unified_eval_harbor.sbatch` — parameterized sbatch with env var configuration

Test plan
- … (`duplicate_of` columns)
- `--dry-run --once` on each preset
- `dc-agent-shared` → `OpenThoughts-Agent-shared` before deployment

🤖 Generated with Claude Code