
sync: merge dcagents-leaderboard DB changes + add unified eval listener #18

Open
richardzhuang0412 wants to merge 2 commits into main from
sync/dcagents-leaderboard-db-merge-and-unified-listener

Conversation

@richardzhuang0412
Collaborator

Summary

  • Merge DB changes from mlfoundations/dcagents-leaderboard: Adds duplicate_of deduplication support to models/benchmarks tables, pending job status system (Pending → Started → Finished), Supabase client timeout configuration, and flexible dependency versioning
  • Add unified eval listener: Consolidates all individual eval listeners (aider, bfcl, swebench, v2, tb2, dev) and sbatch scripts into parameterized versions with preset-based configuration, priority file hot-reload, and configurable sbatch parameters
  • Add unified eval harbor sbatch: Parameterized sbatch script that replaces per-benchmark scripts, supports Pending→Started DB transitions, configurable VLLM/thinking/parser settings, and DaytonaError threshold checking

Files changed (8 files, +2190/-15)

Database (database/unified_db/):

  • complete_schema.sql — duplicate_of columns + self-reference constraints on models/benchmarks
  • models.py — duplicate_of fields, Optional SandboxJob fields, submitted_at/slurm_job_id
  • config.py — ClientOptions with postgrest/storage timeouts
  • utils.py — duplicate_of validation, inf/nan sanitization, pending job status functions
  • __init__.py — new exports for pending job functions
  • requirements.txt — supabase>=2.0.0,<3.0.0 (flexible)

Eval (eval/tacc/):

  • unified_eval_listener.py — preset-based listener with priority filtering, stale job detection, dry-run mode
  • unified_eval_harbor.sbatch — parameterized sbatch with env var configuration

Test plan

  • Verify DB schema migration applies cleanly (add duplicate_of columns)
  • Test unified listener with --dry-run --once on each preset
  • Confirm pending job status functions work end-to-end
  • Update hardcoded TACC paths from dc-agent-shared → OpenThoughts-Agent-shared before deployment

🤖 Generated with Claude Code

richardzhuang0412 and others added 2 commits February 19, 2026 23:59
Incorporate upstream changes from the dcagents-leaderboard repo into
the unified_db database module while preserving all OT-specific features.

Schema changes (complete_schema.sql):
- Add duplicate_of column to models table (FK self-ref, ON DELETE RESTRICT)
- Add duplicate_of column to benchmarks table (same pattern)
- Add self-reference prevention constraints and indexes for both

Models (models.py):
- Add duplicate_of field + serializer to ModelModel and BenchmarkModel
- Add duplicate_of to clean_model_metadata and clean_benchmark_metadata
- Make SandboxJobModel.n_trials/config/n_rep_eval Optional for Pending jobs
- Add submitted_at and slurm_job_id fields to SandboxJobModel

Config (config.py):
- Add ClientOptions with 30s postgrest/storage timeouts

Utils (utils.py):
- Add duplicate_of parameter + validation to register_hf_model,
  register_local_model, and register_benchmark
- Add duplicate_of to model_data dicts in registration functions
- Add inf/nan sanitization retry logic to create_model
- Add duplicate_of FK checks to delete_model_by_id and
  delete_benchmark_by_id (prevent deleting canonical entries)
- Fix cardData null-safety in HF model description extraction
- Add pending job status system: JOB_STATUS constants,
  get_job_by_model_benchmark, get_latest_job_for_model_benchmark,
  create_job_entry_pending, update_job_status_to_started,
  create_job_entry_started

Exports (__init__.py):
- Add calculate_standard_error and all pending job status functions

Dependencies (requirements.txt):
- Relax supabase pin from ==2.22.3 to >=2.0.0,<3.0.0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
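The Pending → Started → Finished workflow these functions implement can be summarized as a small transition map. This is an illustrative sketch only: the status constants match the names in the commit, but the transition table and helper are assumptions, not code from the PR.

```python
# Status constants named in the commit; the transition map is an assumption.
JOB_STATUS_PENDING = "Pending"
JOB_STATUS_STARTED = "Started"
JOB_STATUS_FINISHED = "Finished"

# Each status maps to the set of statuses a job may move to next.
ALLOWED_TRANSITIONS = {
    JOB_STATUS_PENDING: {JOB_STATUS_STARTED},
    JOB_STATUS_STARTED: {JOB_STATUS_FINISHED},
    JOB_STATUS_FINISHED: set(),
}

def can_transition(current: str, new: str) -> bool:
    """Return True if moving a job from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Under this model, update_job_status_to_started is the Pending → Started edge, and a stale "Pending" job is one that never takes that edge within some timeout.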
Add unified_eval_listener.py and unified_eval_harbor.sbatch that
consolidate all individual eval listeners (aider, bfcl, swebench,
v2, tb2, dev) and sbatch scripts into parameterized versions.

unified_eval_listener.py:
- Preset-based configuration (--preset bfcl/swebench/v2/tb2/aider/dev)
- Priority file hot-reload for model filtering
- Configurable sbatch parameters (n-concurrent, daytona-threshold, etc.)
- Pending job status tracking with DB integration (Pending → Started → Finished)
- Stale job detection and SLURM job cancellation for pending jobs
- HuggingFace model existence validation (optional, per-preset)
- Dry-run and single-iteration modes
- CLI args > ENV vars > Preset defaults priority chain
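The "CLI args > ENV vars > Preset defaults" chain amounts to returning the first value that is set. A minimal sketch, assuming nothing about the listener's actual option names (the function and env-var names here are illustrative):

```python
import os

def resolve_setting(cli_value, env_var, preset_value):
    """Return the highest-priority value: CLI arg > env var > preset default."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get(env_var)  # hypothetical env var name
    if env_value is not None:
        return env_value
    return preset_value

# e.g. only the env var is set, so it wins over the preset default
os.environ["EVAL_N_CONCURRENT"] = "8"
n_concurrent = resolve_setting(None, "EVAL_N_CONCURRENT", "4")
```

One design consequence of this ordering: a preset can ship safe defaults while an operator overrides a single knob per deployment (env var) or per run (CLI flag) without editing the preset.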

unified_eval_harbor.sbatch:
- Parameterized via environment variables (replaces hardcoded per-benchmark scripts)
- Supports Pending→Started DB job status transitions
- Configurable VLLM GPU memory, thinking blocks, agent parser
- DaytonaError threshold checking before result upload
- Backward compatible: works both with listener (RUN_TAG provided) and standalone

Both scripts use the pending job status functions added in the previous
DB merge commit (create_job_entry_pending, update_job_status_to_started,
create_job_entry_started, get_latest_job_for_model_benchmark).

Note: Hardcoded TACC paths currently point to dc-agent-shared deployment;
update to OpenThoughts-Agent-shared paths when deploying to OT cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @richardzhuang0412, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors and enhances the evaluation and database interaction systems. It introduces robust deduplication capabilities for models and benchmarks, establishes a comprehensive pending job status system for better tracking of evaluation runs, and unifies the evaluation listener and SLURM submission scripts into highly configurable, centralized components. These changes aim to improve data integrity, streamline the evaluation workflow, and reduce redundancy across various evaluation benchmarks.

Highlights

  • Database Schema and Model Updates: Introduced duplicate_of columns with self-reference constraints and indexes to the models and benchmarks tables for deduplication support. The SandboxJobModel was enhanced to support a pending job status system, including new submitted_at and slurm_job_id fields, and making n_trials, config, and n_rep_eval optional for pending entries.
  • Supabase Client and Data Handling Improvements: Configured the Supabase client with explicit postgrest and storage timeouts. Implemented inf/nan sanitization for JSON data during model creation to handle non-compliant float values. Added validation logic to prevent duplicate_of fields from pointing to other duplicate entries or non-existent IDs.
  • Unified Evaluation Listener: Developed a new unified_eval_listener.py script that consolidates previous individual evaluation listeners. This listener supports preset-based configurations, hot-reloading of priority model lists, and configurable sbatch parameters, streamlining the evaluation submission process.
  • Unified Evaluation SLURM Script: Created a new unified_eval_harbor.sbatch script to replace multiple per-benchmark sbatch scripts. This parameterized script handles VLLM setup, updates job statuses in the database (Pending → Started), and includes configurable settings for VLLM, agent parsing, and DaytonaError threshold checking.
  • Pending Job Status System: Introduced a new job status workflow (Pending → Started → Finished) within the database utilities. Functions were added to create minimal pending job entries upon SLURM submission and to update them to 'Started' status with full details when the job begins execution.
Changelog
  • database/unified_db/__init__.py
    • Exported new utility functions, including calculate_standard_error.
    • Exported new pending job status functions: create_job_entry_pending, update_job_status_to_started, get_job_by_model_benchmark, get_latest_job_for_model_benchmark, and create_job_entry_started.
  • database/unified_db/complete_schema.sql
    • Added a duplicate_of column to the models table, referencing models(id) with ON DELETE RESTRICT.
    • Added a models_no_self_duplicate constraint to prevent models from being a duplicate of themselves.
    • Added an index idx_models_duplicate_of for the new duplicate_of column in the models table.
    • Added a duplicate_of column to the benchmarks table, referencing benchmarks(id) with ON DELETE RESTRICT.
    • Added a benchmarks_no_self_duplicate constraint to prevent benchmarks from being a duplicate of themselves.
    • Added an index idx_benchmarks_duplicate_of for the new duplicate_of column in the benchmarks table.
    • Updated schema comment to reflect merging with DC-Agents additions for duplicate_of support.
  • database/unified_db/config.py
    • Configured the Supabase client to use ClientOptions with postgrest_client_timeout and storage_client_timeout set to 30 seconds.
  • database/unified_db/models.py
    • Added an optional duplicate_of field (UUID) to ModelModel to reference canonical models.
    • Added a serializer for the new duplicate_of field in ModelModel.
    • Included the duplicate_of field in the clean_model_metadata function.
    • Added an optional duplicate_of field (UUID) to BenchmarkModel to reference canonical benchmarks.
    • Added a serializer for the new duplicate_of field in BenchmarkModel.
    • Included the duplicate_of field in the clean_benchmark_metadata function.
    • Made n_trials, config, and n_rep_eval fields optional in SandboxJobModel.
    • Added optional submitted_at (datetime) and slurm_job_id (string) fields to SandboxJobModel.
    • Updated job_status comment in SandboxJobModel to reflect 'Pending', 'Started', 'Finished' states.
    • Added a serializer for the new submitted_at field in SandboxJobModel.
    • Included the new submitted_at and slurm_job_id fields in the clean_sandbox_job_metadata function.
  • database/unified_db/requirements.txt
    • Updated the supabase dependency from ==2.22.3 to a flexible version range >=2.0.0,<3.0.0.
  • database/unified_db/utils.py
    • Imported additional modules: io, tarfile, tempfile, dataclass, PurePosixPath, Iterator, Sequence.
    • Implemented inf/nan sanitization logic within create_model to handle non-JSON compliant float values by converting them to string representations.
    • Added an optional duplicate_of parameter to register_hf_model and register_local_model functions.
    • Implemented validation for the duplicate_of field in register_hf_model and register_local_model to ensure it points to a valid, non-duplicate canonical entry.
    • Added an optional duplicate_of parameter to register_benchmark function.
    • Implemented validation for the duplicate_of field in register_benchmark to ensure it points to a valid, non-duplicate canonical entry.
    • Added a check in delete_model_by_id to prevent deletion if the model is marked as canonical for other duplicates.
    • Added a check in delete_benchmark_by_id to prevent deletion if the benchmark is marked as canonical for other duplicates.
    • Introduced new constants JOB_STATUS_PENDING, JOB_STATUS_STARTED, JOB_STATUS_FINISHED.
    • Added get_job_by_model_benchmark function to retrieve the most recent job for a given model and benchmark.
    • Added get_latest_job_for_model_benchmark function to retrieve the most recent job using HuggingFace names.
    • Added create_job_entry_pending function to create a minimal job entry with 'Pending' status upon SLURM submission.
    • Added update_job_status_to_started function to transition a 'Pending' job to 'Started' status with full details.
    • Added create_job_entry_started function to create a fully populated job entry with 'Started' status, handling existing 'Pending' entries.
  • eval/tacc/unified_eval_harbor.sbatch
    • Added a new SLURM batch script that serves as a unified evaluation runner.
    • Implemented parameterized configuration via environment variables for N_CONCURRENT, N_ATTEMPTS, GPU_MEMORY_UTIL, DAYTONA_ERROR_THRESHOLD, VLLM_MAX_RETRIES, AGENT_PARSER, ENABLE_THINKING, and AGENT_NAME.
    • Included logic to start a VLLM server and wait for its readiness.
    • Integrated Python snippets to update job status in the database (Pending to Started) and to upload evaluation results.
    • Added error handling to check for excessive DaytonaErrors before uploading results, allowing jobs to complete without upload if a threshold is exceeded.
  • eval/tacc/unified_eval_listener.py
    • Added a new Python script that consolidates multiple individual evaluation listeners into a single, unified listener.
    • Defined presets for various benchmarks (aider, bfcl, swebench, v2, tb2, dev) with specific configurations.
    • Implemented hot-reloading of priority model lists from a file, allowing dynamic updates without restarting the listener.
    • Provided command-line arguments and environment variables for extensive configuration of datasets, sbatch scripts, timing, priority filtering, and sbatch parameters.
    • Included logic to fetch recent models from Supabase, resolve HuggingFace model names, and check for model existence on HuggingFace Hub.
    • Developed check_job_status and should_start_job functions to manage job lifecycle, including detecting stale 'Started' or 'Pending' jobs and canceling old SLURM jobs.
    • Integrated with the new unified_eval_harbor.sbatch script for submitting evaluation jobs.
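Hot-reloading a priority list is typically done by re-reading the file only when its modification time changes. A minimal sketch of that pattern, assuming a plain text file with one model per line (the class name and file format are illustrative, not the listener's actual implementation):

```python
import os

class PriorityFile:
    """Reload a priority model list only when the file's mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.models = []

    def load(self):
        """Return the current list, re-reading the file if it was modified."""
        try:
            mtime = os.path.getmtime(self.path)
        except FileNotFoundError:
            # Missing file means no priority filtering.
            self.models = []
            return self.models
        if mtime != self._mtime:
            self._mtime = mtime
            with open(self.path) as f:
                self.models = [
                    line.strip() for line in f
                    if line.strip() and not line.startswith("#")
                ]
        return self.models
```

Calling load() on every polling iteration is cheap (a stat per loop) yet picks up edits without restarting the listener.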
Activity
  • The pull request was created by richardzhuang0412.
  • The initial changes were generated using Claude Code.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant new functionality by merging database changes for deduplication and pending job statuses, and adding a unified evaluation listener and sbatch script. A critical command injection vulnerability was identified in the unified_eval_harbor.sbatch script due to the unsafe use of eval on a command string built from unvalidated database-derived arguments, which could allow an attacker to execute arbitrary commands on the compute cluster. Additionally, the review focuses on improving maintainability by reducing code duplication, ensuring consistency between the Python models and the database schema, and addressing critical hardcoded paths in new scripts that must be resolved before deployment.

Comment on lines +262 to +291
HARBOR_CMD="harbor jobs start"
HARBOR_CMD="$HARBOR_CMD -p \"$DATASET_PATH\""
HARBOR_CMD="$HARBOR_CMD --n-concurrent $N_CONCURRENT"
HARBOR_CMD="$HARBOR_CMD --agent \"$AGENT_NAME\""
HARBOR_CMD="$HARBOR_CMD --model \"hosted_vllm/$MODEL\""
HARBOR_CMD="$HARBOR_CMD --env \"daytona\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"api_base=http://localhost:8000/v1\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"key=fake_key\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"max_tokens=16384\""
HARBOR_CMD="$HARBOR_CMD --n-attempts $N_ATTEMPTS"
HARBOR_CMD="$HARBOR_CMD --export-traces"
HARBOR_CMD="$HARBOR_CMD --job-name \"$RUN_TAG\""
HARBOR_CMD="$HARBOR_CMD --config \"dcagent_eval_config.yaml\""

# Add parser arg if specified (e.g., for swebench)
if [ -n "$AGENT_PARSER" ]; then
    HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"parser=$AGENT_PARSER\""
fi

# Add extra_body for thinking control (only if enable_thinking is true)
if [ "$ENABLE_THINKING" = "true" ]; then
    HARBOR_CMD="$HARBOR_CMD --agent-kwarg 'extra_body={\"chat_template_kwargs\":{\"enable_thinking\":true}}'"
    echo "Thinking blocks enabled via extra_body"
fi

echo "Running: $HARBOR_CMD"

# Run sandbox evaluation
set +e
eval $HARBOR_CMD


Severity: critical (security)

The script uses eval to execute a command string (HARBOR_CMD) that is constructed using unvalidated positional arguments and environment variables. Specifically, the MODEL ($1) and RUN_TAG ($4) arguments, which are derived from the database in unified_eval_listener.py, can contain shell metacharacters (e.g., ;, ", `) that allow an attacker to break out of the intended command and execute arbitrary shell commands on the TACC cluster. This is a critical command injection vulnerability.

To remediate this, avoid using eval. Instead, build the command using a bash array and execute it directly. This ensures that arguments are passed as distinct strings and not interpreted by the shell.

HARBOR_CMD=("harbor" "jobs" "start")
HARBOR_CMD+=("-p" "$DATASET_PATH")
HARBOR_CMD+=("--n-concurrent" "$N_CONCURRENT")
HARBOR_CMD+=("--agent" "$AGENT_NAME")
HARBOR_CMD+=("--model" "hosted_vllm/$MODEL")
HARBOR_CMD+=("--env" "daytona")
HARBOR_CMD+=("--agent-kwarg" "api_base=http://localhost:8000/v1")
HARBOR_CMD+=("--agent-kwarg" "key=fake_key")
HARBOR_CMD+=("--agent-kwarg" "max_tokens=16384")
HARBOR_CMD+=("--n-attempts" "$N_ATTEMPTS")
HARBOR_CMD+=("--export-traces")
HARBOR_CMD+=("--job-name" "$RUN_TAG")
HARBOR_CMD+=("--config" "dcagent_eval_config.yaml")

if [ -n "$AGENT_PARSER" ]; then
    HARBOR_CMD+=("--agent-kwarg" "parser=$AGENT_PARSER")
fi

if [ "$ENABLE_THINKING" = "true" ]; then
    HARBOR_CMD+=("--agent-kwarg" 'extra_body={"chat_template_kwargs":{"enable_thinking":true}}')
fi

"${HARBOR_CMD[@]}"


python - <<'PY'
import os, sys, json
sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard")


critical

This script contains a hardcoded path: /scratch/08134/negin/dc-agent-shared/dcagents-leaderboard. This makes the script non-portable and tightly coupled to a specific user's environment on TACC. This path should be made configurable, for example, through an environment variable that can be set in the execution environment or sourced from a config file. This issue is also present on lines 97, 100, 101, 105, and 233.

from typing import Dict, List, Optional, Set, Tuple

# Add leaderboard utilities to path
sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard")


critical

The script has a hardcoded path: sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard"). This makes the script non-portable and will cause it to fail on any machine where this specific path does not exist. This path should be configured via an environment variable or by installing the dcagents-leaderboard package in a way that it's available on the PYTHONPATH.

Comment on lines +4663 to +4665
"config": config or {},
"n_trials": 0,
"n_rep_eval": 0,


high

The SandboxJobModel in models.py defines n_trials and n_rep_eval as Optional, but the database schema in complete_schema.sql appears to define them as INTEGER NOT NULL. Here in create_job_entry_pending, you are setting them to 0, which works around the NOT NULL constraint but is semantically incorrect for a pending job where these values are unknown. The database schema should be updated to allow NULL for these fields to align with the Pydantic model and correctly represent the state of a pending job.

Comment on lines +428 to +485
python - <<'PY' 2>&1 | tee -a "$UPLOAD_LOG"
import os, sys
from unified_db.utils import upload_eval_results
import re
import hashlib


def sanitize_hf_repo_id(repo_id: str, max_length: int = 96) -> str:
    """
    Sanitize a Hugging Face repo_id to comply with naming rules.
    Keeps org prefix (e.g. 'mlfoundations-dev/') and cleans up the rest.
    No extra '-' before hash suffix.
    """
    def collapse(s: str) -> str:
        prev = None
        while s != prev:
            prev = s
            s = s.replace("--", "-").replace("..", ".")
        return s

    org, name = repo_id.split("/", 1) if "/" in repo_id else (None, repo_id)

    name = re.sub(r"[^A-Za-z0-9._-]", "-", name)
    name = collapse(name).strip("-.")

    if not name:
        name = "repo"

    limit = max_length - (len(org) + 1 if org else 0)
    if len(name) > limit:
        digest = hashlib.sha1(name.encode()).hexdigest()[:8]
        keep = max(1, limit - len(digest))
        base = name[:keep].rstrip("-.")
        if not base:
            base = "r"
        name = f"{base}{digest}"  # no '-' before hash
        name = collapse(name).strip("-.")

    # final cleanup
    name = collapse(name).strip("-.")
    if name[0] in "-.":
        name = "r" + name[1:]
    if name[-1] in "-.":
        name = name[:-1] + "0"

    return f"{org}/{name}" if org else name


run_dir = os.environ["RUN_DIR"]
run_tag = os.environ["RUN_TAG"]
username = os.environ.get("UPLOAD_USERNAME", "negin")
error_mode = os.environ.get("UPLOAD_MODE", "skip_on_error")
hf_repo_id = sanitize_hf_repo_id(f"DCAgent2/{run_tag}")
hf_token = os.environ["HF_TOKEN"]
print(f"[uploader] upload_eval_results(path={run_dir!r}, username={username!r}, error_mode={error_mode!r}, hf_repo_id={hf_repo_id!r})")
upload_eval_results(run_dir, username=username, error_mode=error_mode, hf_token=hf_token, hf_repo_id=hf_repo_id, register_benchmark=True)
print("[uploader] done.")
PY


high

This script includes a very large and complex inline Python script for uploading results. Embedding significant logic within a bash script heredoc makes the code difficult to read, maintain, test, and debug. It also duplicates logic (like sanitize_hf_repo_id) that might be useful elsewhere. It would be much better to extract this Python code into a separate, standalone script that can be called from this sbatch script. This applies to the other Python blocks in this file as well (lines 175-223, 231-245, 334-362, 384-398).

Comment on lines +497 to +527
except ValueError as e:
    # Check if the error is specifically the JSON 'inf' compliant error
    if "Out of range float values" in str(e):
        logger.warning("Detected non-JSON compliant floats (inf/nan). Applying patch...")

        import math
        def sanitize(obj):
            if isinstance(obj, dict):
                return {k: sanitize(v) for k, v in obj.items()}
            elif isinstance(obj, list):
                return [sanitize(v) for v in obj]
            elif isinstance(obj, float):
                if math.isinf(obj):
                    logger.info("Replacing 'inf' float with string 'inf'")
                    return "inf"
                if math.isnan(obj):
                    logger.info("Replacing 'nan' float with string 'nan'")
                    return "nan"
            return obj

        # Sanitize and retry
        model_data = sanitize(model_data)
        client = get_supabase_client(use_admin=True)
        response = client.table('models').insert(model_data).execute()
        if not response.data:
            raise ValueError("Failed to create model")

        return clean_model_metadata(response.data[0])
    else:
        # Re-raise if it's a different ValueError
        raise e


medium

The sanitize function and import math are defined within the except ValueError block. It's better practice to define helper functions and imports at the module level for clarity, reusability, and to avoid re-definition on every exception. Additionally, this sanitization logic for inf/nan values is only applied to create_model. Other functions that handle JSONB data, such as update_model or functions creating/updating sandbox_jobs with config, metrics, or stats fields, could also benefit from this sanitization to prevent similar errors.
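Following that suggestion, the helper could be lifted to module level so every JSONB-writing path can reuse it. This is a sketch of the same recursive logic under an assumed name, not code from the PR:

```python
import math

def sanitize_json_floats(obj):
    """Recursively replace inf/nan floats with strings so payloads are JSON-compliant.

    Mirrors the inline `sanitize` helper in create_model; the module-level
    name `sanitize_json_floats` is an assumption for illustration.
    """
    if isinstance(obj, dict):
        return {k: sanitize_json_floats(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_json_floats(v) for v in obj]
    if isinstance(obj, float):
        if math.isinf(obj):
            return "inf"   # note: -inf also maps to "inf", matching the inline helper
        if math.isnan(obj):
            return "nan"
    return obj
```

Defined once at module level, the same call can guard update_model and the sandbox_jobs writers (config, metrics, stats) before every insert or update, rather than only on the create_model retry path.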

Comment on lines +619 to +626
# Validate duplicate_of if provided
if duplicate_of:
    client = get_supabase_client()
    target = client.table('models').select('id, duplicate_of').eq('id', duplicate_of).execute()
    if not target.data:
        return {"success": False, "error": f"duplicate_of target model {duplicate_of} not found"}
    if target.data[0].get('duplicate_of'):
        return {"success": False, "error": f"duplicate_of target {duplicate_of} is itself a duplicate. Only point to canonical entries."}


medium

This validation logic for the duplicate_of field is repeated in register_hf_model, register_local_model (lines 834-841), and register_benchmark (lines 1453-1460). To improve maintainability and reduce code duplication, consider extracting this logic into a reusable helper function. The function could take the table name and the ID to validate as arguments.
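A possible shape for such a helper, modeled on the excerpt above — the name, the injected client parameter, and the return convention (None on success, error dict on failure) are assumptions for illustration:

```python
def validate_duplicate_of(client, table: str, duplicate_of: str):
    """Shared duplicate_of validation: the target must exist and be canonical.

    Returns None if valid, or an error dict in the style of the register_*
    functions. `client` is any object exposing the Supabase-style
    table().select().eq().execute() chain.
    """
    target = client.table(table).select('id, duplicate_of').eq('id', duplicate_of).execute()
    if not target.data:
        return {"success": False,
                "error": f"duplicate_of target in {table} {duplicate_of} not found"}
    if target.data[0].get('duplicate_of'):
        return {"success": False,
                "error": f"duplicate_of target {duplicate_of} is itself a duplicate. Only point to canonical entries."}
    return None
```

Each of register_hf_model, register_local_model, and register_benchmark could then reduce its inline block to `if (err := validate_duplicate_of(client, 'models', duplicate_of)): return err`.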

trap cleanup EXIT

# Wait for VLLM server to start with healthcheck
RETRY_INTERVAL=100


medium

The RETRY_INTERVAL for the VLLM server healthcheck is set to 100 seconds. This seems excessively long for a retry interval and might be a typo. A shorter interval, such as 10 or 15 seconds, would allow for faster job startup or failure detection.

Suggested change:
- RETRY_INTERVAL=100
+ RETRY_INTERVAL=10

@penfever
Collaborator

penfever commented Mar 5, 2026

@richardzhuang0412 , can you please update this to the new v3 listener, address the Gemini notes and merge the latest from main? Would be great to get this merged soon!
