sync: merge dcagents-leaderboard DB changes + add unified eval listener #18

richardzhuang0412 wants to merge 2 commits into main
Conversation
Incorporate upstream changes from the dcagents-leaderboard repo into the unified_db database module while preserving all OT-specific features.

Schema changes (complete_schema.sql):
- Add duplicate_of column to models table (FK self-ref, ON DELETE RESTRICT)
- Add duplicate_of column to benchmarks table (same pattern)
- Add self-reference prevention constraints and indexes for both

Models (models.py):
- Add duplicate_of field + serializer to ModelModel and BenchmarkModel
- Add duplicate_of to clean_model_metadata and clean_benchmark_metadata
- Make SandboxJobModel.n_trials/config/n_rep_eval Optional for Pending jobs
- Add submitted_at and slurm_job_id fields to SandboxJobModel

Config (config.py):
- Add ClientOptions with 30s postgrest/storage timeouts

Utils (utils.py):
- Add duplicate_of parameter + validation to register_hf_model, register_local_model, and register_benchmark
- Add duplicate_of to model_data dicts in registration functions
- Add inf/nan sanitization retry logic to create_model
- Add duplicate_of FK checks to delete_model_by_id and delete_benchmark_by_id (prevent deleting canonical entries)
- Fix cardData null-safety in HF model description extraction
- Add pending job status system: JOB_STATUS constants, get_job_by_model_benchmark, get_latest_job_for_model_benchmark, create_job_entry_pending, update_job_status_to_started, create_job_entry_started

Exports (__init__.py):
- Add calculate_standard_error and all pending job status functions

Dependencies (requirements.txt):
- Relax supabase pin from ==2.22.3 to >=2.0.0,<3.0.0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add unified_eval_listener.py and unified_eval_harbor.sbatch that consolidate all individual eval listeners (aider, bfcl, swebench, v2, tb2, dev) and sbatch scripts into parameterized versions.

unified_eval_listener.py:
- Preset-based configuration (--preset bfcl/swebench/v2/tb2/aider/dev)
- Priority file hot-reload for model filtering
- Configurable sbatch parameters (n-concurrent, daytona-threshold, etc.)
- Pending job status tracking with DB integration (Pending → Started → Finished)
- Stale job detection and SLURM job cancellation for pending jobs
- HuggingFace model existence validation (optional, per-preset)
- Dry-run and single-iteration modes
- CLI args > ENV vars > Preset defaults priority chain

unified_eval_harbor.sbatch:
- Parameterized via environment variables (replaces hardcoded per-benchmark scripts)
- Supports Pending → Started DB job status transitions
- Configurable VLLM GPU memory, thinking blocks, agent parser
- DaytonaError threshold checking before result upload
- Backward compatible: works both with the listener (RUN_TAG provided) and standalone

Both scripts use the pending job status functions added in the previous DB merge commit (create_job_entry_pending, update_job_status_to_started, create_job_entry_started, get_latest_job_for_model_benchmark).

Note: Hardcoded TACC paths currently point to the dc-agent-shared deployment; update to OpenThoughts-Agent-shared paths when deploying to the OT cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello @richardzhuang0412, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request refactors and extends the evaluation and database interaction systems. It introduces deduplication support for models and benchmarks, establishes a pending job status system for tracking evaluation runs, and unifies the evaluation listener and SLURM submission scripts into configurable, centralized components. These changes aim to improve data integrity, streamline the evaluation workflow, and reduce redundancy across evaluation benchmarks.

Highlights
Changelog
Activity
Code Review
This pull request introduces significant new functionality by merging database changes for deduplication and pending job statuses, and adding a unified evaluation listener and sbatch script. A critical command injection vulnerability was identified in the unified_eval_harbor.sbatch script due to the unsafe use of eval on a command string built from unvalidated database-derived arguments, which could allow an attacker to execute arbitrary commands on the compute cluster. Additionally, the review focuses on improving maintainability by reducing code duplication, ensuring consistency between the Python models and the database schema, and addressing critical hardcoded paths in new scripts that must be resolved before deployment.
```bash
HARBOR_CMD="harbor jobs start"
HARBOR_CMD="$HARBOR_CMD -p \"$DATASET_PATH\""
HARBOR_CMD="$HARBOR_CMD --n-concurrent $N_CONCURRENT"
HARBOR_CMD="$HARBOR_CMD --agent \"$AGENT_NAME\""
HARBOR_CMD="$HARBOR_CMD --model \"hosted_vllm/$MODEL\""
HARBOR_CMD="$HARBOR_CMD --env \"daytona\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"api_base=http://localhost:8000/v1\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"key=fake_key\""
HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"max_tokens=16384\""
HARBOR_CMD="$HARBOR_CMD --n-attempts $N_ATTEMPTS"
HARBOR_CMD="$HARBOR_CMD --export-traces"
HARBOR_CMD="$HARBOR_CMD --job-name \"$RUN_TAG\""
HARBOR_CMD="$HARBOR_CMD --config \"dcagent_eval_config.yaml\""

# Add parser arg if specified (e.g., for swebench)
if [ -n "$AGENT_PARSER" ]; then
    HARBOR_CMD="$HARBOR_CMD --agent-kwarg \"parser=$AGENT_PARSER\""
fi

# Add extra_body for thinking control (only if enable_thinking is true)
if [ "$ENABLE_THINKING" = "true" ]; then
    HARBOR_CMD="$HARBOR_CMD --agent-kwarg 'extra_body={\"chat_template_kwargs\":{\"enable_thinking\":true}}'"
    echo "Thinking blocks enabled via extra_body"
fi

echo "Running: $HARBOR_CMD"

# Run sandbox evaluation
set +e
eval $HARBOR_CMD
```
The script uses eval to execute a command string (HARBOR_CMD) that is constructed using unvalidated positional arguments and environment variables. Specifically, the MODEL ($1) and RUN_TAG ($4) arguments, which are derived from the database in unified_eval_listener.py, can contain shell metacharacters (e.g., ;, ", `) that allow an attacker to break out of the intended command and execute arbitrary shell commands on the TACC cluster. This is a critical command injection vulnerability.
To remediate this, avoid using eval. Instead, build the command using a bash array and execute it directly. This ensures that arguments are passed as distinct strings and not interpreted by the shell.
```bash
HARBOR_CMD=("harbor" "jobs" "start")
HARBOR_CMD+=("-p" "$DATASET_PATH")
HARBOR_CMD+=("--n-concurrent" "$N_CONCURRENT")
HARBOR_CMD+=("--agent" "$AGENT_NAME")
HARBOR_CMD+=("--model" "hosted_vllm/$MODEL")
HARBOR_CMD+=("--env" "daytona")
HARBOR_CMD+=("--agent-kwarg" "api_base=http://localhost:8000/v1")
HARBOR_CMD+=("--agent-kwarg" "key=fake_key")
HARBOR_CMD+=("--agent-kwarg" "max_tokens=16384")
HARBOR_CMD+=("--n-attempts" "$N_ATTEMPTS")
HARBOR_CMD+=("--export-traces")
HARBOR_CMD+=("--job-name" "$RUN_TAG")
HARBOR_CMD+=("--config" "dcagent_eval_config.yaml")
if [ -n "$AGENT_PARSER" ]; then
    HARBOR_CMD+=("--agent-kwarg" "parser=$AGENT_PARSER")
fi
if [ "$ENABLE_THINKING" = "true" ]; then
    HARBOR_CMD+=("--agent-kwarg" 'extra_body={"chat_template_kwargs":{"enable_thinking":true}}')
fi
"${HARBOR_CMD[@]}"
```
```bash
python - <<'PY'
import os, sys, json
sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard")
```
This script contains a hardcoded path: /scratch/08134/negin/dc-agent-shared/dcagents-leaderboard. This makes the script non-portable and tightly coupled to a specific user's environment on TACC. This path should be made configurable, for example, through an environment variable that can be set in the execution environment or sourced from a config file. This issue is also present on lines 97, 100, 101, 105, and 233.
```python
from typing import Dict, List, Optional, Set, Tuple

# Add leaderboard utilities to path
sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard")
```
The script has a hardcoded path: sys.path.insert(0, "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard"). This makes the script non-portable and will cause it to fail on any machine where this specific path does not exist. This path should be configured via an environment variable or by installing the dcagents-leaderboard package in a way that it's available on the PYTHONPATH.
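One hedged way to do this, sketched below: resolve the checkout location from an environment variable, keeping the current deployment path only as a fallback default. The variable name `LEADERBOARD_ROOT` is an assumption, not something the PR defines.

```python
import os
import sys

# Resolve the leaderboard checkout from an environment variable rather than a
# hardcoded TACC path. LEADERBOARD_ROOT is a hypothetical variable name; the
# current deployment path is kept only as a fallback default so existing jobs
# keep working until the environment is updated.
LEADERBOARD_ROOT = os.environ.get(
    "LEADERBOARD_ROOT",
    "/scratch/08134/negin/dc-agent-shared/dcagents-leaderboard",
)
if LEADERBOARD_ROOT not in sys.path:
    sys.path.insert(0, LEADERBOARD_ROOT)
```

The same variable could then be exported once in the sbatch script and inherited by every heredoc, instead of repeating the literal path in each block.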
```python
"config": config or {},
"n_trials": 0,
"n_rep_eval": 0,
```
The SandboxJobModel in models.py defines n_trials and n_rep_eval as Optional, but the database schema in complete_schema.sql appears to define them as INTEGER NOT NULL. Here in create_job_entry_pending, you are setting them to 0, which works around the NOT NULL constraint but is semantically incorrect for a pending job where these values are unknown. The database schema should be updated to allow NULL for these fields to align with the Pydantic model and correctly represent the state of a pending job.
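A minimal sketch of the intended alignment, using a plain dataclass as a stand-in for the project's Pydantic `SandboxJobModel` (field names come from the PR description; the real model and the corresponding NULL-able schema columns may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingSandboxJob:
    """Simplified stand-in for SandboxJobModel: a job in the Pending state
    has no trial counts yet, so these fields are Optional and default to
    None rather than a placeholder 0. The DB columns would need to allow
    NULL for this to round-trip correctly."""
    model_id: str
    benchmark_id: str
    status: str = "Pending"
    n_trials: Optional[int] = None
    n_rep_eval: Optional[int] = None
    config: Optional[dict] = None

# A freshly created pending job carries no counts until the run starts.
job = PendingSandboxJob(model_id="m1", benchmark_id="b1")
```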
```bash
python - <<'PY' 2>&1 | tee -a "$UPLOAD_LOG"
import os, sys
from unified_db.utils import upload_eval_results
import re
import hashlib


def sanitize_hf_repo_id(repo_id: str, max_length: int = 96) -> str:
    """
    Sanitize a Hugging Face repo_id to comply with naming rules.
    Keeps org prefix (e.g. 'mlfoundations-dev/') and cleans up the rest.
    No extra '-' before hash suffix.
    """
    def collapse(s: str) -> str:
        prev = None
        while s != prev:
            prev = s
            s = s.replace("--", "-").replace("..", ".")
        return s

    org, name = repo_id.split("/", 1) if "/" in repo_id else (None, repo_id)

    name = re.sub(r"[^A-Za-z0-9._-]", "-", name)
    name = collapse(name).strip("-.")

    if not name:
        name = "repo"

    limit = max_length - (len(org) + 1 if org else 0)
    if len(name) > limit:
        digest = hashlib.sha1(name.encode()).hexdigest()[:8]
        keep = max(1, limit - len(digest))
        base = name[:keep].rstrip("-.")
        if not base:
            base = "r"
        name = f"{base}{digest}"  # no '-' before hash
        name = collapse(name).strip("-.")

    # final cleanup
    name = collapse(name).strip("-.")
    if name[0] in "-.":
        name = "r" + name[1:]
    if name[-1] in "-.":
        name = name[:-1] + "0"

    return f"{org}/{name}" if org else name


run_dir = os.environ["RUN_DIR"]
run_tag = os.environ["RUN_TAG"]
username = os.environ.get("UPLOAD_USERNAME", "negin")
error_mode = os.environ.get("UPLOAD_MODE", "skip_on_error")
hf_repo_id = sanitize_hf_repo_id(f"DCAgent2/{run_tag}")
hf_token = os.environ["HF_TOKEN"]
print(f"[uploader] upload_eval_results(path={run_dir!r}, username={username!r}, error_mode={error_mode!r}, hf_repo_id={hf_repo_id!r})")
upload_eval_results(run_dir, username=username, error_mode=error_mode, hf_token=hf_token, hf_repo_id=hf_repo_id, register_benchmark=True)
print("[uploader] done.")
PY
```
This script includes a very large and complex inline Python script for uploading results. Embedding significant logic within a bash script heredoc makes the code difficult to read, maintain, test, and debug. It also duplicates logic (like sanitize_hf_repo_id) that might be useful elsewhere. It would be much better to extract this Python code into a separate, standalone script that can be called from this sbatch script. This applies to the other Python blocks in this file as well (lines 175-223, 231-245, 334-362, 384-398).
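One possible shape for such an extraction is sketched below: a standalone uploader whose CLI defaults still read the environment variables the sbatch script already exports, so the heredoc collapses to a single `python upload_results.py` call. The filename, the flag names, and `build_parser` are assumptions; `unified_db.utils.upload_eval_results` is the function this PR adds.

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # Environment variables stay as defaults so the sbatch script can keep
    # exporting RUN_DIR / RUN_TAG / UPLOAD_USERNAME / UPLOAD_MODE as today,
    # while tests and manual runs can override them via flags.
    p = argparse.ArgumentParser(description="Upload Harbor eval results to the leaderboard DB")
    p.add_argument("--run-dir", default=os.environ.get("RUN_DIR"))
    p.add_argument("--run-tag", default=os.environ.get("RUN_TAG"))
    p.add_argument("--username", default=os.environ.get("UPLOAD_USERNAME", "negin"))
    p.add_argument("--error-mode", default=os.environ.get("UPLOAD_MODE", "skip_on_error"))
    return p

def main() -> None:
    args = build_parser().parse_args()
    # Imported lazily so the parser itself can be unit-tested on machines
    # without the unified_db package installed.
    from unified_db.utils import upload_eval_results
    upload_eval_results(args.run_dir, username=args.username, error_mode=args.error_mode)
```

This also makes the `sanitize_hf_repo_id` helper importable from one place rather than re-pasted into each heredoc.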
```python
except ValueError as e:
    # Check if the error is specifically the JSON 'inf' compliant error
    if "Out of range float values" in str(e):
        logger.warning("Detected non-JSON compliant floats (inf/nan). Applying patch...")

        import math

        def sanitize(obj):
            if isinstance(obj, dict):
                return {k: sanitize(v) for k, v in obj.items()}
            elif isinstance(obj, list):
                return [sanitize(v) for v in obj]
            elif isinstance(obj, float):
                if math.isinf(obj):
                    logger.info("Replacing 'inf' float with string 'inf'")
                    return "inf"
                if math.isnan(obj):
                    logger.info("Replacing 'nan' float with string 'nan'")
                    return "nan"
            return obj

        # Sanitize and retry
        model_data = sanitize(model_data)
        client = get_supabase_client(use_admin=True)
        response = client.table('models').insert(model_data).execute()
        if not response.data:
            raise ValueError("Failed to create model")

        return clean_model_metadata(response.data[0])
    else:
        # Re-raise if it's a different ValueError
        raise e
```
The sanitize function and import math are defined within the except ValueError block. It's better practice to define helper functions and imports at the module level for clarity, reusability, and to avoid re-definition on every exception. Additionally, this sanitization logic for inf/nan values is only applied to create_model. Other functions that handle JSONB data, such as update_model or functions creating/updating sandbox_jobs with config, metrics, or stats fields, could also benefit from this sanitization to prevent similar errors.
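A module-level version might look like the following sketch (the name `sanitize_json_floats` is hypothetical), which any JSONB-writing helper could call on its payload before insert/update:

```python
import math
from typing import Any

def sanitize_json_floats(obj: Any) -> Any:
    """Recursively replace inf/nan floats with the strings 'inf'/'nan' so
    payloads are JSON-compliant. Defined at module level so create_model,
    update_model, and the sandbox_jobs helpers can all reuse it instead of
    redefining it inside an except block."""
    if isinstance(obj, dict):
        return {k: sanitize_json_floats(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_json_floats(v) for v in obj]
    if isinstance(obj, float):
        if math.isinf(obj):
            return "inf"
        if math.isnan(obj):
            return "nan"
    return obj

cleaned = sanitize_json_floats({"metrics": {"loss": float("inf")},
                                "scores": [float("nan"), 0.5]})
```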
```python
# Validate duplicate_of if provided
if duplicate_of:
    client = get_supabase_client()
    target = client.table('models').select('id, duplicate_of').eq('id', duplicate_of).execute()
    if not target.data:
        return {"success": False, "error": f"duplicate_of target model {duplicate_of} not found"}
    if target.data[0].get('duplicate_of'):
        return {"success": False, "error": f"duplicate_of target {duplicate_of} is itself a duplicate. Only point to canonical entries."}
```
This validation logic for the duplicate_of field is repeated in register_hf_model, register_local_model (lines 834-841), and register_benchmark (lines 1453-1460). To improve maintainability and reduce code duplication, consider extracting this logic into a reusable helper function. The function could take the table name and the ID to validate as arguments.
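A sketch of such a helper (the name and exact error messages are assumptions; the query chain mirrors the supabase-py style calls shown in the diff):

```python
from typing import Optional

def validate_duplicate_of(client, table: str, duplicate_of: str) -> Optional[dict]:
    """Shared duplicate_of validation for the models and benchmarks tables.

    Returns an error dict matching the registration functions' error shape
    when the target row is missing or is itself a duplicate; returns None
    when the target is a valid canonical entry.
    """
    target = client.table(table).select("id, duplicate_of").eq("id", duplicate_of).execute()
    if not target.data:
        return {"success": False,
                "error": f"duplicate_of target {duplicate_of} not found in {table}"}
    if target.data[0].get("duplicate_of"):
        return {"success": False,
                "error": f"duplicate_of target {duplicate_of} is itself a duplicate. "
                         "Only point to canonical entries."}
    return None
```

Each registration function would then reduce its duplicated block to `err = validate_duplicate_of(client, 'models', duplicate_of)` followed by `if err: return err`.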
```bash
trap cleanup EXIT

# Wait for VLLM server to start with healthcheck
RETRY_INTERVAL=100
```
The RETRY_INTERVAL for the VLLM server healthcheck is set to 100 seconds. This seems excessively long for a retry interval and might be a typo. A shorter interval, such as 10 or 15 seconds, would allow for faster job startup or failure detection.
Suggested change:

```bash
RETRY_INTERVAL=10
```
@richardzhuang0412, can you please update this to the new v3 listener, address the Gemini notes, and merge the latest from main? Would be great to get this merged soon!
Summary
Syncs DB changes from mlfoundations/dcagents-leaderboard: adds `duplicate_of` deduplication support to the models/benchmarks tables, a pending job status system (Pending → Started → Finished), Supabase client timeout configuration, and flexible dependency versioning.

Files changed (8 files, +2190/-15)
Database (`database/unified_db/`):
- `complete_schema.sql` — `duplicate_of` columns + self-reference constraints on models/benchmarks
- `models.py` — `duplicate_of` fields, Optional SandboxJob fields, `submitted_at`/`slurm_job_id`
- `config.py` — `ClientOptions` with postgrest/storage timeouts
- `utils.py` — duplicate_of validation, inf/nan sanitization, pending job status functions
- `__init__.py` — new exports for pending job functions
- `requirements.txt` — `supabase>=2.0.0,<3.0.0` (flexible)

Eval (`eval/tacc/`):
- `unified_eval_listener.py` — preset-based listener with priority filtering, stale job detection, dry-run mode
- `unified_eval_harbor.sbatch` — parameterized sbatch with env var configuration

Test plan
- … (`duplicate_of` columns)
- `--dry-run --once` on each preset
- `dc-agent-shared` → `OpenThoughts-Agent-shared` before deployment

🤖 Generated with Claude Code