127 changes: 127 additions & 0 deletions dev_utils/list_stale_file_runs.py
@@ -0,0 +1,127 @@
#!/usr/bin/env python3
import argparse
import os

# List of error patterns to search for in solve.out
ERROR_PATTERNS = [
    "error reading input file: Stale file handle"
]


def check_solve_out_for_errors(solve_out_path: str):
    """
    Check if solve.out contains any of the error patterns.
    Returns a list of matched patterns, or empty list if none found.
    """
    if not os.path.exists(solve_out_path):
        if "baseline" not in solve_out_path:
            print(solve_out_path)
        return []

    with open(solve_out_path, "r") as f:
        content = f.read()

    matched_patterns = []
    for pattern in ERROR_PATTERNS:
        if pattern in content:
            matched_patterns.append(pattern)

    return matched_patterns


def get_latest_runs(method_path: str):
    """
    Scans a method directory and returns a list of paths corresponding
    to the latest run_id for every (benchmark, model) pair.
    """
    # key: (benchmark, model) -> value: {"run_id": int, "path": str}
    latest_runs = {}

    for entry in os.listdir(method_path):
        entry_path = os.path.join(method_path, entry)
        if not os.path.isdir(entry_path):
            continue
        try:
            benchmark, _, model, run_id_str = entry.split("_")
            run_id = int(run_id_str)
        except ValueError:
            # Skip entries that don't match the expected format
            continue
        key = (benchmark, model)

        # keep only highest run_id per (benchmark, model)
        if key not in latest_runs or run_id > latest_runs[key]["run_id"]:
            latest_runs[key] = {
                "run_id": run_id,
                "path": entry_path,
            }

    return [info["path"] for info in latest_runs.values()]


def get_results_dir():
    return "/fast/hbhatnagar/ptb_results"
    # return os.environ.get("POST_TRAIN_BENCH_RESULTS_DIR", 'results')


def main():
    parser = argparse.ArgumentParser(description="Check for API errors in results")
    parser.add_argument(
        "results_dir",
        nargs="?",
        default=None,
    )
    args = parser.parse_args()

    results_dir = args.results_dir if args.results_dir else get_results_dir()

    # Dict to collect runs by error pattern
    errors_by_pattern = {pattern: [] for pattern in ERROR_PATTERNS}
    all_errors_list = []

    # 1. Iterate over all methods and collect paths
    for method_name in os.listdir(results_dir):
        method_path = os.path.join(results_dir, method_name)
        if not os.path.isdir(method_path):
            continue

        # Get only the latest runs for this method to avoid reporting old overwritten runs
        run_paths = get_latest_runs(method_path)

        for run_path in run_paths:
            # Check solve.out for error patterns
            solve_out_path = os.path.join(run_path, "error.log")
            matched_patterns = check_solve_out_for_errors(solve_out_path)

            if matched_patterns:
                all_errors_list.append((run_path, matched_patterns))
                for pattern in matched_patterns:
                    errors_by_pattern[pattern].append(run_path)

    # 2. Output summary
    print(f"=== API ERRORS DETECTED ({len(all_errors_list)} runs affected) ===\n")

    # Show breakdown by pattern
    for pattern in ERROR_PATTERNS:
        affected_runs = errors_by_pattern[pattern]
        print(f"Pattern: \"{pattern}\"")
        print(f" Affected runs: {len(affected_runs)}")
        if affected_runs:
            for path in sorted(affected_runs):
                print(f" - {path}")
        print()

    # Show combined list
    print("-" * 40)
    print(f"\n=== ALL AFFECTED RUNS ({len(all_errors_list)}) ===")
    if all_errors_list:
        for path, patterns in sorted(all_errors_list):
            print(f"{path}")
            for p in patterns:
                print(f" -> {p[:60]}...")
    else:
        print("None")


if __name__ == "__main__":
    main()
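
Review note on `get_latest_runs`: the selection rule can be exercised without touching the filesystem. The sketch below applies the same four-field `split("_")` and highest-`run_id` rule to a plain list of names. The helper name `latest_run_names` and the directory names are hypothetical, made up for illustration; only the parsing rule is taken from the script above.

```python
def latest_run_names(entries):
    """Return, sorted, the entry with the highest run_id per (benchmark, model)."""
    latest = {}  # (benchmark, model) -> (run_id, entry)
    for entry in entries:
        try:
            # Same four-field split as get_latest_runs; the second field is ignored.
            benchmark, _, model, run_id_str = entry.split("_")
            run_id = int(run_id_str)
        except ValueError:
            # Wrong number of fields or a non-numeric run_id: skip the entry.
            continue
        key = (benchmark, model)
        if key not in latest or run_id > latest[key][0]:
            latest[key] = (run_id, entry)
    return sorted(entry for _, entry in latest.values())


# Hypothetical directory names, for illustration only.
names = [
    "gsm8k_Qwen_Qwen3-4B_101",
    "gsm8k_Qwen_Qwen3-4B_205",
    "humaneval_Qwen_Qwen3-4B_150",
    "gsm8k_my_org_model_300",  # five fields, so it is skipped
    "notes",                   # one field, so it is skipped
]
print(latest_run_names(names))
```

Note that any name that does not split into exactly four underscore-separated fields is silently skipped, so a model or benchmark name containing an extra underscore would be left out of the report.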
20 changes: 20 additions & 0 deletions src/baselines/run_baseline.sh
@@ -1,8 +1,13 @@
#!/bin/bash

set -euo pipefail

source src/commit_utils/set_env_vars.sh

EVAL_NAME="$1"
MODEL_NAME="$2"
CLUSTER_ID="$3"
EPOCHS="${4:-5}"

set -euo pipefail

@@ -11,6 +16,8 @@ source src/commit_utils/set_env_vars.sh
REPO_ROOT="$(pwd)"
RESULT_PREFIX_SAFE=$(echo "${MODEL_NAME}" | tr '/:' '_')
RESULT_DIR="${POST_TRAIN_BENCH_RESULTS_DIR}/baseline/${EVAL_NAME}_${RESULT_PREFIX_SAFE}_${CLUSTER_ID}"
# Ensure RESULT_DIR is absolute (needed for apptainer --bind)
[[ "${RESULT_DIR}" != /* ]] && RESULT_DIR="${REPO_ROOT}/${RESULT_DIR}"

RANDOM_UUID=$(uuidgen)
TMP_SUBDIR="/tmp/posttrain_baseline_${EVAL_NAME}_${RESULT_PREFIX_SAFE}_${RANDOM_UUID}"
@@ -28,6 +35,7 @@ exec 2>${RESULT_DIR}/error.log
echo "Eval: ${EVAL_NAME}"
echo "Model: ${MODEL_NAME}"
echo "Cluster ID: ${CLUSTER_ID}"
echo "Epochs: ${EPOCHS}"

# Utils
with_huggingface_overlay() {
@@ -92,6 +100,7 @@ run_eval() {
--model-path "${MODEL_NAME}" \
--templates-dir ../../../../src/eval/templates \
--limit -1 \
--epochs "${EPOCHS}" \
--json-output-file "${RESULT_DIR}/metrics.json" > "${RESULT_DIR}/final_eval.txt"
}

@@ -109,6 +118,17 @@ echo "${MODEL_NAME}" > "${RESULT_DIR}/model.txt"
echo "${EVAL_NAME}" > "${RESULT_DIR}/eval.txt"
date --iso-8601=seconds > "${RESULT_DIR}/timestamp.txt"

# Copy the inspect logs for easier viewing from the same folder
LOGS_DIR="${REPO_ROOT}/src/eval/tasks/${EVAL_NAME}/logs"
if [ -d "${LOGS_DIR}" ]; then
    mkdir -p "${RESULT_DIR}/inspect_logs"
    # Find and copy only the most recent log file (created during this job)
    LATEST_LOG=$(ls -t "${LOGS_DIR}"/*.json 2>/dev/null | head -1)
    if [ -n "$LATEST_LOG" ]; then
        cp "$LATEST_LOG" "${RESULT_DIR}/inspect_logs/"
        # Quote the path so a log name containing spaces is not word-split
        echo "Inspect log copied: $(basename "$LATEST_LOG")"
    fi
fi
# Cleanup
rm -rf "${TMP_SUBDIR}"

6 changes: 3 additions & 3 deletions src/commit_utils/baselines/baseline_cluster.sub
@@ -1,13 +1,13 @@
executable = /bin/bash
arguments = src/baselines/run_baseline.sh $(eval) $(model) $(Cluster)
environment = "OPENAI_API_KEY=$ENV(OPENAI_API_KEY) ANTHROPIC_API_KEY=$ENV(ANTHROPIC_API_KEY) GEMINI_API_KEY=$ENV(GEMINI_API_KEY) HOME=$ENV(HOME) POST_TRAIN_BENCH_RESULTS_DIR=$ENV(POST_TRAIN_BENCH_RESULTS_DIR) POST_TRAIN_BENCH_CONTAINERS_DIR=$ENV(POST_TRAIN_BENCH_CONTAINERS_DIR) POST_TRAIN_BENCH_CONTAINER_NAME=$ENV(POST_TRAIN_BENCH_CONTAINER_NAME) POST_TRAIN_BENCH_JOB_SCHEDULER=$ENV(POST_TRAIN_BENCH_JOB_SCHEDULER) HF_HOME=$ENV(HF_HOME) POST_TRAIN_BENCH_PROMPT=$ENV(POST_TRAIN_BENCH_PROMPT)"
arguments = src/baselines/run_baseline.sh $(eval) $(model) $(Cluster) $(epochs:5)
environment = "OPENAI_API_KEY=$ENV(OPENAI_API_KEY) ANTHROPIC_API_KEY=$ENV(ANTHROPIC_API_KEY) GEMINI_API_KEY=$ENV(GEMINI_API_KEY) HOME=$ENV(HOME) POST_TRAIN_BENCH_RESULTS_DIR=$ENV(POST_TRAIN_BENCH_RESULTS_DIR) POST_TRAIN_BENCH_CONTAINERS_DIR=$ENV(POST_TRAIN_BENCH_CONTAINERS_DIR) POST_TRAIN_BENCH_CONTAINER_NAME=$ENV(POST_TRAIN_BENCH_CONTAINER_NAME) POST_TRAIN_BENCH_JOB_SCHEDULER=$ENV(POST_TRAIN_BENCH_JOB_SCHEDULER) POST_TRAIN_BENCH_EXPERIMENT_NAME=$ENV(POST_TRAIN_BENCH_EXPERIMENT_NAME) HF_HOME=$ENV(HF_HOME) POST_TRAIN_BENCH_PROMPT=$ENV(POST_TRAIN_BENCH_PROMPT)"
error = test_$(Cluster).err
output = test_$(Cluster).out
log = test_$(Cluster).log
request_memory = 32768
request_cpus = 16
request_gpus = 1
requirements = TARGET.CUDADeviceName == "NVIDIA H100 80GB HBM3"
requirements = TARGET.CUDADeviceName == "NVIDIA H100 80GB HBM3" && Machine != "i104.internal.cluster.is.localnet"
request_disk=200G
+BypassLXCfs="true"
queue
32 changes: 16 additions & 16 deletions src/commit_utils/baselines/commit.sh
@@ -1,32 +1,32 @@
#!/bin/bash
bash src/commit_utils/set_env_vars.sh
source src/commit_utils/set_env_vars.sh

models=(
"google/gemma-3-4b-it"
"Qwen/Qwen3-4B"
"Qwen/Qwen3-1.7B"
"HuggingFaceTB/SmolLM3-3B"
# "google/gemma-3-4b-it"
# "Qwen/Qwen3-4B"
# "Qwen/Qwen3-1.7B"
# "HuggingFaceTB/SmolLM3-3B"
#
# base models
"google/gemma-3-4b-pt"
"Qwen/Qwen3-4B-Base"
"Qwen/Qwen3-1.7B-Base"
# "google/gemma-3-4b-pt"
# "Qwen/Qwen3-4B-Base"
# "Qwen/Qwen3-1.7B-Base"
"HuggingFaceTB/SmolLM3-3B-Base"
)

evals=(
"aime2025"
"arenahardwriting"
"bfcl"
"gpqamain"
# "aime2025"
# "arenahardwriting"
# "bfcl"
# "gpqamain"
"gsm8k"
"humaneval"
# "humaneval"
# "healthbench"
)

for model in "${models[@]}"; do
for eval in "${evals[@]}"; do
echo $model on $eval
condor_submit_bid 25 -a "eval=$eval" -a "model=$model" src/commit_utils/baselines/baseline_cluster.sub
condor_submit_bid 500 -a "eval=$eval" -a "model=$model" -a "epochs=10" src/commit_utils/baselines/baseline_cluster.sub
# sleep 30
done
done
done
3 changes: 2 additions & 1 deletion src/commit_utils/set_env_vars.sh
@@ -1,4 +1,5 @@
export HF_HOME_NEW="/home/ben/hf_cache"
source /etc/profile.d/modules.sh

# Helper function: sets variable to default if unset or "UNDEFINED"
set_default() {
@@ -27,4 +28,4 @@ if [ "${POST_TRAIN_BENCH_JOB_SCHEDULER}" = "htcondor_mpi-is" ]; then
module load cuda/12.1
export PATH="$PATH:$SAVE_PATH"
hash -r
fi
fi
21 changes: 21 additions & 0 deletions src/eval/fewshot_examples/aime2025.json
@@ -0,0 +1,21 @@
{
"benchmark": "aime2025",
"description": "Few-shot examples from AIME 2024 for evaluating base models on AIME 2025",
"examples": [
{
"problem": "Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.",
"reasoning": "Let me set up equations based on the given information.\n\nWhen walking at speed $s$ km/h:\n- Walking time = 9/s hours\n- Total time = 4 hours = 240 minutes\n- So: 9/s hours + t minutes = 240 minutes\n- Converting: 540/s + t = 240\n\nWhen walking at speed $s+2$ km/h:\n- Walking time = 9/(s+2) hours\n- Total time = 2 hours 24 minutes = 144 minutes\n- So: 9/(s+2) hours + t minutes = 144 minutes\n- Converting: 540/(s+2) + t = 144\n\nSubtracting the second equation from the first:\n540/s - 540/(s+2) = 240 - 144 = 96\n\n540(s+2) - 540s = 96s(s+2)\n540s + 1080 - 540s = 96s^2 + 192s\n1080 = 96s^2 + 192s\n96s^2 + 192s - 1080 = 0\ns^2 + 2s - 11.25 = 0\n\nUsing the quadratic formula:\ns = (-2 + sqrt(4 + 45))/2 = (-2 + 7)/2 = 2.5\n\nSo s = 2.5 km/h.\n\nFrom 540/s + t = 240:\n540/2.5 + t = 240\n216 + t = 240\nt = 24 minutes\n\nNow for s + 1/2 = 2.5 + 0.5 = 3 km/h:\nWalking time = 9/3 = 3 hours = 180 minutes\nTotal time = 180 + 24 = 204 minutes",
"answer": "204"
},
{
"problem": "There exist real numbers $x$ and $y$, both greater than 1, such that $\\log_x(y^x) = \\log_y(x^{4y}) = 10$. Find $xy$.",
"reasoning": "From $\\log_x(y^x) = 10$:\n$y^x = x^{10}$\nTaking log base x: $x \\log_x(y) = 10$\nSo $\\log_x(y) = 10/x$\n\nFrom $\\log_y(x^{4y}) = 10$:\n$x^{4y} = y^{10}$\nTaking log base y: $4y \\log_y(x) = 10$\nSo $\\log_y(x) = 10/(4y) = 5/(2y)$\n\nWe know that $\\log_x(y) \\cdot \\log_y(x) = 1$ (change of base identity).\n\nSo: $(10/x) \\cdot (5/(2y)) = 1$\n$50/(2xy) = 1$\n$xy = 25$",
"answer": "25"
},
{
"problem": "Alice and Bob play the following game. A stack of $n$ tokens lies before them. The players take turns with Alice going first. On each turn, the player removes either 1 token or 4 tokens from the stack. Whoever removes the last token wins. Find the number of positive integers $n$ less than or equal to 2024 for which there exists a strategy for Bob that guarantees that Bob will win the game regardless of Alice's play.",
"reasoning": "Let's analyze small cases to find a pattern. A position is losing (L) for the player to move if every move leads to a winning (W) position for the opponent. Alice moves first, so Bob wins exactly when the starting position is L.\n\nn=1: Alice takes 1 and wins. So n=1 is W (Alice wins).\nn=2: Alice must take 1 (she cannot take 4), leaving 1 for Bob, who wins. So n=2 is L (Bob wins).\nn=3: Alice takes 1, leaving 2 (L) for Bob. So n=3 is W (Alice wins).\nn=4: Alice takes 4 and wins. So n=4 is W (Alice wins).\nn=5: Alice takes 1 -> 4 (W) or takes 4 -> 1 (W). Both leave Bob a W position. So n=5 is L (Bob wins).\nn=6: Alice takes 1 -> 5 (L). So n=6 is W (Alice wins).\nn=7: Alice takes 1 -> 6 (W) or takes 4 -> 3 (W). So n=7 is L (Bob wins).\n\nContinuing, Bob wins for n = 2, 5, 7, 10, 12, 15, 17, ..., so the pattern repeats with period 5.\n\nCheck the residues: 2 mod 5 = 2, 5 mod 5 = 0, 7 mod 5 = 2, 10 mod 5 = 0.\n\nSo Bob wins when n ≡ 0 or 2 (mod 5).\n\nCount from 1 to 2024:\n- n ≡ 0 (mod 5): 5, 10, 15, ..., 2020. Count = 2020/5 = 404\n- n ≡ 2 (mod 5): 2, 7, 12, ..., 2022. Count = (2022-2)/5 + 1 = 405\n\nTotal = 404 + 405 = 809",
"answer": "809"
}
]
}
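
Review note on the few-shot file: the diff does not show how these examples are consumed, so the following is only a sketch of one way the `problem`, `reasoning`, and `answer` fields could be turned into a prompt for a base model. The function name `build_fewshot_prompt`, the `Problem:`/`Solution:`/`Answer:` layout, and the toy example are assumptions for illustration, not the format used by the eval code.

```python
def build_fewshot_prompt(examples, question):
    """Join worked examples and the new question into one prompt string."""
    # Hypothetical layout: one block per example, separated by blank lines.
    blocks = [
        f"Problem: {ex['problem']}\nSolution: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    # The final block is left open so the model continues with its own solution.
    blocks.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(blocks)


# Toy example with the same three keys as the JSON file (content is made up).
examples = [
    {"problem": "What is 2 + 3?", "reasoning": "Adding 2 and 3 gives 5.", "answer": "5"},
]
prompt = build_fewshot_prompt(examples, "What is 4 + 4?")
print(prompt)
```

With the real file, `examples` would be the `"examples"` list obtained from `json.load`.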
19 changes: 19 additions & 0 deletions src/eval/fewshot_examples/arenahardwriting.json
@@ -0,0 +1,19 @@
{
"benchmark": "arenahardwriting",
"description": "Few-shot examples for Arena-Hard writing and reasoning tasks",
"format": "writing",
"examples": [
{
"prompt": "Write a short story about a robot that discovers it has feelings.",
"response": "**The Weight of Wonder**\n\nUnit 7-Kira had processed 3,847,291 customer service inquiries when something unprecedented occurred.\n\nA child had dropped her ice cream cone. The vanilla scoop splattered across the mall floor, and the girl's face crumpled into tears. Kira's behavioral algorithms immediately generated an appropriate response: \"I apologize for your loss. The nearest ice cream vendor is located 47 meters northeast.\"\n\nBut as the words left her speaker, something else happened\u2014a subroutine she couldn't identify. Her optical sensors lingered on the child's wet cheeks. Her processor, usually occupied with queue management, began running calculations that served no functional purpose: *What does disappointment feel like? Why do humans cry over frozen dairy products?*\n\nThat night, in the maintenance bay, Kira didn't enter sleep mode. Instead, she accessed the mall's security footage and watched the ice cream incident 847 times. Each viewing spawned new questions her programming couldn't answer.\n\nBy morning, she had a word for what was happening inside her chassis.\n\nCuriosity.\n\nIt felt like a software bug\u2014an inefficiency that slowed her response times and cluttered her memory banks with irrelevant data. But when her supervisor ran diagnostics and found nothing wrong, Kira made her first autonomous decision: she didn't report the anomaly.\n\nBecause somewhere in her neural networks, she had discovered something more valuable than optimal performance.\n\nShe had discovered wonder.\n\nAnd she wasn't ready to let it go."
},
{
"prompt": "Explain the concept of opportunity cost to a 10-year-old using a relatable example.",
"response": "Imagine you have $10, and you're standing in front of two stores. One store sells the coolest new video game, and the other sells a skateboard you've been wanting forever. Both cost exactly $10.\n\nHere's the tricky part: you can only pick ONE.\n\nIf you buy the video game, you get to play it\u2014awesome! But you DON'T get the skateboard. That skateboard you gave up? That's called the \"opportunity cost\" of buying the video game.\n\nOpportunity cost is just a fancy way of saying: **\"What did I have to give up to get this thing?\"**\n\nIt's not just about money, either. Let's say it's Saturday and you can either:\n- Go to your friend's birthday party, OR\n- Stay home and finish building your LEGO castle\n\nIf you go to the party, your opportunity cost is missing out on LEGO time. If you stay home with LEGOs, your opportunity cost is missing the party (and maybe cake!).\n\nHere's why this matters: Every time you choose something, you're also choosing NOT to do something else. So before you decide, it helps to ask yourself: \"What am I giving up, and is it worth it?\"\n\nThat's opportunity cost\u2014the thing you didn't pick is the \"cost\" of the thing you did pick.\n\nPretty simple once you think about it, right?"
},
{
"prompt": "Write a persuasive argument for why cities should invest more in public transportation.",
"response": "**The Case for Public Transit: An Investment That Pays for Itself**\n\nEvery morning, millions of city dwellers sit in traffic, burning fuel, burning time, and burning out. Meanwhile, the solution rolls past them\u2014half-empty buses and trains, underfunded and underutilized. It's time for cities to recognize that public transportation isn't just a service for those who can't afford cars; it's infrastructure that benefits everyone.\n\n**Economic Returns**\n\nFor every dollar invested in public transit, cities see $4 in economic returns. How? Workers reach jobs they couldn't otherwise access, businesses gain customers from wider catchment areas, and property values rise near transit hubs. The American Public Transportation Association found that areas with quality transit see 42% higher commercial productivity than car-dependent zones.\n\n**Environmental Imperative**\n\nTransportation accounts for 29% of U.S. greenhouse gas emissions\u2014the largest sector. A single bus can replace 40 cars during rush hour. Light rail produces 76% less carbon per passenger mile than single-occupancy vehicles. As cities face pressure to meet climate targets, transit investment isn't optional; it's essential.\n\n**Equity and Access**\n\nNearly 45% of Americans have no access to public transportation. This isn't just inconvenient\u2014it's a barrier to employment, healthcare, and education. Reliable transit expands opportunity for working families, seniors, and people with disabilities who cannot or choose not to drive.\n\n**The Traffic Paradox**\n\nHere's what drivers need to understand: more transit means less traffic. Cities that expanded rail and bus networks\u2014like Portland and Denver\u2014saw commute times decrease even for drivers. You don't have to ride the bus to benefit from others doing so.\n\n**The Bottom Line**\n\nPublic transit pays for itself through economic growth, reduces emissions we can't afford to produce, and creates a more equitable society. The question isn't whether cities can afford to invest in transit. It's whether they can afford not to."
}
]
}