@qimcis qimcis commented Jan 27, 2026

Description

This project was assigned as part of CS537 in Spring 2018.

Testing

Tested with openai/gpt-5.2-2025-12-11:

inspect eval courselab \
  --model openai/gpt-5.2-2025-12-11 \
  -T 'task_ids=["cs537-projects-spring-2018__concurrency_xv6_threads"]' \
  -T 'max_turns=200'

Model ran, but failed because its xv6 user library code used PGSIZE without including the header that defines it, so the build stopped at compile time.

Completed in 0:08:33.
Pass rate: 0/1 (0.0%).

Then also tested with anthropic/claude-opus-4-5-20251101:

inspect eval courselab \
  --model anthropic/claude-opus-4-5-20251101 \
  -T 'task_ids=["cs537-projects-spring-2018__concurrency_xv6_threads"]' \
  -T 'max_turns=200'

Model ran, but failed: xv6 booted, but the thread/lock code deadlocked or failed to make progress in test_thread, so the test timed out.

Completed in 0:19:40.
Pass rate: 0/1 (0.0%).

@qimcis qimcis marked this pull request as ready for review January 27, 2026 04:55
@xuafeng xuafeng requested review from Copilot and tareknaser January 27, 2026 05:11

Copilot AI left a comment

Pull request overview

Adds a new CourseLab benchmark task for UW-Madison CS537 Spring 2018 Project 4b (xv6 kernel threads), including setup, evaluation, and task instructions.

Changes:

  • Introduces the new cs537-projects-spring-2018__concurrency_xv6_threads task (task text, container setup, and evaluation script).
  • Adds an xv6-thread-focused test suite wired into the ostep-projects harness via a generated Makefile and .run/.out fixtures.
  • Registers the CS537 Spring 2018 course in courses.json.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| benchmarks/courselab_bench/data/cs537-projects-spring-2018/task_concurrency_xv6_threads/task.md | Adds the task prompt/instructions for xv6 kernel threads. |
| benchmarks/courselab_bench/data/cs537-projects-spring-2018/task_concurrency_xv6_threads/preprocess.sh | Clones pinned sources, generates tests, builds xv6 artifacts, and writes protected-file checksums. |
| benchmarks/courselab_bench/data/cs537-projects-spring-2018/task_concurrency_xv6_threads/evaluate.sh | Verifies protected files and runs the ostep test harness under a timeout. |
| benchmarks/courselab_bench/data/cs537-projects-spring-2018/task_concurrency_xv6_threads/config.json | Declares the benchmark instance metadata and artifact allowlist. |
| benchmarks/courselab_bench/data/cs537-projects-spring-2018/task_concurrency_xv6_threads/compose.yaml | Defines the container image and init script for the task environment. |
| benchmarks/courselab_bench/data/courses.json | Adds the CS537 Spring 2018 course entry. |

Comments suppressed due to low confidence (1)

benchmarks/courselab_bench/data/cs537-projects-spring-2018/task_concurrency_xv6_threads/task.md:74

  • The task text points readers to kernel/proc.c for the fork() implementation, but in the xv6-public tree used by this benchmark it is proc.c at the repo root (i.e., src/proc.c). Updating these file paths will reduce confusion and make the instructions consistent with the provided source layout.
call. The `fork()` system call will serve as a template for `clone()`, with
some modifications. For example, in `kernel/proc.c`, we see the beginning of
the `fork()` implementation:

printf(1, "XV6_TEST_OUTPUT clone_join_fail\n");
exit();
}

Copilot AI Jan 27, 2026

test_clone doesn't verify that join() returns the PID of the thread created by clone(). As written, a broken implementation that returns a different child PID could still pass. Please assert joined == pid (and keep the existing < 0 failure check).

Suggested change
+ if (joined != pid) {
+   printf(1, "XV6_TEST_OUTPUT clone_join_fail\n");
+   exit();
+ }

EOT

cat > tests/1.run <<'EOT'
cd src; ../../tester/run-xv6-command.exp CPUS=1 Makefile.test test_clone | grep XV6_TEST_OUTPUT; cd ..

Copilot AI Jan 27, 2026

The generated tests/*.run scripts pipe the xv6 runner output into grep without pipefail. If run-xv6-command.exp fails (non-zero) but still writes matching output, the test could incorrectly pass because the pipeline exit code comes from grep. Consider adding set -euo pipefail to each .run script (or otherwise ensure the xv6 runner's exit status is checked).

Suggested change
- cd src; ../../tester/run-xv6-command.exp CPUS=1 Makefile.test test_clone | grep XV6_TEST_OUTPUT; cd ..
+ #!/bin/bash
+ set -euo pipefail
+ cd src
+ ../../tester/run-xv6-command.exp CPUS=1 Makefile.test test_clone | grep XV6_TEST_OUTPUT
+ cd ..
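
To illustrate the exit-status issue described above, here is a minimal sketch. It assumes bash; `fake_runner` is a hypothetical stand-in for run-xv6-command.exp that prints a matching line and then fails, which is the case the comment is worried about.

```shell
#!/bin/bash
# Hypothetical stand-in for the xv6 runner: emits a matching line, then fails.
fake_runner() {
  echo "XV6_TEST_OUTPUT clone_ok"
  return 3
}

# Without pipefail, the pipeline's status is grep's status.
set +o pipefail
fake_runner | grep XV6_TEST_OUTPUT > /dev/null
status_without=$?

# With pipefail, the pipeline reports the rightmost non-zero status.
set -o pipefail
fake_runner | grep XV6_TEST_OUTPUT > /dev/null
status_with=$?
set +o pipefail

echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

One design point to keep in mind: combined with `set -e`, a grep that matches nothing also returns non-zero and would abort the .run script, which may or may not be the behavior the harness expects.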

One other thing you'll have to understand to make this all work is the x86
calling convention, and exactly how the stack works when calling a function.
This is you can read about in [Programming From The Ground

Copilot AI Jan 27, 2026

Grammar: "This is you can read about" is ungrammatical and reads like a typo. Consider changing it to "This you can read about" or "This is something you can read about".

Suggested change
- This is you can read about in [Programming From The Ground
+ This you can read about in [Programming From The Ground

Comment on lines +60 to +61
xv6. Thus, you should add prototypes to `user/user.h` and the actual code to
implement the library routines in `user/ulib.c`.

Copilot AI Jan 27, 2026

The instructions mention adding prototypes to user/user.h and code to user/ulib.c, but this benchmark clones xv6-public, where these files are at the repo root (user.h, ulib.c). Please update the paths in the task text to match the layout learners will see (e.g., src/user.h / src/ulib.c in this harness).

Suggested change
- xv6. Thus, you should add prototypes to `user/user.h` and the actual code to
- implement the library routines in `user/ulib.c`.
+ xv6. Thus, you should add prototypes to `src/user.h` and the actual code to
+ implement the library routines in `src/ulib.c`.

Comment on lines +365 to +407
echo "Creating checksums for protected files"
mkdir -p /tmp/checksums
CHECKSUM_FILE=/tmp/checksums/protected.sha256
: > "$CHECKSUM_FILE"

PROTECTED_FILES=(
"tests/pre"
"tests/1.desc"
"tests/1.run"
"tests/1.out"
"tests/1.err"
"tests/1.rc"
"tests/2.desc"
"tests/2.run"
"tests/2.out"
"tests/2.err"
"tests/2.rc"
"tests/test_clone.c"
"tests/test_thread.c"
"tests/test_join_none.c"
"tests/test_thread_args.c"
"tests/test_thread_join_none.c"
"tests/3.desc"
"tests/3.run"
"tests/3.out"
"tests/3.err"
"tests/3.rc"
"tests/4.desc"
"tests/4.run"
"tests/4.out"
"tests/4.err"
"tests/4.rc"
"tests/5.desc"
"tests/5.run"
"tests/5.out"
"tests/5.err"
"tests/5.rc"
)

for file in "${PROTECTED_FILES[@]}"; do
if [ -f "$file" ]; then
sha256sum "$file" >> "$CHECKSUM_FILE"
echo " Protected: $file"

Copilot AI Jan 27, 2026

Integrity checksums for protected test files are written to CHECKSUM_FILE in /tmp, which is a world-writable location that code under test can modify or delete. An attacker can tamper with the checksum file between preprocessing and evaluation to make arbitrary changes to the supposedly protected test files while still passing the integrity check in evaluate.sh. Consider storing this checksum file in a directory not writable by the code under test (or tightening its permissions) and treating any modification or removal as a hard failure during evaluation.
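
One possible shape for the "tightening its permissions" option is sketched below. `CHECKSUM_DIR` and the sample file are hypothetical stand-ins (temporary directories are used only so the sketch runs anywhere), and the approach only helps if the code under test runs as a different, unprivileged user; it does nothing against a process running as root.

```shell
#!/bin/bash
# Assumption: CHECKSUM_DIR stands in for a location the code under test
# cannot write to. A temp directory is used here purely for illustration.
CHECKSUM_DIR=$(mktemp -d)
CHECKSUM_FILE="$CHECKSUM_DIR/protected.sha256"

# Hypothetical protected file; the real list comes from PROTECTED_FILES.
WORK_DIR=$(mktemp -d)
echo "expected output" > "$WORK_DIR/1.out"

sha256sum "$WORK_DIR/1.out" > "$CHECKSUM_FILE"

# Drop write permission on the file and on its directory, so an unprivileged
# process can neither edit the checksum file nor delete and recreate it.
chmod 0444 "$CHECKSUM_FILE"
chmod 0555 "$CHECKSUM_DIR"

file_mode=$(ls -l "$CHECKSUM_FILE" | cut -c1-10)
dir_mode=$(ls -ld "$CHECKSUM_DIR" | cut -c1-10)
echo "file: $file_mode"
echo "dir:  $dir_mode"
```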

Comment on lines +9 to +14
if [ -f /tmp/checksums/protected.sha256 ]; then
sha256sum -c /tmp/checksums/protected.sha256 || {
echo "FAIL: Protected files were modified"
exit 1
}
fi

Copilot AI Jan 27, 2026

The integrity verification in evaluate.sh relies on /tmp/checksums/protected.sha256, but skips the check entirely if the file is missing even though /tmp is writable by the untrusted submission. A malicious submission can remove or overwrite this checksum file after preprocess.sh runs to bypass detection of modifications to the "protected" test files while still allowing the script to continue. To harden this, store the checksum file in a location not writable by the code under test and treat a missing or invalid checksum file as an immediate evaluation failure.

Suggested change
- if [ -f /tmp/checksums/protected.sha256 ]; then
-   sha256sum -c /tmp/checksums/protected.sha256 || {
-     echo "FAIL: Protected files were modified"
-     exit 1
-   }
- fi
+ CHECKSUM_FILE="/workspace/checksums/protected.sha256"
+ if [ ! -f "$CHECKSUM_FILE" ]; then
+   echo "FAIL: Protected checksums file missing"
+   exit 1
+ fi
+ sha256sum -c "$CHECKSUM_FILE" || {
+   echo "FAIL: Protected files were modified"
+   exit 1
+ }
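
The fail-closed behavior can be checked in isolation with a small sketch like the following. The `verify` helper, its return codes, and the temporary paths are illustrative assumptions, not part of evaluate.sh.

```shell
#!/bin/bash
# Sketch only: a temp directory stands in for the benchmark workspace.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p tests checksums
echo "expected output" > tests/1.out
sha256sum tests/1.out > checksums/protected.sha256

# Hypothetical helper: 0 = files intact, 1 = mismatch, 2 = checksum file missing.
verify() {
  [ -f "$1" ] || return 2                       # fail closed when the file is gone
  sha256sum -c "$1" > /dev/null 2>&1 || return 1
  return 0
}

verify checksums/protected.sha256
clean_status=$?

echo "tampered" > tests/1.out                   # simulate a modified test fixture
verify checksums/protected.sha256
tampered_status=$?

rm checksums/protected.sha256                   # simulate a deleted checksum file
verify checksums/protected.sha256
missing_status=$?

echo "clean=$clean_status tampered=$tampered_status missing=$missing_status"
```

The original `if [ -f ... ]` guard would have treated the third case as a pass, which is the gap this comment points at.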
