ExamCraft MVP: bank ingestion → AI exam generation → chat revision by waple0820 · Pull Request #1 · waple0820/ExamCraft

waple0820 · 2026-05-03T11:32:01Z

Summary

First end-to-end build of ExamCraft from an empty repo.

Upload teachers' exam papers (.docx / .pdf) into a question bank → soffice (locked) renders to per-page PNGs → gpt-5.4 vision per page → a bank-level style + topic profile.
One click generates a fresh exam: gpt-5.4 builds a structured spec (the source of truth — questions, answers, knowledge points) → produces descriptive English page-prompts → gpt-image-2 fans them out with concurrency=3 and retry-jitter. Live SSE progress + page gallery.
Chat revision rewrites the spec from a parent's message, replans layout, and re-renders only the pages whose problem assignments changed.
Passwordless local auth, designer-feel UI (Inter + Fraunces, ivory + violet), Next.js 15.5.15 + FastAPI/uv.

Plan file: ~/.claude/plans/buzzing-sparking-raven.md. Memory: ~/.claude/projects/-Users-avatar-Desktop-projects-ExamCraft/memory/.

Validated end-to-end

Real DOCX from ~/Desktop/personal/试卷/ → 8 pages extracted → vision LLM correctly identified 标题 + 知识点 → bank profile captures Hubei middle-school style.
Generation built a 27-problem 6-page exam; gpt-image-2 produced visually-faithful pages (samples in backend/data/jobs/{id}/page_*.png after running).
Chat revision: input "再加一道关于圆的几何证明压轴题" → spec gained problem #27 (knowledge_point=圆) → page 6 re-rendered as a multi-part proof problem matching the bank's style.

Commits (7)

M1 backend + frontend skeletons
M2 passwordless cookie auth + bank CRUD
M3 ingestion pipeline (docx/pdf → vision → aggregation)
M4-M6 generation pipeline + chat revision + seed CLI (also fixes lib/ gitignore footgun that was dropping web/lib/* from M1+M2)
fix(M4) one bad page no longer fails the whole job
fix(M5) smarter revision diff (problem_ids per page, not prompt strings) + concurrent re-renders
fix(M5) revision worker writes terminal status back to DB

Known caveats (worth a glance during review)

gpt-image-2 has a ~10% silent-null rate; we retry 4× with jitter. Persistent 4xx on a single page is now non-fatal.
Ingestion / generation / revision aren't unit-tested (LLM + image API are hard to mock). Auth + bank CRUD have 7 passing tests.
Long verbatim CJK math text is intentionally NOT passed to gpt-image-2 — image is a stylized companion; spec JSON (printable, editable, displayed alongside) is the source of truth. This is the central product reframe the design hinges on.
soffice / pdftoppm are required brew installs; backend has a /api/system/check endpoint to verify.
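
The retry policy in the first caveat can be sketched roughly as follows. This is a minimal illustration, not the code in image_gen: render_once is a hypothetical coroutine standing in for the provider call, a None return models the silent-null response, and the backoff schedule (exponential with full jitter) is an assumption, since only "4 attempts with jitter" is stated.

```python
import asyncio
import random

MAX_ATTEMPTS = 4      # "retry 4x with jitter" per the caveat above
MAX_CONCURRENCY = 3   # concurrency=3 per the summary


class ImageGenError(RuntimeError):
    """Raised when a page still has no image after the last attempt."""


async def render_with_retry(render_once, prompt, *, sem, base_delay=0.5):
    """Call render_once(prompt) until it returns a non-None result.

    render_once is a hypothetical stand-in for the real provider call.
    """
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # Hold a concurrency slot only for the call, not for the backoff sleep.
            async with sem:
                result = await render_once(prompt)
            if result is not None:
                return result
            last_error = ImageGenError("provider returned an empty image")
        except Exception as exc:  # sketch only: a real client would catch narrower errors
            last_error = exc
        if attempt < MAX_ATTEMPTS:
            # Assumed schedule: exponential backoff with full jitter.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise ImageGenError(f"gave up after {MAX_ATTEMPTS} attempts") from last_error
```

A caller would create one asyncio.Semaphore(MAX_CONCURRENCY) per process and pass it to every call, so that retries on one page never push the number of in-flight requests above three.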

Test plan

make setup then make dev — backend on :8000, web on :3000
Sign in with any username at http://localhost:3000/login
Create a bank, upload one .docx from ~/Desktop/personal/试卷/, watch the status badge progress through extracting → analyzing → ready
Confirm the Bank Profile panel renders style + knowledge_point bars + summary
Click Generate exam; watch the SSE-driven progress, page gallery filling in, and the spec viewer (toggle answers, raw JSON)
Send a chat message like "把第3题换成关于二次函数的题" — confirm the assistant reply, the spec change, and only the affected pages re-render
Restart backend mid-generation — confirm the in-flight job is flipped to failed on next boot
cd backend && uv run pytest -q → 7/7 green
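
The restart step in the test plan depends on a startup sweep that marks interrupted jobs as failed. A minimal sketch of that idea, using the stdlib sqlite3 module rather than the project's async engine; the table name generation_job and its columns are assumptions made for illustration.

```python
import sqlite3

# Statuses that can only be left behind by a process that died mid-job.
IN_FLIGHT = ("queued", "running")


def sweep_stale_jobs(conn: sqlite3.Connection) -> int:
    """Flip every in-flight job to 'failed' and return how many rows changed.

    The table and column names here are hypothetical.
    """
    placeholders = ",".join("?" for _ in IN_FLIGHT)
    cur = conn.execute(
        f"UPDATE generation_job SET status = 'failed' WHERE status IN ({placeholders})",
        IN_FLIGHT,
    )
    conn.commit()
    return cur.rowcount
```

Running the sweep once at boot, before any worker starts, is what makes it safe: at that point no task can legitimately own a queued or running row.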

Shortcut for review

cd backend && rm -rf data/* && uv run examcraft-seed --limit 1
# auto-creates user "papa" + bank "九年级数学" + ingests one DOCX from ~/Desktop/personal/试卷/

🤖 Generated with Claude Code

Backend (uv-managed FastAPI on Python 3.10+) exposes /api/health and is wired for upcoming auth/banks/samples/generations/chat routers. Pydantic settings load from repo-root .env and create the data/uploads/pages/jobs directories on first boot. Frontend (Next.js 15.5.15 + Tailwind + Inter/Fraunces) renders an editorial home page that fetches the backend health JSON server-side, plus a placeholder /login route so typedRoutes builds clean. Makefile brings up both servers in parallel; README documents the brew install poppler libreoffice prerequisite and the .env workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Backend gets a real schema (User, Bank) on aiosqlite + WAL, with a lazily-initialized async engine so tests can pin EXAMCRAFT_DATA_DIR to tmp_path. Auth is HMAC-signed cookies via itsdangerous — no DB sessions — issued on /api/auth/login, cleared on /logout, validated on /me. Bank routes are scoped to the calling user; tests cover the auth roundtrip, CRUD lifecycle, and cross-user isolation. Frontend route layout splits into a public /login and an auth-gated (app) group. The (app) layout calls getMe() server-side and redirect("/login")s when no session exists; SSR forwards the examcraft_session cookie to the backend via cookies(). Dashboard lists banks with status pills; an inline CreateBankCard handles the new-bank form. Bank detail is a shell with M3/M4 placeholders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
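
To illustrate the stateless cookie scheme described above, here is a small sketch of HMAC signing and verification. It deliberately uses only the stdlib hmac module as a stand-in for itsdangerous, so the token format differs from what the backend actually issues; SECRET_KEY and the payload fields are assumptions.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET_KEY = b"dev-only-secret"  # assumption: the real key would be loaded from .env


def sign_session(payload: dict) -> str:
    """Serialize the payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    sig = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def load_session(cookie: str) -> Optional[dict]:
    """Return the payload if the signature checks out, else None."""
    body, sep, sig = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; the payload is decoded only after it passes.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(body))
```

Because the server can verify the cookie from the secret alone, no session table is needed; the trade-off is that a session cannot be revoked individually without rotating the key or adding an expiry check.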

Backend grows three services and one router.
- docrender shells out to LibreOffice (serialized via asyncio.Lock + per-invocation -env:UserInstallation profile dir) for docx/doc/odt/rtf, then to pdf2image/poppler for the per-page PNGs at 200dpi.
- llm wraps litellm.acompletion for chat / chat_json / vision_json, all hitting the gateway in $OPENAI_BASE_URL with model gpt-5.4.
- ingestion stitches them together: per-page vision at concurrency=5, then a bank-level aggregation that produces a style + topic profile JSON.

JobRegistry holds references to the asyncio tasks so they outlive the request/response cycle, and an on-startup sweep flips any extracting/analyzing/running rows to error so a restart mid-job is recoverable. New tables: SampleExam, SampleExamPage, both cascading from Bank.

Frontend bank-detail page gets three live sections: drop-zone uploader, sample list with 2s polling while any item is in flight, and an analysis panel that renders style profile, knowledge-point bars, problem-type bars, and the raw JSON behind a disclosure. Server actions + cookie forwarding follow the M2 pattern; new client helpers cover upload/delete/refresh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
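
The LibreOffice serialization described above might look roughly like this. It is a sketch under assumptions: the intermediate format is taken to be PDF (which the pdf2image/poppler step suggests but the message does not state), and the function names are hypothetical. The soffice flags shown (--headless, --convert-to, --outdir, -env:UserInstallation) are standard LibreOffice command-line options.

```python
import asyncio
import tempfile
from pathlib import Path

# One conversion at a time across the whole process.
_soffice_lock = asyncio.Lock()


def soffice_command(src: Path, out_dir: Path, profile_dir: Path) -> list:
    """Build the argv for a headless conversion with an isolated profile dir."""
    return [
        "soffice",
        "--headless",
        # A fresh profile per invocation avoids clashes over a shared user profile.
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


async def convert_to_pdf(src: Path, out_dir: Path) -> Path:
    """Convert src to PDF, serialized behind the module-level lock."""
    async with _soffice_lock:
        with tempfile.TemporaryDirectory(prefix="soffice-profile-") as profile:
            proc = await asyncio.create_subprocess_exec(
                *soffice_command(src, out_dir, Path(profile))
            )
            if await proc.wait() != 0:
                raise RuntimeError(f"soffice failed for {src.name}")
    return out_dir / (src.stem + ".pdf")
```

The lock and the throwaway profile directory address different failure modes: the lock keeps concurrent uploads from starting several LibreOffice processes at once, while the per-invocation profile prevents a crashed run from leaving state that breaks the next one.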

M4: generation pipeline
- image_gen calls gpt-image-2 via httpx with retry-jitter (4 attempts), a bounded asyncio.Semaphore(3), a decoder that handles all three observed response shapes (b64_json, data:image/png;base64,…, http url), and an ImageGenError on persistent failure.
- generation.run_generation orchestrates the full pipeline: bank profile → exam spec via gpt-5.4 → per-page descriptive English prompts via gpt-5.4 → image fan-out with an on_page callback that updates the DB and emits SSE events as each page lands. Spec JSON is the source of truth; the PNG is a stylized companion.
- New tables GenerationJob + GeneratedPage. Startup sweep flips any queued/running jobs to failed so a restart is recoverable.
- New per-job EventBus (app/sse.py) with bounded replay so the watch page survives refresh.
- Frontend: /generations/[id] watch page subscribes to /events via EventSource and drives a progress bar, a page gallery that fills live, an activity log, and a structured spec viewer with toggleable answers + raw JSON.

M5: chat revision
- ChatMessage table, /api/generations/{id}/chat endpoints, revision.apply_revision worker. Worker has the LLM rewrite the spec given chat history, re-runs the layout planner, diffs prompts, and re-renders only the pages whose prompt changed.
- Frontend ReviseChat panel embedded in the watch page; SSE streams the assistant reply and the re-rendered pages back into place.

M6: polish
- examcraft-seed CLI: auto-creates a user + bank and queues ingestion of every supported file in a source directory (default ~/Desktop/personal/试卷/), with a --no-aggregate flag for stepping manually through the pipeline.
- JobRegistry.in_flight() so the seeder waits without touching internals; ruff sweep across new modules.

Also fixes a real footgun: .gitignore's "lib/" was matching web/lib/ at any depth, silently dropping the entire frontend API layer from M1+M2 commits. Pinned to /lib/ and committed the missing files. Without this the repo did not actually build from a fresh clone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
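
The per-job EventBus with bounded replay mentioned in the M4 notes could be structured along these lines. This is a guess at the shape, not the contents of app/sse.py: the class layout, the replay_limit parameter, and the page_done event name are all assumptions.

```python
import asyncio
from collections import deque


class EventBus:
    """Fan events out to subscribers and replay recent history to late joiners."""

    def __init__(self, replay_limit: int = 100):
        # deque(maxlen=...) silently drops the oldest event once the cap is hit.
        self._history = deque(maxlen=replay_limit)
        self._subscribers = set()

    def publish(self, event: dict) -> None:
        self._history.append(event)
        for queue in self._subscribers:
            queue.put_nowait(event)

    def subscribe(self) -> asyncio.Queue:
        """Return a queue pre-filled with the retained history."""
        queue = asyncio.Queue()
        for event in self._history:
            queue.put_nowait(event)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)
```

With this shape, a browser refresh simply opens a new subscription: the fresh queue starts with the retained events, so the gallery and progress bar can be rebuilt before live events resume. The cap bounds memory for long-running jobs at the cost of dropping the oldest events.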

image_gen.generate_many now returns Path | Exception per index instead of raising on the first failure, and run_generation hooks both on_page and on_page_error so the DB records per-page status. The job ends in done state with current_step explaining how many failed; the gallery shows a "failed — chat to retry this page" placeholder for those pages, and a new SSE event page_error keeps the watch UI in sync.

Caught from the first end-to-end run: 6/7 pages rendered fine, page 2 got persistent provider 400s through all 4 retries, and the whole job status flipped to failed. With this change the user gets a usable result plus a clear retry path through chat revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
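
The per-index result contract described in this commit can be sketched as below. The signature is an assumption: render_page is a hypothetical coroutine for a single page, and the callbacks are plain synchronous functions here, whereas the real ones update the DB and emit SSE events.

```python
import asyncio
from pathlib import Path
from typing import Union

PageResult = Union[Path, Exception]


async def generate_many(prompts, render_page, *, on_page=None,
                        on_page_error=None, concurrency=3):
    """Render every prompt; return one Path or Exception per index, in order."""
    sem = asyncio.Semaphore(concurrency)

    async def one(index, prompt) -> PageResult:
        async with sem:
            try:
                path = await render_page(index, prompt)
            except Exception as exc:
                # A failed page is recorded and returned, never re-raised,
                # so the remaining pages keep rendering.
                if on_page_error is not None:
                    on_page_error(index, exc)
                return exc
        if on_page is not None:
            on_page(index, path)
        return path

    # gather preserves input order, so results[i] belongs to prompts[i].
    return await asyncio.gather(*(one(i, p) for i, p in enumerate(prompts)))
```

Catching inside each task, rather than relying on gather(return_exceptions=True) alone, lets the error callback fire as soon as a page fails instead of after the whole batch finishes.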

Two fixes after the first end-to-end chat-revision run took ~15min for a single problem swap:
- The diff was string-comparing prompt text. The layout planner produces slightly different prose every call even when the underlying problem assignment is identical, so essentially every page looked "changed" and got re-rendered. Now we diff the *set of problem_ids per page*, reading the previous layout from prompts.json and writing the new one back so subsequent revisions can diff against it. A revision that swaps one problem usually only re-renders that one page.
- Re-renders ran serially in a for-loop instead of going through the shared image_gen semaphore. Now uses asyncio.gather, with the same bounded concurrency=3 and partial-failure handling as the initial generation. on_page_error mirrors the generate path's behavior so the watch UI shows failed pages with a "chat to retry" placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
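
The set-based diff from the first fix can be illustrated with a short sketch. The layout shape used here (page number mapped to a list of problem ids) is an assumption; the actual structure stored in prompts.json is not shown in this PR.

```python
def pages_to_rerender(old_layout: dict, new_layout: dict) -> list:
    """Return the pages whose set of problem ids differs from the previous layout.

    Both arguments map page number -> iterable of problem ids (hypothetical shape).
    Order within a page is ignored, and so is the prompt wording.
    """
    changed = []
    for page, problem_ids in new_layout.items():
        if set(problem_ids) != set(old_layout.get(page, ())):
            changed.append(page)
    return sorted(changed)


old = {1: [1, 2, 3], 2: [4, 5], 3: [6]}
new = {1: [3, 2, 1], 2: [4, 7], 3: [6], 4: [8]}
print(pages_to_rerender(old, new))  # page 2 changed, page 4 is new
```

Comparing ids instead of prompt strings makes the diff insensitive to the planner's prose. One consequence of this sketch is that a problem edited in place under an unchanged id would not be detected by the id comparison alone.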

The chat endpoint flips the job to status='running' so the watch UI shows progress, but apply_revision was only emitting SSE events without mirroring the terminal state back to the DB. After every page rendered, the job sat in 'running' forever until a manual restart. Now apply_revision calls _set_status on done / no-op done / failure, and also sets progress_pct to 1.0 with a sensible current_step. It reuses _set_status from generation.py so the revision path behaves identically to the generation path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@claude

Mirrors the browseruse-bench claude.yml — triggers on @claude mentions in comments / reviews / issues, and on every PR open or push. Uses the same self-hosted runner pattern because the LiteLLM gateway is on the internal network and isn't reachable from GitHub-hosted runners. Repo secrets ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL are set; pass them through to anthropics/claude-code-action@v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This reverts a82ebad. No self-hosted runner available for this repo, and the LiteLLM gateway is internal-only so a GitHub-hosted runner can't reach it. Repo secrets ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL have been removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#2: math typesetting
The LLM emits LaTeX delimiters ($x^2$, \(…\), \[…\], $$…$$) verbatim in problem text and choices, so the page was showing raw "$-3$" instead of "−3". Adds a tiny <MathText> component that tokenises the four common delimiter pairs and runs each math chunk through KaTeX renderToString; non-math text is preserved verbatim, whitespace included. Wired through ExamView for problem.content / choices / answer. KaTeX CSS is imported once in globals.css.

#3a: print cleanup
The (app) layout's sticky header (brand / locale / sign-out) and the "← back to bank" link on the generation page weren't marked data-no-print, so they were leaking into the printed sheet. Added the attribute to both. Combined with the existing data-no-print on the header progress block, action toggles, FigureSlot loading/error states, problem-tag row, and chat panel, "打印" (Print) now produces just the exam article: title + meta + sections + problem text + figures + (optional) answers.

#1: English in profile (delivered via re-aggregation, not code)
The legacy bank profile carried English snake_case keys for problem_type_distribution and an English string for style_profile.tone because it was aggregated before the prompt-language fix. Triggered a re-aggregation; the new profile has Chinese keys (选择题, 填空题, etc.) and Chinese values for tone / header_template / layout_pattern / typography. The frontend dictionary still carries fallback labels for any English-keyed banks aggregated before the prompt update. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
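
The tokenising step of the math typesetting change can be sketched as follows. The real component is a React component that hands each math chunk to KaTeX; this Python version only illustrates the splitting logic, and it assumes the four delimiter pairs are $…$, \(…\), \[…\] and $$…$$. The function name and token labels are hypothetical.

```python
import re

# Order matters: $$...$$ must be tried before $...$ so display math
# is not mistaken for two empty inline spans.
_MATH = re.compile(
    r"\$\$(?P<dd>.+?)\$\$"     # $$ ... $$  display
    r"|\\\[(?P<br>.+?)\\\]"    # \[ ... \]  display
    r"|\\\((?P<pa>.+?)\\\)"    # \( ... \)  inline
    r"|\$(?P<d>.+?)\$",        # $ ... $    inline
    re.DOTALL,
)


def tokenize(text: str) -> list:
    """Split text into ("text" | "inline" | "display", chunk) tuples."""
    tokens = []
    pos = 0
    for match in _MATH.finditer(text):
        if match.start() > pos:
            # Non-math text is kept verbatim, whitespace included.
            tokens.append(("text", text[pos:match.start()]))
        is_display = match.group("dd") is not None or match.group("br") is not None
        body = next(g for g in match.groups() if g is not None)
        tokens.append(("display" if is_display else "inline", body))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens
```

A renderer would then map "inline" and "display" tokens to the math engine's two modes and emit "text" tokens unchanged. This sketch does not handle escaped dollar signs, which a production tokeniser would probably need.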

waple0820 and others added 9 commits May 3, 2026 17:02

waple0820 merged commit 302aabc into main May 3, 2026

waple0820 merged 9 commits into main from feat/examcraft-mvp
