ExamCraft MVP: bank ingestion → AI exam generation → chat revision#1

Merged: waple0820 merged 9 commits into main from feat/examcraft-mvp on May 3, 2026

@waple0820 (Owner)

Summary

First end-to-end build of ExamCraft from an empty repo.

  • Upload teachers' exam papers (.docx / .pdf) into a question bank → soffice (serialized behind a lock) plus poppler render them to per-page PNGs → gpt-5.4 vision analyzes each page → a bank-level style + topic profile.
  • One click generates a fresh exam: gpt-5.4 builds a structured spec (the source of truth — questions, answers, knowledge points) → produces descriptive English page-prompts → gpt-image-2 fans them out with concurrency=3 and retry-jitter. Live SSE progress + page gallery.
  • Chat revision rewrites the spec from a parent's message, replans layout, and re-renders only the pages whose problem assignments changed.
  • Passwordless local auth, designer-feel UI (Inter + Fraunces, ivory + violet), Next.js 15.5.15 + FastAPI/uv.

Plan file: ~/.claude/plans/buzzing-sparking-raven.md. Memory: ~/.claude/projects/-Users-avatar-Desktop-projects-ExamCraft/memory/.

Validated end-to-end

  • Real DOCX from ~/Desktop/personal/试卷/ → 8 pages extracted → vision LLM correctly identified 标题 + 知识点 → bank profile captures Hubei middle-school style.
  • Generation built a 27-problem 6-page exam; gpt-image-2 produced visually-faithful pages (samples in backend/data/jobs/{id}/page_*.png after running).
  • Chat revision: input "再加一道关于圆的几何证明压轴题" → spec gained problem #27 (knowledge_point=圆) → page 6 re-rendered as a multi-part proof problem matching the bank's style.

Commits (7)

  • M1 backend + frontend skeletons
  • M2 passwordless cookie auth + bank CRUD
  • M3 ingestion pipeline (docx/pdf → vision → aggregation)
  • M4-M6 generation pipeline + chat revision + seed CLI (also fixes lib/ gitignore footgun that was dropping web/lib/* from M1+M2)
  • fix(M4) one bad page no longer fails the whole job
  • fix(M5) smarter revision diff (problem_ids per page, not prompt strings) + concurrent re-renders
  • fix(M5) revision worker writes terminal status back to DB

Known caveats (worth a glance during review)

  • gpt-image-2 has a ~10% silent-null rate; we retry 4× with jitter. Persistent 4xx on a single page is now non-fatal.
  • Ingestion / generation / revision aren't unit-tested (LLM + image API are hard to mock). Auth + bank CRUD have 7 passing tests.
  • Long verbatim CJK math text is intentionally NOT passed to gpt-image-2 — image is a stylized companion; spec JSON (printable, editable, displayed alongside) is the source of truth. This is the central product reframe the design hinges on.
  • soffice / pdftoppm are required brew installs; backend has a /api/system/check endpoint to verify.
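
The retry behaviour in the first caveat can be sketched as below. This is a minimal illustration, not the PR's image_gen code: the function name render_with_retry, the call parameter, and the exponential backoff schedule are assumptions; only the 4 attempts, the jitter, the treatment of a null response as a failure, and the ImageGenError name come from the description.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


class ImageGenError(RuntimeError):
    """Raised when every attempt failed or returned nothing."""


async def render_with_retry(
    call: Callable[[], Awaitable[Optional[bytes]]],  # hypothetical provider call
    attempts: int = 4,
    base_delay: float = 0.5,  # assumed value, not taken from the PR
) -> bytes:
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            image = await call()
            if image:
                return image
            # A silent null/empty response is treated like any other failure.
            last_error = ImageGenError("provider returned no image")
        except Exception as exc:  # real code would likely narrow this
            last_error = exc
        if attempt < attempts - 1:
            # Assumed schedule: exponential backoff plus random jitter so
            # concurrent pages do not all retry at the same instant.
            delay = base_delay * 2**attempt + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    raise ImageGenError(f"gave up after {attempts} attempts") from last_error
```

A caller that wants the non-fatal behaviour described above would catch ImageGenError per page rather than letting it abort the job.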

Test plan

  • make setup then make dev — backend on :8000, web on :3000
  • Sign in with any username at http://localhost:3000/login
  • Create a bank, upload one .docx from ~/Desktop/personal/试卷/, watch the status badge progress through extracting → analyzing → ready
  • Confirm the Bank Profile panel renders style + knowledge_point bars + summary
  • Click Generate exam; watch the SSE-driven progress, page gallery filling in, and the spec viewer (toggle answers, raw JSON)
  • Send a chat message like "把第3题换成关于二次函数的题" — confirm the assistant reply, the spec change, and only the affected pages re-render
  • Restart backend mid-generation — confirm the in-flight job is flipped to failed on next boot
  • cd backend && uv run pytest -q → 7/7 green

Shortcut for review

cd backend && rm -rf data/* && uv run examcraft-seed --limit 1
# auto-creates user "papa" + bank "九年级数学" + ingests one DOCX from ~/Desktop/personal/试卷/

🤖 Generated with Claude Code

waple0820 and others added 9 commits May 3, 2026 17:02
Backend (uv-managed FastAPI on Python 3.10+) exposes /api/health and is
wired for upcoming auth/banks/samples/generations/chat routers. Pydantic
settings load from repo-root .env and create the data/uploads/pages/jobs
directories on first boot.

Frontend (Next.js 15.5.15 + Tailwind + Inter/Fraunces) renders an
editorial home page that fetches the backend health JSON server-side, plus
a placeholder /login route so typedRoutes builds clean.

Makefile brings up both servers in parallel; README documents the
brew install poppler libreoffice prerequisite and the .env workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend gets a real schema (User, Bank) on aiosqlite + WAL, with a
lazily-initialized async engine so tests can pin EXAMCRAFT_DATA_DIR to
tmp_path. Auth is HMAC-signed cookies via itsdangerous — no DB sessions —
issued on /api/auth/login, cleared on /logout, validated on /me. Bank
routes are scoped to the calling user; tests cover the auth roundtrip,
CRUD lifecycle, and cross-user isolation.
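
The idea of a session cookie that needs no DB lookup can be sketched as follows. The commit uses itsdangerous; this sketch deliberately swaps in the standard library's hmac module to stay dependency-free, so the function names, the cookie layout, and the secret are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET = b"dev-only-secret"  # hypothetical; a real secret would come from .env


def sign(username: str, secret: bytes = SECRET) -> str:
    """Return 'payload.mac', where mac is an HMAC-SHA256 over the payload."""
    payload = base64.urlsafe_b64encode(username.encode()).decode()
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def unsign(cookie: str, secret: bytes = SECRET) -> Optional[str]:
    """Return the username if the signature verifies, otherwise None."""
    payload, sep, mac = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    return base64.urlsafe_b64decode(payload.encode()).decode()
```

Because the server can recompute the MAC from the cookie alone, validating /me only needs the secret, which is what makes a sessions table unnecessary.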

Frontend route layout splits into a public /login and an auth-gated (app)
group. The (app) layout calls getMe() server-side and redirect("/login")s
when no session exists; SSR forwards the examcraft_session cookie to the
backend via cookies(). Dashboard lists banks with status pills; an inline
CreateBankCard handles the new-bank form. Bank detail is a shell with
M3/M4 placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend grows three services and one router. docrender shells out to
LibreOffice (serialized via asyncio.Lock + per-invocation
-env:UserInstallation profile dir) for docx/doc/odt/rtf, then to
pdf2image/poppler for the per-page PNGs at 200dpi. llm wraps
litellm.acompletion for chat / chat_json / vision_json, all hitting the
gateway in $OPENAI_BASE_URL with model gpt-5.4. ingestion stitches them
together: per-page vision concurrency=5, then a bank-level aggregation
that produces a style + topic profile JSON. JobRegistry holds onto the
asyncio tasks so they survive across the request response, and an
on-startup sweep flips any extracting/analyzing/running rows to error so
restart-mid-job is recoverable. New tables: SampleExam, SampleExamPage,
both cascading from Bank.
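
The LibreOffice serialization described above might look roughly like this. It is a sketch under assumptions: the names soffice_command and convert_to_pdf are hypothetical, and the error handling is simplified; the asyncio.Lock and the per-invocation -env:UserInstallation profile directory are the two elements taken from the commit message.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import List

# One conversion at a time: concurrent soffice processes that share a
# profile directory are prone to failing or hanging.
_soffice_lock = asyncio.Lock()


def soffice_command(src: Path, out_dir: Path, profile_dir: Path) -> List[str]:
    """Build the headless conversion command with an isolated profile."""
    return [
        "soffice",
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


async def convert_to_pdf(src: Path, out_dir: Path) -> Path:
    """Convert one document to PDF, holding the lock for the whole run."""
    async with _soffice_lock:
        with tempfile.TemporaryDirectory(prefix="lo-profile-") as profile:
            proc = await asyncio.create_subprocess_exec(
                *soffice_command(src, out_dir, Path(profile))
            )
            if await proc.wait() != 0:
                raise RuntimeError(f"soffice failed on {src.name}")
    return out_dir / f"{src.stem}.pdf"
```

The resulting PDF would then be handed to pdf2image/poppler for the 200dpi page PNGs; that step is omitted here.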

Frontend bank-detail page gets three live sections: drop-zone uploader,
sample list with 2s polling while any item is in flight, and an analysis
panel that renders style profile, knowledge-point bars, problem-type bars,
and the raw JSON behind a disclosure. Server actions + cookie forwarding
follow the M2 pattern; new client helpers cover upload/delete/refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M4 — generation pipeline
- image_gen calls gpt-image-2 via httpx with retry-jitter (4 attempts),
  bounded asyncio.Semaphore(3), decoder that handles all three observed
  response shapes (b64_json, data:image/png;base64,…, http url), and an
  ImageGenError on persistent failure.
- generation.run_generation orchestrates the full pipeline: bank profile
  → exam spec via gpt-5.4 → per-page descriptive English prompts via
  gpt-5.4 → image fan-out with an on_page callback that updates the DB
  and emits SSE events as each page lands. Spec JSON is the source of
  truth, the PNG is a stylized companion.
- New tables GenerationJob + GeneratedPage. Startup sweep flips any
  queued/running jobs to failed so a restart is recoverable.
- New per-job EventBus (app/sse.py) with bounded replay so the watch
  page survives refresh.
- Frontend: /generations/[id] watch page subscribes to /events via
  EventSource, drives a progress bar, page gallery that fills live, an
  activity log, and a structured spec viewer with toggleable answers +
  raw JSON.
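
A per-job event bus with bounded replay, as mentioned in the M4 notes, could be sketched like this. The class below is an illustration and not the contents of app/sse.py: the method names, the replay_limit default, and the use of one asyncio.Queue per subscriber are assumptions.

```python
import asyncio
from collections import deque
from typing import Any, Deque, Dict, List

Event = Dict[str, Any]


class EventBus:
    """Fan events out to subscribers and replay recent ones to late joiners."""

    def __init__(self, replay_limit: int = 100) -> None:
        # deque(maxlen=...) silently drops the oldest event once full,
        # which is what keeps the replay buffer bounded.
        self._history: Deque[Event] = deque(maxlen=replay_limit)
        self._subscribers: List[asyncio.Queue] = []

    def publish(self, event: Event) -> None:
        self._history.append(event)
        for queue in self._subscribers:
            queue.put_nowait(event)

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        # A refreshed watch page first receives the buffered events.
        for event in self._history:
            queue.put_nowait(event)
        self._subscribers.append(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.remove(queue)
```

An SSE endpoint would call subscribe() when the EventSource connects, stream items from the queue, and call unsubscribe() on disconnect.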

M5 — chat revision
- ChatMessage table, /api/generations/{id}/chat endpoints,
  revision.apply_revision worker. Worker has the LLM rewrite the spec
  given chat history, re-runs the layout planner, diffs prompts, and
  re-renders only the pages whose prompt changed.
- Frontend ReviseChat panel embedded in the watch page; SSE streams the
  assistant reply and the re-rendered pages back into place.

M6 — polish
- examcraft-seed CLI: auto-creates a user + bank and queues ingestion
  of every supported file in a source directory (default
  ~/Desktop/personal/试卷/), with a --no-aggregate flag for stepping
  manually through the pipeline.
- JobRegistry.in_flight() so the seeder waits without touching
  internals; ruff sweep across new modules.

Also fixes a real footgun: .gitignore's "lib/" was matching web/lib/ at
any depth, silently dropping the entire frontend API layer from M1+M2
commits. Pinned to /lib/ and committed the missing files. Without this
the repo did not actually build from a fresh clone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
image_gen.generate_many now returns Path | Exception per index instead of
raising on the first failure, and run_generation hooks both on_page and
on_page_error so the DB records per-page status. The job ends in done
state with current_step explaining how many failed; the gallery shows a
"failed — chat to retry this page" placeholder for those pages, and a
new SSE event page_error keeps the watch UI in sync.

Caught from the first end-to-end run: 6/7 pages rendered fine, page 2
got persistent provider 400s through all 4 retries, and the whole job
status flipped to failed. With this change the user gets a usable result
plus a clear retry path through chat revision.
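
The "Path | Exception per index" contract can be illustrated with asyncio.gather and return_exceptions=True. This is a sketch, not the PR's image_gen.generate_many: the render parameter is a hypothetical stand-in for the provider call, and the on_page / on_page_error callbacks are left out.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, List, Union


async def generate_many(
    prompts: List[str],
    render: Callable[[int, str], Awaitable[Path]],  # hypothetical renderer
    concurrency: int = 3,
) -> List[Union[Path, BaseException]]:
    """Render every page; a failure occupies its slot instead of raising."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(index: int, prompt: str) -> Path:
        async with semaphore:  # at most `concurrency` renders in flight
            return await render(index, prompt)

    results = await asyncio.gather(
        *(one(i, p) for i, p in enumerate(prompts)),
        return_exceptions=True,  # exceptions become list entries
    )
    return list(results)
```

The caller can then walk the list by index, record a per-page error for every entry that is an exception, and still finish the job with the pages that succeeded.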

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes after the first end-to-end chat-revision run took ~15min for a
single problem swap:

- The diff was string-comparing prompt text. The layout planner produces
  slightly different prose every call even when the underlying problem
  assignment is identical, so essentially every page looked "changed" and
  got re-rendered. Now we diff the *set of problem_ids per page*, reading
  the previous layout from prompts.json and writing the new one back so
  subsequent revisions can diff against it. A revision that swaps one
  problem usually only re-renders that one page.

- Re-renders ran serially in a for-loop instead of going through the
  shared image_gen semaphore. Now uses asyncio.gather, with the same
  bounded concurrency=3 and partial-failure handling as the initial
  generation. on_page_error mirrors the generate path's behavior so the
  watch UI shows failed pages with a "chat to retry" placeholder.
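
The first fix, diffing the set of problem_ids per page instead of prompt text, can be sketched as a small pure function. The layout shape (page number mapped to problem ids) and the name pages_to_rerender are assumptions for illustration; reading and writing prompts.json is omitted.

```python
from typing import Dict, Iterable, List

Layout = Dict[int, Iterable[str]]  # page number -> problem ids on that page


def pages_to_rerender(previous: Layout, current: Layout) -> List[int]:
    """Return pages whose set of problem ids differs from the last layout."""
    changed = []
    for page, ids in current.items():
        # Comparing sets ignores ordering and prompt wording, so a page is
        # only re-rendered when its problem assignment really changed.
        if page not in previous or set(previous[page]) != set(ids):
            changed.append(page)
    return sorted(changed)
```

For example, swapping one problem on page 2 and appending a new page 3 yields [2, 3], while page 1 is left alone even if the planner reworded its prompt.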

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat endpoint flips the job to status='running' so the watch UI shows
progress, but apply_revision was only emitting SSE events without
mirroring the terminal state back to the DB. After every page rendered
the job sat in 'running' forever until a manual restart.

Now apply_revision calls _set_status on done / no-op done / failure, and
also resets progress_pct to 1.0 with a sensible current_step. It reuses
_set_status from generation.py to keep the path identical.
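
The shape of this fix, where every exit path writes a terminal status, can be sketched as follows. The wrapper name finish_revision, the set_status signature, and the step strings are assumptions; whether the real worker also sets progress_pct on failure is not stated in the commit and is assumed here.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def finish_revision(
    job_id: str,
    work: Callable[[], Awaitable[int]],  # hypothetical: returns pages re-rendered
    set_status: Callable[..., Awaitable[Any]],  # stand-in for _set_status
) -> str:
    """Run the revision and always record a terminal status for the job."""
    try:
        rerendered = await work()
    except Exception as exc:
        await set_status(
            job_id,
            status="failed",
            progress_pct=1.0,
            current_step=f"revision failed: {exc}",
        )
        return "failed"
    step = (
        f"re-rendered {rerendered} page(s)"
        if rerendered
        else "no pages needed re-rendering"  # the no-op done case
    )
    await set_status(job_id, status="done", progress_pct=1.0, current_step=step)
    return "done"
```

With this shape the job row cannot stay in 'running' once the worker returns, regardless of which branch was taken.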

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the browseruse-bench claude.yml — triggers on @claude mentions in
comments / reviews / issues, and on every PR open or push. Uses the same
self-hosted runner pattern because the LiteLLM gateway is on the internal
network and isn't reachable from GitHub-hosted runners.

Repo secrets ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL are set; pass them
through to anthropics/claude-code-action@v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts a82ebad. No self-hosted runner available for this repo, and
the LiteLLM gateway is internal-only so a GitHub-hosted runner can't
reach it. Repo secrets ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL have
been removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
waple0820 merged commit 302aabc into main on May 3, 2026
waple0820 added a commit that referenced this pull request on May 3, 2026
#2 — math typesetting
The LLM emits LaTeX delimiters (\(x^2\), \[…\], $…$, $$…$$) verbatim in
problem text and choices, so the page was showing raw "\(-3\)" instead
of "−3". Adds a tiny <MathText> component that tokenises the four
common delimiter pairs and runs each math chunk through KaTeX
renderToString; non-math text is preserved verbatim with whitespace.
Wired through ExamView for problem.content / choices / answer. KaTeX
CSS imported once in globals.css.
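
The tokenising step can be illustrated in isolation. The component in the commit is a frontend one that hands each math chunk to KaTeX; the sketch below is written in Python purely to show the splitting logic, and the function name, token labels, and regular expression are assumptions rather than the component's actual code.

```python
import re
from typing import List, Tuple

# The four delimiter pairs. $$…$$ is listed before $…$ so that a display
# block is not mistaken for two empty inline spans.
_MATH = re.compile(
    r"\$\$(.+?)\$\$"    # $$…$$  display
    r"|\\\[(.+?)\\\]"   # \[…\]  display
    r"|\\\((.+?)\\\)"   # \(…\)  inline
    r"|\$(.+?)\$",      # $…$    inline
    re.DOTALL,
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into ('text' | 'inline' | 'display', content) tokens."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    for match in _MATH.finditer(text):
        if match.start() > pos:
            # Non-math text is kept verbatim, whitespace included.
            tokens.append(("text", text[pos:match.start()]))
        dollars, brackets, parens, dollar = match.groups()
        if dollars is not None or brackets is not None:
            tokens.append(("display", dollars if dollars is not None else brackets))
        else:
            tokens.append(("inline", parens if parens is not None else dollar))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens
```

Each 'inline' or 'display' token would then go to the math renderer, while 'text' tokens are emitted unchanged. A literal dollar sign in prose (for example a price) would need escaping or extra handling that this sketch does not attempt.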

#3a — print cleanup
The (app) layout's sticky header (brand / locale / sign-out) and the
"← back to bank" link on the generation page weren't marked
data-no-print, so they were leaking into the printed sheet. Added the
attribute to both. Combined with the existing data-no-print on the
header progress block, action toggles, FigureSlot loading/error
states, problem-tag row, and chat panel, "打印" now produces just the
exam article: title + meta + sections + problem text + figures +
(optional) answers.

#1 — English in profile (delivered via re-aggregation, not code)
The legacy bank profile carried English snake_case keys for
problem_type_distribution and an English string for style_profile.tone
because it was aggregated before the prompt-language fix. Triggered a
re-aggregation; the new profile has Chinese keys (选择题, 填空题, etc.)
and Chinese values for tone / header_template / layout_pattern /
typography. The frontend dictionary still carries fallback labels for
any English-keyed banks aggregated before the prompt update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>