
Pivot to Steiner translation SFT training on RunPod#230

Open
nirdrang wants to merge 50 commits into VectifyAI:main from nirdrang:claude/start-new-session-MUXOF

Conversation

@nirdrang

Summary

This PR pivots the repository from a document indexing/RAG tool (PageIndex) to an English→Hebrew Steiner text translation project using supervised fine-tuning (SFT) on RunPod infrastructure.

Key Changes

Removed:

  • PageIndex document indexing module (pageindex/ package, run_pageindex.py)
  • All PageIndex cookbooks and tutorials
  • Test PDFs and generated structure JSON files
  • Original README, LICENSE, and CHANGELOG

Added:

  • Training data: steiner_3k_train.jsonl, steiner_20k_train.jsonl, steiner_val.jsonl for English→Hebrew translation pairs
  • SFT orchestration: sft/run_sft.py — RunPod pod lifecycle management (create, train, infer, cleanup)
  • Benchmarking: sft/benchmark_pod.py and sft/benchmark_sft.py for GPU performance measurement
  • Axolotl config: sft/axolotl_config_3k.yaml — LoRA fine-tuning on openai/gpt-oss-120b with MXFP4 quantization
  • Evaluation: run_eval.py for translation quality assessment using Vertex AI (COMET, BLEU, MetricX) + terminology recall
  • Batch translation: create_batch_gpt54mini.py for OpenAI batch API submission
  • Glossary: glossary.json with 179 Steiner-specific terminology mappings (English→Hebrew)
  • Evaluation datasets: Multiple batch input/output files and CSV results for no-glossary and with-glossary variants
  • Documentation: .claude/plans/lever-exploration-plan.md (1590 lines) detailing the SFT approach, memory model, and training strategy
  • RunPod skills: .claude/skills/flash/SKILL.md and .claude/skills/runpodctl/SKILL.md for deployment automation
  • Project config: .claude/settings.json, .claude/hooks/session-start.sh, CLAUDE.md for Claude Code environment setup

Modified:

  • requirements.txt: Replaced PageIndex dependencies (pymupdf, PyPDF2, tiktoken) with translation/evaluation stack (openai 2.30.0, google-cloud-aiplatform, pandas)
  • .gitignore: Added gcp-sa-key.json and sft/benchmark_results.json

Implementation Details

  • Training method: LoRA (not QLoRA) on native MXFP4 model to avoid quality loss while fitting in ~65GB VRAM with gradient checkpointing
  • Evaluation: Multi-metric approach combining automatic metrics (COMET, BLEU, MetricX) with terminology recall against glossary
  • Infrastructure: RunPod Secure Cloud with RTX PRO 6000 Blackwell GPUs, orchestrated via Python async/websocket API
  • Phase 1 focus: SFT only; Phase 2 (optional) would add self-rejection DPO
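The composite score reported later in this PR combines the four signals above, but the exact weighting is not shown here. A minimal sketch of one plausible composite, assuming equal weights and that MetricX (an error score, lower is better, roughly 0-25) is inverted and normalized; the real run_eval.py may differ:

```python
def composite_score(comet: float, bleu: float, metricx: float, term_recall: float) -> float:
    """Hypothetical equal-weight composite of translation quality signals.

    COMET, BLEU, and term recall already lie in [0, 1] (higher is better).
    MetricX is an error score (lower is better, ~0-25), so it is inverted
    and normalized before averaging. The actual weights in run_eval.py
    are not shown in this PR.
    """
    metricx_quality = max(0.0, 1.0 - metricx / 25.0)
    return (comet + bleu + metricx_quality + term_recall) / 4.0
```

Any rescaling of MetricX works here as long as it is monotone decreasing in the error score; the /25 divisor is just the metric's nominal range.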

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb

claude and others added 30 commits April 8, 2026 07:56
Installs project-level Claude Code skills for working with RunPod
infrastructure: flash (serverless GPU/CPU deployment SDK/CLI) and
runpodctl (pod/endpoint/volume management CLI).

https://claude.ai/code/session_012ryUKq1DUyU68zCTfdtBp4
…ion-WGq9u

Add RunPod flash and runpodctl Claude Code skills
Automatically installs runpodctl CLI and configures the API key from
.env on remote Claude Code sessions.

https://claude.ai/code/session_01Sng6PreXjajcMAQLc5WYkR
openai 1.101.0 -> 2.30.0, python-dotenv 1.1.0 -> 1.2.2, tiktoken 0.11.0 -> 0.12.0

https://claude.ai/code/session_01Sng6PreXjajcMAQLc5WYkR
Add session-start hook for runpodctl setup
Steiner English-to-Hebrew translation batch (200 requests) using fine-tuned GPT-4.1 model.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Second batch file (file-9qv2F9EJLyrACDQZNc6cy9) - same 200 translation requests with shorter instructions.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
…on scripts

- glossary.json: 179 English→Hebrew anthroposophical term mappings extracted from batch instructions
- create_batch_gpt54mini.py: creates and submits GPT-5.4 mini batch jobs (with/without glossary)
- run_eval.py: Vertex AI evaluation (COMET, BLEU, MetricX) + terminology recall + composite score
- Updated requirements.txt with google-cloud-aiplatform and pandas
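Terminology recall against the glossary amounts to: for each glossary entry whose English term appears in the source, check whether the mapped Hebrew term appears in the candidate translation. A naive sketch (the matching rules are assumptions, not the actual run_eval.py logic — real matching likely needs tokenization and Hebrew morphology handling):

```python
def term_recall(source: str, translation: str, glossary: dict[str, str]) -> float:
    """Fraction of triggered glossary terms rendered with the expected Hebrew.

    A glossary entry is 'triggered' when its English term occurs in the
    source (case-insensitive substring match). This is a simplified sketch;
    prefixed Hebrew articles etc. would defeat plain substring matching.
    """
    triggered = {en: he for en, he in glossary.items() if en.lower() in source.lower()}
    if not triggered:
        return 1.0  # nothing to check in this paragraph
    hits = sum(1 for he in triggered.values() if he in translation)
    return hits / len(triggered)
```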

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Selected 200 paragraphs from gpt41_full_train.jsonl that:
- Are NOT in gpt41_3k_priority.jsonl or gpt41_20k_val.jsonl
- Match the original 200 eval paragraphs in length and glossary term distribution
- Include both English source and Hebrew reference translations

Also adds gpt41_full_train.jsonl (20,281 training examples).

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
- eval_source.jsonl: 200 English paragraphs (custom_id + source)
- eval_reference.jsonl: 200 Hebrew reference translations (custom_id + reference)

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
gpt41_full_train.jsonl: 20281 → 20081 lines.
3k priority and val files already had zero overlap.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
gpt41_full_train.jsonl → gpt41_20k_train.jsonl
gpt41_3k_priority.jsonl → gpt41_3k_train.jsonl
gpt41_20k_val.jsonl → gpt41_val.jsonl

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
MetricX auto-detects the reference column and uses METRICX_24_SRC_REF
when it's present in the dataset DataFrame.
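The auto-detection described above reduces to picking the metric version from the dataset's columns. A sketch, assuming METRICX_24_SRC is the reference-free (QE) variant name in Vertex AI and that run_eval.py keys on a column literally named "reference":

```python
def pick_metricx_version(columns: set[str]) -> str:
    """Choose reference-based MetricX when a reference column exists.

    Version strings follow the Vertex AI naming used in this PR; the
    exact detection logic in run_eval.py may differ.
    """
    return "METRICX_24_SRC_REF" if "reference" in columns else "METRICX_24_SRC"
```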

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
claude added 20 commits April 9, 2026 07:05
EvalTask wrapper has a parsing bug for COMET/MetricX results.
Now uses EvaluationServiceClient directly with explicit language params:
- COMET_22_SRC_REF (en→he)
- METRICX_24_SRC_REF (en→he)
- BLEU

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Replaces direct API calls with proper SDK usage:
- pointwise_metric.Comet(source_language='en', target_language='he')
- pointwise_metric.MetricX(source_language='en', target_language='he', version='METRICX_24_SRC_REF')

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
COMET: 0.8262, BLEU: 0.1612, MetricX: 3.6761, Term recall: 0.8667
Composite: 0.7439

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
COMET: 0.8212, BLEU: 0.1504, MetricX: 3.6931, Term recall: 0.7400
Composite: 0.7280

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
- sft/axolotl_config_3k.yaml: LoRA config for GPT-OSS-120B with MXFP4
- sft/run_sft.py: Full orchestration (create pod, upload data, train, infer, cleanup)
- eval_results_with_glossary.csv: GPT-5.4 mini eval results

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
…script

- sft/axolotl_config_3k.yaml: LoRA config for GPT-OSS-120B (Mxfp4Config + dequantize, lora r=8, sample_packing, 1 epoch)
- sft/run_sft.py: full orchestrator (create pod, upload data, train, infer, cleanup) via Jupyter kernel API
- sft/benchmark_sft.py: SFT speed benchmark on a single GPU (~30 steps), measures samples/sec and steps/sec

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Skips pod creation; runs benchmark on already-running pods via Jupyter.
Supports parallel benchmarking of multiple pods.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Splits setup metrics into:
- axolotl_install_s
- model_download_s (snapshot_download from HF)
- train_time_s

Also skips pod uptime check (often inaccurate); polls Jupyter directly.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Websocket has 10MB message limit; the 3k train file is 10MB raw / 13MB b64.
Switching to PUT /api/contents/<path> avoids the websocket message limit.
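The Jupyter Server contents API accepts base64 file bodies via PUT /api/contents/<path>, which sidesteps the kernel-websocket message cap. A standard-library sketch (the pod URL and token are placeholders, and the helper names are not from run_sft.py):

```python
import base64
import json
import urllib.request

def build_contents_payload(data: bytes) -> dict:
    """JSON body the Jupyter contents API expects for a base64 file PUT."""
    return {
        "type": "file",
        "format": "base64",
        "content": base64.b64encode(data).decode("ascii"),
    }

def upload_file(base_url: str, token: str, remote_path: str, data: bytes) -> int:
    """PUT a file through Jupyter's REST contents API (no websocket limit).

    base_url like "https://<pod-id>-8888.proxy.runpod.net" (placeholder).
    """
    req = urllib.request.Request(
        f"{base_url}/api/contents/{remote_path}",
        data=json.dumps(build_contents_payload(data)).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"token {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```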

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
- pip install axolotl without [flash-attn] extras (compilation often fails)
- Use plain string concatenation for download script instead of multiline triple-quoted

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
…xolotl

The base pytorch image has debian-packaged cryptography without a RECORD file,
so pip cannot uninstall it. Installing with --ignore-installed bypasses this.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Key additions to the SFT tuning plan:
- Lever exploration methodology with 3-stage evaluation (eval_loss → pairwise → composite)
- 3k vs 20k lever placement: data-shape levers (Tier 4) run on cheap 3k subset, hyperparameter sweeps (Tier 1) on 20k only
- num_epochs changed from hardcoded 2 to TBD via schedule shootout (1, 2, 3 tested empirically)
- Successive halving for family sweeps (LR, rank)
- Proposed 15-experiment sequence with cost estimates
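The successive-halving sweep mentioned above can be sketched as: evaluate all configs on a small budget, keep the top half, double the budget, repeat. The scoring callback and budgets are placeholders; in the plan, `evaluate` would be an SFT run scored by eval_loss or the composite metric:

```python
def successive_halving(configs, evaluate, budget: int = 1, rounds: int = 3):
    """Return the surviving config after halving rounds.

    `evaluate(config, budget)` returns a score (higher is better).
    Each round keeps the top half of survivors and doubles the budget.
    """
    survivors = list(configs)
    for _ in range(rounds):
        if len(survivors) <= 1:
            break
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0]
```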

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@stomde stomde left a comment


I say i



@stomde stomde left a comment


Ini



3 participants