
Pivot to Steiner translation SFT training on RunPod#230

Open
nirdrang wants to merge 50 commits into VectifyAI:main from nirdrang:claude/start-new-session-MUXOF

Conversation

@nirdrang

Summary

This PR pivots the repository from a document indexing/RAG tool (PageIndex) to an English→Hebrew Steiner text translation project using supervised fine-tuning (SFT) on RunPod infrastructure.

Key Changes

Removed:

  • PageIndex document indexing module (pageindex/ package, run_pageindex.py)
  • All PageIndex cookbooks and tutorials
  • Test PDFs and generated structure JSON files
  • Original README, LICENSE, and CHANGELOG

Added:

  • Training data: steiner_3k_train.jsonl, steiner_20k_train.jsonl, steiner_val.jsonl for English→Hebrew translation pairs
  • SFT orchestration: sft/run_sft.py — RunPod pod lifecycle management (create, train, infer, cleanup)
  • Benchmarking: sft/benchmark_pod.py and sft/benchmark_sft.py for GPU performance measurement
  • Axolotl config: sft/axolotl_config_3k.yaml — LoRA fine-tuning on openai/gpt-oss-120b with MXFP4 quantization
  • Evaluation: run_eval.py for translation quality assessment using Vertex AI (COMET, BLEU, MetricX) + terminology recall
  • Batch translation: create_batch_gpt54mini.py for OpenAI batch API submission
  • Glossary: glossary.json with 179 Steiner-specific terminology mappings (English→Hebrew)
  • Evaluation datasets: Multiple batch input/output files and CSV results for no-glossary and with-glossary variants
  • Documentation: .claude/plans/lever-exploration-plan.md (1590 lines) detailing the SFT approach, memory model, and training strategy
  • RunPod skills: .claude/skills/flash/SKILL.md and .claude/skills/runpodctl/SKILL.md for deployment automation
  • Project config: .claude/settings.json, .claude/hooks/session-start.sh, CLAUDE.md for Claude Code environment setup

Modified:

  • requirements.txt: Replaced PageIndex dependencies (pymupdf, PyPDF2, tiktoken) with translation/evaluation stack (openai 2.30.0, google-cloud-aiplatform, pandas)
  • .gitignore: Added gcp-sa-key.json and sft/benchmark_results.json

Implementation Details

  • Training method: LoRA (not QLoRA) on native MXFP4 model to avoid quality loss while fitting in ~65GB VRAM with gradient checkpointing
  • Evaluation: Multi-metric approach combining automatic metrics (COMET, BLEU, MetricX) with terminology recall against glossary
  • Infrastructure: RunPod Secure Cloud with RTX PRO 6000 Blackwell GPUs, orchestrated via Python async/websocket API
  • Phase 1 focus: SFT only; Phase 2 (optional) would add self-rejection DPO
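The composite score reported later in this PR combines the four signals above, but the exact weighting is not shown here. A minimal sketch of one plausible composite, assuming equal weights and that MetricX (an error score, lower is better, roughly 0-25) is inverted and normalized; the real run_eval.py may differ:

```python
def composite_score(comet: float, bleu: float, metricx: float, term_recall: float) -> float:
    """Hypothetical equal-weight composite of translation quality signals.

    COMET, BLEU, and term recall already lie in [0, 1] (higher is better).
    MetricX is an error score (lower is better, ~0-25), so it is inverted
    and normalized before averaging. The actual weights in run_eval.py
    are not shown in this PR.
    """
    metricx_quality = max(0.0, 1.0 - metricx / 25.0)
    return (comet + bleu + metricx_quality + term_recall) / 4.0
```

Any rescaling of MetricX works here as long as it is monotone decreasing in the error score; the /25 divisor is just the metric's nominal range.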

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb

claude and others added 30 commits April 8, 2026 07:56
Installs project-level Claude Code skills for working with RunPod
infrastructure: flash (serverless GPU/CPU deployment SDK/CLI) and
runpodctl (pod/endpoint/volume management CLI).

https://claude.ai/code/session_012ryUKq1DUyU68zCTfdtBp4
…ion-WGq9u

Add RunPod flash and runpodctl Claude Code skills
Automatically installs runpodctl CLI and configures the API key from
.env on remote Claude Code sessions.

https://claude.ai/code/session_01Sng6PreXjajcMAQLc5WYkR
openai 1.101.0 -> 2.30.0, python-dotenv 1.1.0 -> 1.2.2, tiktoken 0.11.0 -> 0.12.0

https://claude.ai/code/session_01Sng6PreXjajcMAQLc5WYkR
Add session-start hook for runpodctl setup
Steiner English-to-Hebrew translation batch (200 requests) using fine-tuned GPT-4.1 model.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Second batch file (file-9qv2F9EJLyrACDQZNc6cy9) - same 200 translation requests with shorter instructions.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
…on scripts

- glossary.json: 179 English→Hebrew anthroposophical term mappings extracted from batch instructions
- create_batch_gpt54mini.py: creates and submits GPT-5.4 mini batch jobs (with/without glossary)
- run_eval.py: Vertex AI evaluation (COMET, BLEU, MetricX) + terminology recall + composite score
- Updated requirements.txt with google-cloud-aiplatform and pandas
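Terminology recall against the glossary amounts to: for each glossary entry whose English term appears in the source, check whether the mapped Hebrew term appears in the candidate translation. A naive sketch (the matching rules are assumptions, not the actual run_eval.py logic — real matching likely needs tokenization and Hebrew morphology handling):

```python
def term_recall(source: str, translation: str, glossary: dict[str, str]) -> float:
    """Fraction of triggered glossary terms rendered with the expected Hebrew.

    A glossary entry is 'triggered' when its English term occurs in the
    source (case-insensitive substring match). This is a simplified sketch;
    prefixed Hebrew articles etc. would defeat plain substring matching.
    """
    triggered = {en: he for en, he in glossary.items() if en.lower() in source.lower()}
    if not triggered:
        return 1.0  # nothing to check in this paragraph
    hits = sum(1 for he in triggered.values() if he in translation)
    return hits / len(triggered)
```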

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Selected 200 paragraphs from gpt41_full_train.jsonl that:
- Are NOT in gpt41_3k_priority.jsonl or gpt41_20k_val.jsonl
- Match the original 200 eval paragraphs in length and glossary term distribution
- Include both English source and Hebrew reference translations

Also adds gpt41_full_train.jsonl (20,281 training examples).

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
- eval_source.jsonl: 200 English paragraphs (custom_id + source)
- eval_reference.jsonl: 200 Hebrew reference translations (custom_id + reference)

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
gpt41_full_train.jsonl: 20281 → 20081 lines.
3k priority and val files already had zero overlap.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
gpt41_full_train.jsonl → gpt41_20k_train.jsonl
gpt41_3k_priority.jsonl → gpt41_3k_train.jsonl
gpt41_20k_val.jsonl → gpt41_val.jsonl

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
MetricX auto-detects the reference column and uses METRICX_24_SRC_REF
when it's present in the dataset DataFrame.
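The auto-detection described above reduces to picking the metric version from the dataset's columns. A sketch, assuming METRICX_24_SRC is the reference-free (QE) variant name in Vertex AI and that run_eval.py keys on a column literally named "reference":

```python
def pick_metricx_version(columns: set[str]) -> str:
    """Choose reference-based MetricX when a reference column exists.

    Version strings follow the Vertex AI naming used in this PR; the
    exact detection logic in run_eval.py may differ.
    """
    return "METRICX_24_SRC_REF" if "reference" in columns else "METRICX_24_SRC"
```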

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
claude added 20 commits April 9, 2026 07:05
EvalTask wrapper has a parsing bug for COMET/MetricX results.
Now uses EvaluationServiceClient directly with explicit language params:
- COMET_22_SRC_REF (en→he)
- METRICX_24_SRC_REF (en→he)
- BLEU

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Replaces direct API calls with proper SDK usage:
- pointwise_metric.Comet(source_language='en', target_language='he')
- pointwise_metric.MetricX(source_language='en', target_language='he', version='METRICX_24_SRC_REF')

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
COMET: 0.8262, BLEU: 0.1612, MetricX: 3.6761, Term recall: 0.8667
Composite: 0.7439

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
COMET: 0.8212, BLEU: 0.1504, MetricX: 3.6931, Term recall: 0.7400
Composite: 0.7280

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
- sft/axolotl_config_3k.yaml: LoRA config for GPT-OSS-120B with MXFP4
- sft/run_sft.py: Full orchestration (create pod, upload data, train, infer, cleanup)
- eval_results_with_glossary.csv: GPT-5.4 mini eval results

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
…script

- sft/axolotl_config_3k.yaml: LoRA config for GPT-OSS-120B (Mxfp4Config + dequantize, lora r=8, sample_packing, 1 epoch)
- sft/run_sft.py: full orchestrator (create pod, upload data, train, infer, cleanup) via Jupyter kernel API
- sft/benchmark_sft.py: SFT speed benchmark on a single GPU (~30 steps), measures samples/sec and steps/sec

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Skips pod creation; runs benchmark on already-running pods via Jupyter.
Supports parallel benchmarking of multiple pods.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Splits setup metrics into:
- axolotl_install_s
- model_download_s (snapshot_download from HF)
- train_time_s

Also skips pod uptime check (often inaccurate); polls Jupyter directly.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Websocket has 10MB message limit; the 3k train file is 10MB raw / 13MB b64.
Switching to PUT /api/contents/<path> avoids the websocket message limit.
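The Jupyter Server contents API accepts base64 file bodies via PUT /api/contents/<path>, which sidesteps the kernel-websocket message cap. A standard-library sketch (the pod URL and token are placeholders, and the helper names are not from run_sft.py):

```python
import base64
import json
import urllib.request

def build_contents_payload(data: bytes) -> dict:
    """JSON body the Jupyter contents API expects for a base64 file PUT."""
    return {
        "type": "file",
        "format": "base64",
        "content": base64.b64encode(data).decode("ascii"),
    }

def upload_file(base_url: str, token: str, remote_path: str, data: bytes) -> int:
    """PUT a file through Jupyter's REST contents API (no websocket limit).

    base_url like "https://<pod-id>-8888.proxy.runpod.net" (placeholder).
    """
    req = urllib.request.Request(
        f"{base_url}/api/contents/{remote_path}",
        data=json.dumps(build_contents_payload(data)).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"token {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```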

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
- pip install axolotl without [flash-attn] extras (compilation often fails)
- Use plain string concatenation for download script instead of multiline triple-quoted

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
…xolotl

The base pytorch image has debian-packaged cryptography without a RECORD file,
so pip cannot uninstall it. Installing with --ignore-installed bypasses this.

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb
Key additions to the SFT tuning plan:
- Lever exploration methodology with 3-stage evaluation (eval_loss → pairwise → composite)
- 3k vs 20k lever placement: data-shape levers (Tier 4) run on cheap 3k subset, hyperparameter sweeps (Tier 1) on 20k only
- num_epochs changed from hardcoded 2 to TBD via schedule shootout (1, 2, 3 tested empirically)
- Successive halving for family sweeps (LR, rank)
- Proposed 15-experiment sequence with cost estimates
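The successive-halving sweep mentioned above can be sketched as: evaluate all configs on a small budget, keep the top half, double the budget, repeat. The scoring callback and budgets are placeholders; in the plan, `evaluate` would be an SFT run scored by eval_loss or the composite metric:

```python
def successive_halving(configs, evaluate, budget: int = 1, rounds: int = 3):
    """Return the surviving config after halving rounds.

    `evaluate(config, budget)` returns a score (higher is better).
    Each round keeps the top half of survivors and doubles the budget.
    """
    survivors = list(configs)
    for _ in range(rounds):
        if len(survivors) <= 1:
            break
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0]
```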

https://claude.ai/code/session_01QT56nFPcmWN9mgE8aEbnLb

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@stomde stomde left a comment


I say i



@stomde stomde left a comment


Ini



3 participants