Merged
8 changes: 4 additions & 4 deletions .github/workflows/harness.yml
@@ -39,8 +39,8 @@ jobs:
- name: Run tests
run: npm test

- name: Run chat-demo (smoke test)
run: npx tsx examples/chat-demo.ts "What is 2+2?"
- name: Run chat pipeline demo (smoke test)
run: npx tsx examples/chat-pipeline-demo.ts "What is 2+2?"

- name: Run coder-demo (smoke test)
run: npx tsx examples/coder-demo.ts "Add error handling to the code"
- name: Run coder pipeline demo (smoke test)
run: npx tsx examples/coder-pipeline-demo.ts "Add error handling to the code"
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -23,6 +23,6 @@ The project follows semantic versioning for schema and registry compatibility:
### Changed

- Breaking temporal normalization across governance RFCs: canonical fields (`observed_at`, `decided_at`, `effective_at`, `expires_at`, `started_at`, `completed_at`, `superseded_at`) replace legacy aliases.
- Governance spine schemas updated (policy, permissions, delegation, audit, receipts, lifecycle, telemetry, memory, multi-agent protocol) and registry regenerated.
- Governance spine schemas updated (policy, permissions, delegation, audit, receipts, lifecycle, telemetry, memory, multi-party protocol) and registry regenerated.
- Reference harness runtime/types aligned to canonical temporal fields, logical ordering metadata, and updated governance artifacts.
- Example fixtures now use registry shortname folders for delegation, permissions, and execution/audit receipts.
6 changes: 3 additions & 3 deletions README.md
@@ -43,7 +43,7 @@ This makes Open CoT useful beyond any one framework. An implementation can use R
|------|------|
| [`rfcs/`](./rfcs/) | **53 RFCs** covering reasoning traces, tool invocation, governed execution, policy, delegation, receipts, capability manifests, cognitive artifacts, and reconciliation results |
| [`schemas/`](./schemas/) | Versioned JSON Schemas per RFC, including `registry.json` |
| [`harness/`](./harness/) | Reference TypeScript harness that exercises earlier governed execution RFCs |
| [`harness/`](./harness/) | Reference TypeScript core package that exercises earlier governed execution RFCs |
| [`examples/`](./examples/) | Validated instance fixtures keyed by registry shortname |
| [`reference/python/`](./reference/python/) | Reference Python tooling |
| [`tools/`](./tools/) | Schema and fixture validation, registry sync, and RFC helpers |
@@ -89,7 +89,7 @@ pip install -r requirements-tools.txt
python tools/validate.py
```

Run the reference harness:
Run the reference package:

```bash
cd harness && npm install && npm test
@@ -105,7 +105,7 @@ That implementation pressure-tests Open CoT. If Open Lagrange needs a portable s

- **53 RFCs** and a versioned JSON Schema registry.
- New draft schemas for cognitive artifacts and reconciliation results.
- Reference harness coverage for governed execution, policy, delegation, receipts, budgets, and capability manifests.
- Reference package coverage for governed execution, policy, delegation, receipts, budgets, and capability manifests.
- Cross-language validation tooling for schemas and examples.
- Experiment cards and local runbooks under [`docs/experiments/`](./docs/experiments/).

2 changes: 1 addition & 1 deletion datasets/synthetic/generate_scaled.py
@@ -35,7 +35,7 @@ def build_scaled_traces() -> list[dict[str, object]]:
"benchmark",
"validator",
"trace",
"agent",
"pipeline",
"memory",
"policy",
"budget",
2 changes: 1 addition & 1 deletion datasets/synthetic/task_bank_v1_large.jsonl
@@ -24,7 +24,7 @@
{"version": "0.1", "task": "Reverse the string 'benchmark'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'benchmark'[::-1] -> 'kramhcneb'", "parent": "s1"}], "final_answer": "kramhcneb"}
{"version": "0.1", "task": "Reverse the string 'validator'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'validator'[::-1] -> 'rotadilav'", "parent": "s1"}], "final_answer": "rotadilav"}
{"version": "0.1", "task": "Reverse the string 'trace'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'trace'[::-1] -> 'ecart'", "parent": "s1"}], "final_answer": "ecart"}
{"version": "0.1", "task": "Reverse the string 'agent'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'agent'[::-1] -> 'tnega'", "parent": "s1"}], "final_answer": "tnega"}
{"version": "0.1", "task": "Reverse the string 'cognitive pipeline'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'cognitive pipeline'[::-1] -> 'enilepip evitingoc'", "parent": "s1"}], "final_answer": "enilepip evitingoc"}
{"version": "0.1", "task": "Reverse the string 'memory'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'memory'[::-1] -> 'yromem'", "parent": "s1"}], "final_answer": "yromem"}
{"version": "0.1", "task": "Reverse the string 'policy'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'policy'[::-1] -> 'ycilop'", "parent": "s1"}], "final_answer": "ycilop"}
{"version": "0.1", "task": "Reverse the string 'budget'.", "steps": [{"id": "s1", "type": "thought", "content": "Read characters from right to left."}, {"id": "s2", "type": "code", "content": "'budget'[::-1] -> 'tegdub'", "parent": "s1"}], "final_answer": "tegdub"}
116 changes: 58 additions & 58 deletions docs/bibliography.md
@@ -1,4 +1,4 @@
# 📚 Annotated Bibliography: Chain‑of‑Thought & LLM Reasoning
# 📚 Annotated Bibliography: Chain‑of‑Thought & LLM Reasoning
*With direct arXiv PDF links where available.*

A curated bibliography covering foundational, structured, search‑based, RL‑based, and mechanistic reasoning research for LLMs. All arXiv‑hosted papers include stable PDF links.
@@ -7,109 +7,109 @@ A curated bibliography covering foundational, structured, search‑based, RL‑b

## 1. Foundational Chain‑of‑Thought (CoT)

### Wei et al. (2022). *Chain‑of‑Thought Prompting Elicits Reasoning in Large Language Models.*
https://arxiv.org/pdf/2201.11903.pdf
Introduces CoT prompting and demonstrates large gains in arithmetic, symbolic, and commonsense reasoning.
### Wei et al. (2022). *Chain‑of‑Thought Prompting Elicits Reasoning in Large Language Models.*
https://arxiv.org/pdf/2201.11903.pdf
Introduces CoT prompting and demonstrates large gains in arithmetic, symbolic, and commonsense reasoning.
**Relevance:** Defines the modern concept of “reasoning traces.”

### Wang et al. (2022). *Self‑Consistency Improves Chain‑of‑Thought Reasoning in LLMs.*
https://arxiv.org/pdf/2203.11171.pdf
Proposes sampling multiple CoTs and voting for the most consistent answer.
### Wang et al. (2022). *Self‑Consistency Improves Chain‑of‑Thought Reasoning in LLMs.*
https://arxiv.org/pdf/2203.11171.pdf
Proposes sampling multiple CoTs and voting for the most consistent answer.
**Relevance:** Establishes statistical evaluation of reasoning.

### Zhou et al. (2022). *Least‑to‑Most Prompting.*
https://arxiv.org/pdf/2205.10625.pdf
Breaks complex tasks into simpler subproblems.
### Zhou et al. (2022). *Least‑to‑Most Prompting.*
https://arxiv.org/pdf/2205.10625.pdf
Breaks complex tasks into simpler subproblems.
**Relevance:** Motivates structured decomposition fields in reasoning schemas.

---

## 2. Structured Reasoning & Agentic CoT

### Yao et al. (2022). *ReAct: Synergizing Reasoning and Acting in Language Models.*
https://arxiv.org/pdf/2210.03629.pdf
Combines reasoning (“Thought”) with tool actions (“Act”).
**Relevance:** Foundation of modern agent loops.
### Yao et al. (2022). *ReAct: Synergizing Reasoning and Acting in Language Models.*
https://arxiv.org/pdf/2210.03629.pdf
Combines reasoning (“Thought”) with tool actions (“Act”).
**Relevance:** Foundation of modern cognitive pipelines.

### Shinn et al. (2023). *Reflexion: Language Agents with Verbal Reinforcement Learning.*
https://arxiv.org/pdf/2303.11366.pdf
Introduces self‑critique and iterative refinement loops.
### Shinn et al. (2023). *Reflexion: Language Agents with Verbal Reinforcement Learning.*
https://arxiv.org/pdf/2303.11366.pdf
Introduces self‑critique and iterative refinement loops.
**Relevance:** Motivates `critique` and `revision` fields in schemas.

### Chen et al. (2022). *Program‑of‑Thoughts (PoT).*
https://arxiv.org/pdf/2211.12588.pdf
Uses executable code as reasoning traces.
### Chen et al. (2022). *Program‑of‑Thoughts (PoT).*
https://arxiv.org/pdf/2211.12588.pdf
Uses executable code as reasoning traces.
**Relevance:** Demonstrates typed, verifiable reasoning.

---

## 3. Search‑Based Reasoning (Beyond Linear CoT)

### Yao et al. (2023). *Tree‑of‑Thoughts: Deliberate Problem Solving with Large Language Models.*
https://arxiv.org/pdf/2305.10601.pdf
Generalizes CoT into a search tree with branching and pruning.
### Yao et al. (2023). *Tree‑of‑Thoughts: Deliberate Problem Solving with Large Language Models.*
https://arxiv.org/pdf/2305.10601.pdf
Generalizes CoT into a search tree with branching and pruning.
**Relevance:** Motivates branching reasoning structures.

### Besta et al. (2023). *Graph‑of‑Thoughts: Solving Elaborate Problems with Large Language Models.*
https://arxiv.org/pdf/2308.09687.pdf
Extends ToT into graph‑structured reasoning.
### Besta et al. (2023). *Graph‑of‑Thoughts: Solving Elaborate Problems with Large Language Models.*
https://arxiv.org/pdf/2308.09687.pdf
Extends ToT into graph‑structured reasoning.
**Relevance:** Encourages flexible graph‑based schemas.

### Long‑Horizon CoT Studies
(Various works; no single canonical arXiv source.)
Show that longer reasoning traces improve performance but increase instability.
### Long‑Horizon CoT Studies
(Various works; no single canonical arXiv source.)
Show that longer reasoning traces improve performance but increase instability.
**Relevance:** Motivates metadata like `confidence`, `verification_status`, and `error_type`.

---

## 4. RL‑Based Reasoning (R1‑Style, DeepSeek‑Style, Qwen‑Style)

### DeepSeek‑R1 (2025). *DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.*
https://arxiv.org/pdf/2501.12948.pdf
Uses RL with verifiable rewards to produce long, structured reasoning.
### DeepSeek‑R1 (2025). *DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.*
https://arxiv.org/pdf/2501.12948.pdf
Uses RL with verifiable rewards to produce long, structured reasoning.
**Relevance:** Aligns with structured scratchpad formats.

### Qwen2.5‑R1 (2024). *Reinforcement Learning for Reasoning.*
https://arxiv.org/pdf/2501.19393.pdf
Documents RL-centric post-training strategies that improve reasoning quality while preserving broad instruction utility.
### Qwen2.5‑R1 (2024). *Reinforcement Learning for Reasoning.*
https://arxiv.org/pdf/2501.19393.pdf
Documents RL-centric post-training strategies that improve reasoning quality while preserving broad instruction utility.
**Relevance:** Supports reward-aware post-training pipelines and reproducibility-oriented run metadata.

---

## 5. Evaluation, Reliability, and Calibration

### Lin et al. (2021). *TruthfulQA: Measuring How Models Mimic Human Falsehoods.*
https://arxiv.org/pdf/2109.07958.pdf
Introduces reliability-oriented evaluation emphasizing truthful behavior under difficult prompts.
### Lin et al. (2021). *TruthfulQA: Measuring How Models Mimic Human Falsehoods.*
https://arxiv.org/pdf/2109.07958.pdf
Introduces reliability-oriented evaluation emphasizing truthful behavior under difficult prompts.
**Relevance:** Motivates safety-aware benchmark slices and failure-mode tracking.

### Kadavath et al. (2022). *Language Models (Mostly) Know What They Know.*
https://arxiv.org/pdf/2207.05221.pdf
Studies calibration and confidence quality in language models.
### Kadavath et al. (2022). *Language Models (Mostly) Know What They Know.*
https://arxiv.org/pdf/2207.05221.pdf
Studies calibration and confidence quality in language models.
**Relevance:** Motivates confidence and uncertainty metrics in verifier outputs.

### Gao et al. (2023). *PAL: Program-aided Language Models.*
https://arxiv.org/pdf/2211.10435.pdf
Uses executable programs to verify intermediate reasoning steps.
### Gao et al. (2023). *PAL: Program-aided Language Models.*
https://arxiv.org/pdf/2211.10435.pdf
Uses executable programs to verify intermediate reasoning steps.
**Relevance:** Supports stronger step-level verification beyond format checks.

---

## 6. Open-Source Tooling and Reuse Guidance

### EleutherAI LM Evaluation Harness
https://github.com/EleutherAI/lm-evaluation-harness
De facto open benchmark runner for reproducible LLM evaluation.
### EleutherAI LM Evaluation Harness
https://github.com/EleutherAI/lm-evaluation-harness
De facto open benchmark runner for reproducible LLM evaluation.
**Relevance:** Should be integrated through adapters rather than reimplemented.

### Hugging Face TRL
https://github.com/huggingface/trl
Open-source stack for SFT, DPO, PPO/GRPO-style fine-tuning workflows.
### Hugging Face TRL
https://github.com/huggingface/trl
Open-source stack for SFT, DPO, PPO/GRPO-style fine-tuning workflows.
**Relevance:** Preferred training primitive for alignment and preference experiments.

### vLLM
https://github.com/vllm-project/vllm
High-throughput inference engine with consistent generation behavior for evaluation and serving.
### vLLM
https://github.com/vllm-project/vllm
High-throughput inference engine with consistent generation behavior for evaluation and serving.
**Relevance:** Stabilizes benchmark throughput and reproducibility for large eval runs.

---
@@ -142,9 +142,9 @@ Reports 56% token reduction vs JSON with native relationship support, type safet

Use this checklist when building, fine-tuning, and validating models with Open CoT:

1. **Always emit structured traces** (`version`, `task`, `steps`, `final_answer`) and validate before scoring.
2. **Use multi-sample evaluation** with consistency metrics, not only single greedy outputs.
3. **Track lineage metadata** (dataset hash, model base, adapter hash, seed, decoding config) for every run.
4. **Enforce data governance gates** (license allowlist, dedup, contamination checks, provenance fields).
5. **Run policy and safety checks** (budget limits, tool restrictions, redaction/audit events) in runtime scripts.
1. **Always emit structured traces** (`version`, `task`, `steps`, `final_answer`) and validate before scoring.
2. **Use multi-sample evaluation** with consistency metrics, not only single greedy outputs.
3. **Track lineage metadata** (dataset hash, model base, adapter hash, seed, decoding config) for every run.
4. **Enforce data governance gates** (license allowlist, dedup, contamination checks, provenance fields).
5. **Run policy and safety checks** (budget limits, tool restrictions, redaction/audit events) in runtime scripts.
6. **Reuse mature OSS tooling** for training/evaluation kernels and keep Open CoT logic focused on schemas/adapters/conformance.
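
Checklist items 1 and 2 could be sketched roughly as follows. This is an illustration only, not Open CoT tooling: `validate_trace` and `majority_answer` are hypothetical names, the required-field check stands in for real JSON Schema validation against the registry, and the rule that a step's `parent` must appear earlier in the trace is an assumption drawn from the task-bank examples.

```python
from collections import Counter
from typing import Optional

# Top-level fields named in checklist item 1.
REQUIRED_FIELDS = ("version", "task", "steps", "final_answer")


def validate_trace(trace: dict) -> list:
    """Return a list of problems; an empty list means the trace passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in trace]
    steps = trace.get("steps", [])
    if not isinstance(steps, list) or not steps:
        problems.append("steps must be a non-empty list")
        return problems
    seen_ids = set()
    for step in steps:
        # Assumption: a step may only reference a parent that appeared earlier.
        parent = step.get("parent")
        if parent is not None and parent not in seen_ids:
            problems.append(f"step {step.get('id')}: unknown parent {parent}")
        seen_ids.add(step.get("id"))
    return problems


def majority_answer(traces: list) -> tuple:
    """Vote over valid traces only; return (answer, agreement ratio)."""
    answers = [t["final_answer"] for t in traces if not validate_trace(t)]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)
```

Invalid traces are dropped before voting, so a malformed sample cannot sway the consistency score; the agreement ratio is one simple consistency metric among several possible choices.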
54 changes: 54 additions & 0 deletions docs/cognitive-participation-pivot.md
@@ -0,0 +1,54 @@
# Cognitive Participation Pivot

Open CoT defines a portable interface between cognition and execution. The
model contributes fuzzy text processing and structured cognitive artifacts; the
runtime boundary validates, authorizes, executes, records observations, and
reconciles final state.

This distinction matters because natural-language reasoning is useful evidence,
but it is not authority. A reasoning trace can explain how a model reached a
proposal. It cannot grant permission, prove correctness, or bypass policy.

| Common market framing | Open CoT framing |
| --- | --- |
| A model owns the loop | A runtime boundary owns reconciliation |
| Tool use is part of the model experience | Endpoint execution is a governed side effect |
| Prompts carry safety expectations | Capability snapshots and policy gates carry authority |
| Reasoning explains the whole run | Reasoning is cognitive evidence inside a larger audit record |
| A failed tool call is explained by natural language | A failed endpoint execution is recorded as a structured observation and error |
| Safety is mostly instruction-following | Safety is layered validation, permission, budget, and result reconciliation |
| Interfaces are private runtime details | Interfaces are portable schemas that independent runtimes can implement |

## Reasoning Remains Central

Open CoT keeps reasoning traces because they are evidence of cognitive
participation. They help answer:

- What objective did the cognitive step believe it was handling?
- What constraints and assumptions shaped the proposal?
- What uncertainty was present before execution?
- What explanation can be shared safely with reviewers?
- What detailed evidence, if any, must remain restricted or redacted?

The trace is intentionally separated from execution authority. A runtime may use
reasoning evidence during review, auditing, debugging, or evaluation, but it
must reconcile execution intents against capability snapshots, policy gates,
budgets, preconditions, and endpoint results.
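
A minimal sketch of that reconciliation step, under stated assumptions: every class name, field name, and rejection reason below is hypothetical and not taken from the Open CoT schemas, and preconditions and endpoint results are left out for brevity. The reasoning text travels with the intent as evidence but is never consulted when deciding.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExecutionIntent:
    endpoint: str            # endpoint the cognitive step proposes to call
    estimated_cost: float    # cost the call would charge against the budget
    reasoning: str = ""      # evidence for reviewers; never used for authorization


@dataclass(frozen=True)
class CapabilitySnapshot:
    allowed_endpoints: frozenset


@dataclass
class Budget:
    remaining: float


@dataclass
class ReconciliationResult:
    approved: bool
    reasons: list = field(default_factory=list)


def reconcile(intent, snapshot, budget, policy_gates=()):
    """Approve an intent only if every check passes; record all failures."""
    reasons = []
    if intent.endpoint not in snapshot.allowed_endpoints:
        reasons.append("capability_missing")
    if intent.estimated_cost > budget.remaining:
        reasons.append("budget_exceeded")
    for gate in policy_gates:
        # Each gate is a callable returning None (pass) or a denial reason.
        denial = gate(intent)
        if denial is not None:
            reasons.append(denial)
    if not reasons:
        # Only an approved intent is charged against the budget.
        budget.remaining -= intent.estimated_cost
    return ReconciliationResult(approved=not reasons, reasons=reasons)


snapshot = CapabilitySnapshot(frozenset({"search"}))
budget = Budget(remaining=1.0)
intent = ExecutionIntent("delete_repo", 0.1, reasoning="The user clearly wants this.")
print(reconcile(intent, snapshot, budget).reasons)  # ['capability_missing']
```

However persuasive the `reasoning` string is, the intent is rejected because the capability snapshot does not list the endpoint, which is the separation of evidence from authority described above.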

## Interface Boundary

Open CoT should standardize portable artifacts:

- Cognitive artifacts.
- Capability snapshots.
- Execution intents.
- Reasoning evidence.
- Observations.
- Policy evaluation records.
- Reconciliation results.
- Error taxonomy.
- Budget and cost boundaries.

Open Lagrange and other implementations can then choose their own durable
runtime, transport, endpoint registry, policy engine, and storage model while
still sharing the same interface contract.