Capability Cartography Layer 3

Capability Cartography Layer 3 is the direct successor to Capability-Cartography-Layer-2. It keeps the full Layer 2 spine — measurement, classification, failure atlases, visualization, notebook execution, and agent integration — but extends it with causal explanation: estimator registries, identification-aware failure classification, middle-regime analysis, and transfer diagnostics.

This repository is designed to sit on top of four companion resources:

pageman/sutskever-30-implementations as the experimental substrate
pageman/Sutskever-Agent as the orchestration and explanatory layer
pageman/gpt1-from-sutskever30 as the first controlled transformer wind tunnel
pageman/sutskever-30-beyond-numpy as the multi-backend triangulation layer (new in Layer 3)

These links are preserved explicitly in code and artifacts. The adapter layer records canonical repository URLs and configured local roots so every cartography export can retain its provenance.

External Baseline Note

This repository should be read alongside the external note:

Downloads/In-Paper Causality Analysis- 30 Sutskever Papers × 27 Estimators × 6 Backends × CCL2.md

That note is useful for the estimator inventory and source list, but it is not identical to the current Layer 3 outputs. In particular:

the note reports 22 CONFIRMED / 8 CONDITIONAL / 0 UNCONFIRMED
the current Layer 3 artifacts report 27 CONFIRMED / 0 CONDITIONAL / 3 UNCONFIRMED
the note treats SplitUP ℓ1 as the universal best estimator
the current Layer 3 implementation usually selects spaceTSIV, IVW, TS-IV, or SplitUP (dense) as the lowest-MSE estimator, depending on paper and regime

When these differ, treat artifacts/layer3/causal/*.json as the repository's current source of truth.
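Because the causal JSON exports are the reference point whenever the baseline note disagrees, it can help to load them all in one place before comparing numbers. The sketch below assumes only the directory path stated above; `load_causal_artifacts` is a hypothetical helper, not part of the package API, and it makes no assumption about the internal schema of each file.

```python
import json
from pathlib import Path


def load_causal_artifacts(root="artifacts/layer3/causal"):
    """Load every JSON export in the causal artifact directory, keyed by file stem.

    Non-JSON files (e.g. the PNG plots) are ignored by the glob pattern.
    """
    artifacts = {}
    for path in sorted(Path(root).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            artifacts[path.stem] = json.load(fh)
    return artifacts


if __name__ == "__main__":
    # Prints nothing if the demo has not produced the artifacts yet.
    for name, payload in load_causal_artifacts().items():
        print(name, type(payload).__name__)
```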

The Layer Progression

Layer	Question	Verb	Key Addition
1	"What happened?"	Measures	Schemas, sweeps, surfaces, validation, falsifiable laws
2	"What kind of failure is this?"	Classifies	Failure atlas, visualization, notebook execution, agent briefs
3	"Why did it fail?"	Explains	Causal registry, estimator sweep, causal atlas, middle-regime analysis, transfer diagnostics

Reader Guide

If you want the shortest reader-friendly interpretation of what the current results do and do not establish, start with TAO_ASSESSMENT.md. It evaluates the repository against three specific mysteries that Layer 2 left unresolved:

The causal explanation for why retrieval is harder
Whether these laws transfer to GPT-4-scale models
A mathematical theory of the "middle regime" itself

Narrative Arc And Methodological Arc

Narrative Arc

Layer 1 asked what happened. Layer 2 asked what kind of failure happened. Layer 3 asks why a mechanism does or does not causally explain a capability. The arc of the project is therefore a progression from measurement, to classification, to explanation, and finally to explicit criteria for when a causal claim is justified.

Methodological Arc

The methodological arc mirrors that narrative progression: run the system end to end, preserve comparability with earlier layers, make estimator assumptions explicit, separate historical baseline notes from current outputs, and require verdicts to follow an explicit policy rather than a vague confidence threshold.

Story Boxes

Story Box 1: From Prototype To Instrument
Layer 3 is not only meant to run; it is meant to support inspectable causal claims.

Story Box 2: Baseline Is Not Ground Truth
Historical notes and current artifacts can disagree; when they do, the current exported artifacts are the source of truth for the repository.

Story Box 3: Retrieval Is The Boundary Case
Retrieval papers remain the hardest regime because unpaired structure reduces estimator applicability and exposes the limits of naive causal claims.

Story Box 4: The Framework Must Judge Itself
The project now evaluates not only papers, but also the reliability of its own verdicting, ranking, and pathology labeling.

Method Boxes

Method Box 1: End-to-End Validation
Re-run the tests and the demo after each structural change, so the repository is treated as an empirical instrument rather than static code.

Method Box 2: Cross-Layer Consistency
Compare Layer 3 against Layers 1 and 2 at the package, artifact, and documentation levels to preserve the continuity of the research program.

Method Box 3: Explicit Verdict Policy
Assign CONFIRMED, CONDITIONAL, and UNCONFIRMED through a named policy with applicability and consensus thresholds.

Method Box 4: Estimator Ranking Discipline
Rank an estimator as "best" only when its implementation is exact rather than a proxy or fallback.

Method Box 5: Normalized Pathology Prediction
Predict causal-atlas labels in normalized feature space so that large-magnitude features do not dominate the classification.
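Method Boxes 3 and 4 can be made concrete with a small sketch. Everything here is an assumption for illustration: the threshold values, the use of sign agreement as the consensus measure, the `EstimatorRun` record, and the `"exact"`/`"proxy"`/`"fallback"` strings are hypothetical and are not taken from the repository's named policy. What the sketch does show is the shape of the rule: a verdict follows from explicit applicability and consensus cut-offs, and only exact implementations are eligible to be ranked "best".

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; the repository's named policy may differ.
MIN_APPLICABILITY = 0.5  # share of registry estimators that must apply to the record
MIN_CONSENSUS = 0.8      # share of applicable estimators that must agree on the effect's sign


@dataclass
class EstimatorRun:
    name: str
    estimate: float
    mse: float
    implementation: str  # assumed to be "exact", "proxy", or "fallback"


def verdict(runs, registry_size=27):
    """Three-way verdict from applicability and sign consensus."""
    if not runs:
        return "UNCONFIRMED"
    applicability = len(runs) / registry_size
    positive = sum(1 for r in runs if r.estimate > 0)  # zero counts as non-positive
    consensus = max(positive, len(runs) - positive) / len(runs)
    if consensus < MIN_CONSENSUS:
        return "UNCONFIRMED"
    # Estimators agree; confirm only if enough of the registry actually applies.
    return "CONFIRMED" if applicability >= MIN_APPLICABILITY else "CONDITIONAL"


def best_estimator(runs):
    """Lowest-MSE estimator among exact implementations only; None if there are none."""
    exact = [r for r in runs if r.implementation == "exact"]
    return min(exact, key=lambda r: r.mse).name if exact else None


if __name__ == "__main__":
    # Made-up numbers, not repository results.
    runs = [
        EstimatorRun("TS-IV", 0.42, 0.010, "exact"),
        EstimatorRun("IVW", 0.39, 0.008, "exact"),
        EstimatorRun("SplitUP_L1", 0.40, 0.005, "proxy"),
    ]
    print(verdict(runs))         # CONDITIONAL: only 3 of 27 apply, but all agree
    print(best_estimator(runs))  # IVW: SplitUP_L1 has lower MSE but is only a proxy
```

The ranking rule deliberately lets a higher-MSE exact estimator win over a lower-MSE proxy, which is the discipline Method Box 4 describes.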

What Layer 3 Adds

Five New Modules

Module	Purpose	Addresses
`causal_registry.py`	Registry of 27 IV/GMM/MR estimators with applicability conditions	The estimator landscape
`estimator_sweep.py`	Apply all applicable estimators to each measured record; compute consensus	"Which estimator works for which paper?"
`causal_atlas.py`	Classify failures by cause (unpaired_bias, weak_instrument, etc.) not just symptom	Mystery #1: Why retrieval is harder
`middle_regime.py`	Schur et al. (2026) regime classification, bias computation, boundary detection	Mystery #3: Mathematical theory of the middle regime
`transfer_diagnostics.py`	Flag each finding as scale-invariant or scale-dependent	Mystery #2: Whether laws transfer
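The estimator-sweep idea (apply every applicable estimator to a record, then summarize agreement) can be sketched without the real estimators. This is a minimal sketch under stated assumptions: the `Record` fields `paired` and `n_environments`, the `RegistryEntry` shape, the toy applicability rules, and the choice of the median as the consensus summary are all hypothetical; the placeholder estimators simply return fixed numbers.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    paper: str
    paired: bool         # hypothetical: mechanism and capability measured on the same units
    n_environments: int  # hypothetical: number of experimental environments


@dataclass
class RegistryEntry:
    name: str
    applicable: Callable[[Record], bool]  # the entry's applicability condition
    estimate: Callable[[Record], float]   # placeholder for the real estimator


def sweep(record: Record, registry: List[RegistryEntry]) -> Dict[str, object]:
    """Apply every applicable estimator to one record and summarize with the median."""
    results = {e.name: e.estimate(record) for e in registry if e.applicable(record)}
    consensus: Optional[float] = median(results.values()) if results else None
    return {"paper": record.paper, "results": results, "consensus": consensus}


# Toy registry: applicability rules and returned values are illustrative only.
TOY_REGISTRY = [
    RegistryEntry("2SLS", lambda r: r.paired, lambda r: 0.50),
    RegistryEntry("TS-IV", lambda r: not r.paired, lambda r: 0.35),
    RegistryEntry("SplitUP (dense)", lambda r: r.n_environments >= 10, lambda r: 0.48),
]

if __name__ == "__main__":
    # An unpaired record skips 2SLS, mirroring why retrieval papers have fewer applicable estimators.
    print(sweep(Record("P28", paired=False, n_environments=50), TOY_REGISTRY))
```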

Extended Modules

Module	What Changed
`schemas.py`	Added `CausalEstimator`, `EstimatorResult`, `CausalRecord`, `MiddleRegimeProfile`, `TransferDiagnostic`
`adapters.py`	Added `BeyondNumpyAdapter` for the multi-backend substrate
`orchestration.py`	Extended pipeline: L2 steps → estimator sweep → causal atlas → middle regime → transfer diagnostics
`demo.py`	Extended demo with all Layer 3 steps

Inherited Modules (Unchanged from Layer 2)

boundary.py, compressibility.py, datasets.py, descriptors.py, execution.py, failure_atlas.py, metrics.py, notebook_runner.py, provenance.py, runner.py, storage.py, surfaces.py, sweeps.py, validation.py, visualization.py, agent_integration.py

The 27 Estimators

The causal registry contains estimators from six families:

Family	Count	Key Members
classical_IV	7	Naive OLS, 2SLS, LIML, Fuller-k, JIVE, RJIVE, SS-IV
two_sample_IV	2	TS-IV, TS-2SLS
unpaired_GMM	2	UP-GMM, UP-GMM l1
splitUP	3	SplitUP (dense), SplitUP l1, SplitUP (analytic)
sparse_regularized	8	l1-Reg 2SLS, Lasso-GMM, GMM-Lasso, FGMM, Desparsified GMM, Post-Double Selection, spaceIV, spaceTSIV
MR_robust	5	IVW, MR-Egger, Weighted Median, Mode-Based MR, MR-PRESSO

SplitUP is the only family consistent across all regimes: finite-dimensional instruments, high-dimensional instruments, paired data, unpaired data, dense effects, and sparse effects.
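As a quick consistency check on the family table, the members can be transcribed into a dictionary and counted. The dictionary below is copied from the table; it is not loaded from `causal_registry.py`, whose internal structure is not documented here.

```python
# Family members transcribed from the family table above (not read from causal_registry.py).
FAMILIES = {
    "classical_IV": ["Naive OLS", "2SLS", "LIML", "Fuller-k", "JIVE", "RJIVE", "SS-IV"],
    "two_sample_IV": ["TS-IV", "TS-2SLS"],
    "unpaired_GMM": ["UP-GMM", "UP-GMM l1"],
    "splitUP": ["SplitUP (dense)", "SplitUP l1", "SplitUP (analytic)"],
    "sparse_regularized": [
        "l1-Reg 2SLS", "Lasso-GMM", "GMM-Lasso", "FGMM",
        "Desparsified GMM", "Post-Double Selection", "spaceIV", "spaceTSIV",
    ],
    "MR_robust": ["IVW", "MR-Egger", "Weighted Median", "Mode-Based MR", "MR-PRESSO"],
}

counts = {family: len(members) for family, members in FAMILIES.items()}
total = sum(counts.values())
print(len(FAMILIES), "families,", total, "estimators")  # 6 families, 27 estimators
```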

Naming Conventions

The code uses stable ASCII estimator identifiers in JSON and Python APIs:

SplitUP_L1 is rendered in prose as SplitUP l1 and corresponds to the older note's SplitUP ℓ1
UP_GMM_L1 is rendered in prose as UP-GMM l1 and corresponds to UP-GMM ℓ1
L1_Reg_2SLS is rendered in prose as l1-Reg 2SLS and corresponds to ℓ1-Reg 2SLS

If you are comparing old figures or notes against the current repo, match on the code identifier first, then on the human-readable label.
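The "identifier first, label second" rule can be expressed as a small lookup table. The three mappings come from the list above; `LABELS` and `to_identifier` are hypothetical names for illustration, not functions exported by the package.

```python
# code identifier -> (prose label, label used in the older baseline note)
LABELS = {
    "SplitUP_L1": ("SplitUP l1", "SplitUP ℓ1"),
    "UP_GMM_L1": ("UP-GMM l1", "UP-GMM ℓ1"),
    "L1_Reg_2SLS": ("l1-Reg 2SLS", "ℓ1-Reg 2SLS"),
}


def to_identifier(label):
    """Resolve a code identifier, prose label, or old-note label to the code identifier."""
    if label in LABELS:
        return label
    for identifier, aliases in LABELS.items():
        if label in aliases:
            return identifier
    raise KeyError(f"unknown estimator label: {label!r}")


if __name__ == "__main__":
    print(to_identifier("SplitUP ℓ1"))   # SplitUP_L1
    print(to_identifier("l1-Reg 2SLS"))  # L1_Reg_2SLS
```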

Source Baseline

The current estimator inventory and causal framing are grounded in:

Schur, F. et al. (2026). Many Experiments, Few Repetitions, Unpaired Data, and Sparse Effects: Is Causal Inference Possible? arXiv:2601.15254
Pajo, P. (2026). Finite-Sample Performance of SplitUP in Many-Environments Unpaired IV
Pajo, P. (2026). Capability Cartography Layer 2
Pajo, P. (2026). Sutskever 30 Beyond NumPy
Pajo, P. (2026). Sutskever 30 Implementations

Causal Atlas Pathology Labels

Layer 2's failure atlas uses symptom labels: collapse, generalization_risk, stable_reasoning.

Layer 3's causal atlas uses pathology labels:

Label	Meaning	Typical Papers
`stable_identification`	Mechanism → capability causal effect is identifiable by most estimators	Architecture papers (P02-P08, P10-P18, P20-P22, P26-P27)
`unpaired_bias`	Causal effect is attenuated because data is unpaired; most paired-IV estimators inapplicable	Retrieval papers (P28-P30)
`weak_instrument`	Instruments are too weak for reliable identification despite many applicable estimators	—
`sparse_identification_failure`	Sparse effects but restricted eigenvalue condition not met	—
`insufficient_environments`	Too few environments/instruments for high-dim methods	Theory papers (P01, P19, P23-P25)
`exclusion_violation_risk`	High average bias across estimators suggests exclusion restriction may be violated	—
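Method Box 5 says pathology labels are predicted in normalized feature space. The sketch below illustrates why that matters, using a nearest-centroid classifier on z-scored features. The classifier choice, the two features (share of applicable estimators, number of environments), and the training rows are hypothetical stand-ins; the repository's actual predictor and features may differ.

```python
from statistics import mean, pstdev


def column_stats(rows, normalize=True):
    """Per-feature (center, scale); identity scaling when normalize is False."""
    if not normalize:
        return [(0.0, 1.0)] * len(rows[0])
    # A constant column has pstdev 0.0; fall back to scale 1.0 to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def predict_label(train_rows, train_labels, query, normalize=True):
    """Nearest-centroid prediction of a pathology label."""
    stats = column_stats(train_rows, normalize)

    def scale(row):
        return [(x - m) / s for x, (m, s) in zip(row, stats)]

    scaled = [scale(r) for r in train_rows]
    q = scale(query)
    best_label, best_dist = None, float("inf")
    for label in dict.fromkeys(train_labels):  # unique labels, first-seen order
        members = [r for r, lab in zip(scaled, train_labels) if lab == label]
        centroid = [mean(col) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(q, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical features: [share of applicable estimators, number of environments].
TRAIN = [[0.90, 40], [0.85, 60], [0.20, 50], [0.25, 45], [0.60, 3], [0.55, 5]]
TRAIN_LABELS = (["stable_identification"] * 2 + ["unpaired_bias"] * 2
                + ["insufficient_environments"] * 2)

if __name__ == "__main__":
    query = [0.22, 60]  # few applicable estimators, many environments
    print(predict_label(TRAIN, TRAIN_LABELS, query, normalize=False))  # stable_identification
    print(predict_label(TRAIN, TRAIN_LABELS, query))                   # unpaired_bias
```

Without normalization the environment count, which is measured in tens, swamps the applicability share, which lies in [0, 1], and the query is mislabeled; after z-scoring, the low applicability share correctly points to `unpaired_bias`.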

Middle-Regime Theory

The middle_regime.py module implements the Schur et al. (2026) mathematical framework:

Regime classification: classify_regime(m, r, d, s_star) → human-readable label
Bias computation: measurement_error_bias(Q, r_tilde, b) → attenuation factor Q/(Q+r̃b)
Boundary detection: identifies the m-value where TS-IV transitions from consistent to biased
Profile generation: per-paper profiles with splitup_needed flag
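The attenuation factor is simple enough to write out directly, and boundary detection can be illustrated as a scan for the first m at which attenuation exceeds a tolerance. This is a sketch only: `measurement_error_bias` mirrors the documented signature and formula Q/(Q + r̃b), but `find_boundary`, the tolerance criterion, and the toy curve in which r̃ grows linearly with m are hypothetical stand-ins for whatever relationship `middle_regime.py` actually uses. `classify_regime` is not sketched because its regime thresholds are not documented here.

```python
def measurement_error_bias(Q, r_tilde, b):
    """TS-IV attenuation factor Q / (Q + r_tilde * b); 1.0 means no attenuation."""
    return Q / (Q + r_tilde * b)


def find_boundary(attenuation_at, m_values, tol=0.05):
    """First m whose attenuation shortfall (1 - factor) exceeds tol, or None.

    attenuation_at is a caller-supplied function mapping m to an attenuation factor.
    """
    for m in sorted(m_values):
        if 1.0 - attenuation_at(m) > tol:
            return m
    return None


def toy_curve(m):
    # Hypothetical toy: r_tilde grows linearly with m while Q and b stay fixed.
    return measurement_error_bias(Q=1.0, r_tilde=m / 1000, b=1.0)


if __name__ == "__main__":
    print(measurement_error_bias(Q=1.0, r_tilde=0.5, b=2.0))  # 0.5
    print(find_boundary(toy_curve, [10, 20, 50, 100, 200]))   # 100
```

With Q = 1, r̃ = 0.5, and b = 2 the factor is 1/(1 + 1) = 0.5, so the TS-IV estimate would be halved; SplitUP is the estimator intended to remove exactly this bias.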

Transfer Diagnostics

Current analysis: 6/11 findings are scale-invariant, 5/11 are scale-dependent.

Scale-invariant (expected to transfer to GPT-4-scale models):

SplitUP removes measurement-error bias (mathematical theorem)
Retrieval papers have fewer applicable estimators (structural property)
TS-IV bias formula Q/(Q+r̃b) (asymptotic result)
Estimator taxonomy (theoretical properties)
Theory papers remain CONDITIONAL (uncomputability)
Unpaired-bias pathology for retrieval (data structure property)

Scale-dependent (do NOT transfer without re-measurement):

retrieval_dependence coefficient = -0.0303
Onset thresholds (scale=32, data=32768)
Failure atlas collapse counts
CCL2 measured law R²
Task family coefficient magnitude
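A consumer of the transfer diagnostics mainly needs to separate findings that can be carried to larger models from those that must be re-measured. The dictionary below is transcribed from the two lists above rather than read from `transfer_diagnostics.json`, whose schema (and the fields of the `TransferDiagnostic` schema) is not documented here, so treat the shape as hypothetical.

```python
from collections import Counter

# Findings transcribed from the two lists above; True means flagged scale-invariant.
FINDINGS = {
    "SplitUP removes measurement-error bias": True,
    "Retrieval papers have fewer applicable estimators": True,
    "TS-IV bias formula Q/(Q+r_tilde*b)": True,
    "Estimator taxonomy": True,
    "Theory papers remain CONDITIONAL": True,
    "Unpaired-bias pathology for retrieval": True,
    "retrieval_dependence coefficient = -0.0303": False,
    "Onset thresholds (scale=32, data=32768)": False,
    "Failure atlas collapse counts": False,
    "CCL2 measured law R^2": False,
    "Task family coefficient magnitude": False,
}

tally = Counter("invariant" if flag else "dependent" for flag in FINDINGS.values())
needs_remeasurement = [name for name, flag in FINDINGS.items() if not flag]

print(f"{tally['invariant']}/{len(FINDINGS)} scale-invariant, "
      f"{tally['dependent']}/{len(FINDINGS)} scale-dependent")
# 6/11 scale-invariant, 5/11 scale-dependent
```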

Installation

git clone https://github.com/pageman/Capability-Cartography-Layer-3
cd Capability-Cartography-Layer-3
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Optional companion repositories

export SUTSKEVER30_ROOT=/path/to/sutskever-30-implementations
export GPT1_WIND_TUNNEL_ROOT=/path/to/gpt1-from-sutskever30
export SUTSKEVER_AGENT_ROOT=/path/to/Sutskever-Agent/sutskever-agent
export BEYOND_NUMPY_ROOT=/path/to/sutskever-30-beyond-numpy
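The preflight check that follows inspects only the substrate and wind-tunnel adapters. A quick way to see whether all four companion roots are set and point at real directories is sketched below; it reads only the environment variable names listed above, and `companion_status` is a hypothetical helper rather than part of the package.

```python
import os
from pathlib import Path

COMPANION_ROOTS = [
    "SUTSKEVER30_ROOT",
    "GPT1_WIND_TUNNEL_ROOT",
    "SUTSKEVER_AGENT_ROOT",
    "BEYOND_NUMPY_ROOT",
]


def companion_status(environ=None):
    """Report 'unset', 'missing', or 'ok' for each optional companion root."""
    environ = os.environ if environ is None else environ
    status = {}
    for var in COMPANION_ROOTS:
        value = environ.get(var)
        if not value:
            status[var] = "unset"
        else:
            status[var] = "ok" if Path(value).is_dir() else "missing"
    return status


if __name__ == "__main__":
    for var, state in companion_status().items():
        print(f"{var}: {state}")
```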

Preflight check

python3 - <<'PY'
from capability_cartography.adapters import GPT1WindTunnelAdapter, NotebookSubstrateAdapter

for label, diagnostic in [
    ("substrate", NotebookSubstrateAdapter().diagnostic_summary("22_scaling_laws")),
    ("wind_tunnel", GPT1WindTunnelAdapter().diagnostic_summary()),
]:
    print(f"{label}_available:", diagnostic["available"])
    print("configured_root:", diagnostic["configured_root"])
    expected_target = diagnostic.get("notebook_path") or diagnostic.get("implementation_path")
    print("expected_target:", expected_target)
    print("searched_candidates:")
    for path in diagnostic["searched_candidates"]:
        print(" -", path)
    print()
PY

If the GPT-1 repo is missing, measured runs now fall back to a synthetic-backed path instead of crashing. The exported measured artifacts set measured_mode: false and include fallback_reason plus wind_tunnel_diagnostics. To keep strict fail-fast behavior, call the measured runner with allow_fallback=False.

If the substrate repo is missing, notebook execution now emits a structured fallback report instead of failing obscurely. The notebook report includes executed: false, fallback_reason, and substrate_diagnostics.
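Because fallback runs still produce artifacts, downstream scripts should check whether they are reading measured or synthetic-backed results. The sketch assumes, without confirmation, that `measured_mode` and `fallback_reason` sit at the top level of `measured_summary.json`, and that `executed` and `fallback_reason` sit at the top level of the notebook execution report; `fallback_reason` here is a hypothetical helper name that happens to match the documented key.

```python
import json
from pathlib import Path


def fallback_reason(report_path, flag):
    """Return the report's fallback_reason if `flag` is false, otherwise None.

    A report that lacks `flag` entirely is treated as a real (non-fallback) run.
    """
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    if report.get(flag, True):
        return None
    return report.get("fallback_reason", "unspecified")


# Assumed locations and flag names, taken from the artifact tree and the notes above.
CHECKS = [
    ("artifacts/layer3/measured/measured_summary.json", "measured_mode"),
    ("artifacts/layer3/notebooks/22_scaling_laws.execution.json", "executed"),
]

if __name__ == "__main__":
    for path, flag in CHECKS:
        if Path(path).exists():
            reason = fallback_reason(path, flag)
            print(path, "->", "real run" if reason is None else f"fallback: {reason}")
```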

Quick Start

# Run tests
python3 -m unittest discover -s tests -p 'test_*.py'

# Run demo
python3 -m capability_cartography.demo

Layer 3 Artifact Tree

artifacts/layer3/
├── causal/
│   ├── causal_atlas.json
│   ├── causal_records.json
│   ├── estimator_heatmap.png
│   ├── estimator_sweep_summary.json
│   ├── middle_regime_summary.json
│   ├── regime_map.png
│   ├── transfer_diagnostics.json
│   └── verdict_dashboard.png
├── failure_atlas/
│   └── failure_atlas.json
├── measured/
│   ├── measured_laws.json
│   ├── measured_records.csv
│   └── measured_summary.json
├── notebooks/
│   ├── 22_scaling_laws.execution.json
│   └── 22_scaling_laws_figures/
├── plots/
│   ├── onset_surface.png
│   └── phase_regions.png
├── sweeps/
│   ├── sweep_records.csv
│   └── sweep_summary.json
└── agent/
    ├── agent_brief.json
    └── agent_workflow.yaml
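After running the demo, a short completeness check against the tree can catch a step that silently failed to export. The relative paths are transcribed from the tree; `missing_artifacts` is a hypothetical helper, and `exists()` is used so that the `22_scaling_laws_figures/` directory counts as present.

```python
from pathlib import Path

# Relative paths transcribed from the Layer 3 artifact tree.
EXPECTED = [
    "causal/causal_atlas.json",
    "causal/causal_records.json",
    "causal/estimator_heatmap.png",
    "causal/estimator_sweep_summary.json",
    "causal/middle_regime_summary.json",
    "causal/regime_map.png",
    "causal/transfer_diagnostics.json",
    "causal/verdict_dashboard.png",
    "failure_atlas/failure_atlas.json",
    "measured/measured_laws.json",
    "measured/measured_records.csv",
    "measured/measured_summary.json",
    "notebooks/22_scaling_laws.execution.json",
    "notebooks/22_scaling_laws_figures",
    "plots/onset_surface.png",
    "plots/phase_regions.png",
    "sweeps/sweep_records.csv",
    "sweeps/sweep_summary.json",
    "agent/agent_brief.json",
    "agent/agent_workflow.yaml",
]


def missing_artifacts(root="artifacts/layer3"):
    """Return the expected artifact paths (files or directories) not found under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifacts()
    print("all artifacts present" if not missing else "\n".join(missing))
```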

Citation

@misc{capability-cartography-layer-3-2026,
  author    = {Paul "The Pageman" Pajo, pageman@gmail.com},
  title     = {Capability-Cartography-Layer-3: from classification to causal explanation},
  year      = {2026},
  url       = {https://github.com/pageman/Capability-Cartography-Layer-3},
  note      = {Extends Layer 2 with causal estimator sweeps, identification-aware failure atlases,
               Schur et al. middle-regime analysis, and scale-transfer diagnostics.}
}

License

This repository is released under the MIT License. See LICENSE.
