ClayKa/machine-discoverable-concepts

Machine-Discoverable Concepts Reference Code

This repository contains the model-agnostic reference pieces for the paper "Inductive Biases for Machine-Discoverable Concepts":

  • Discoverability Composite Score (DCS) normalization and retention rules.
  • Locality/stability, compositional workspace, and redundancy-reduction losses.
  • A lightweight auxiliary concept interface over frozen hidden states.
  • Metadata-free test-time candidate selection.
  • A CSV CLI for scoring candidate-level measurements.

It is not a full reproduction package for the paper's Qwen3-VL, Phi, and Gemma experiments. Full reproduction also requires backbone-specific hidden-state collection, hook registration, SAE/readout candidate extraction, task datasets, candidate refresh schedules, and intervention evaluation loops.

Install

```shell
python -m pip install -e .
```

or install the direct requirements:

```shell
python -m pip install -r requirements.txt
```

Quick Check

```shell
python scripts/evaluate_dcs.py \
  --input examples/candidate_metrics.csv \
  --output outputs/scored_candidates.csv \
  --summary outputs/dcs_summary.csv \
  --all-summary outputs/dcs_summary_all.csv \
  --json-summary outputs/dcs_summary.json \
  --group-by model regime
```

By default, --summary reports DCS over the retained concept set C*, matching the regime-level DCS definition in the paper. Use --summary-scope all for diagnostic summaries over every candidate.

Candidate Metrics Schema

The scoring CLI expects one row per candidate concept or candidate set. Required columns are:

| Column | Meaning |
| --- | --- |
| stability_match | Seed/split/prompt recovery match rate. |
| active_family_count | Number of task families where the candidate is active. |
| target_gain_pp | Target behavior gain in percentage points. |
| off_target_drift_pp | Off-target behavior drift in percentage points. |
| sufficiency_gain_pp | Expected-direction behavior change in percentage points. |
| pairwise_synergy_pp | Pairwise composition synergy in percentage points. |

Optional grouping columns such as model, regime, task_family, and candidate_type are preserved and can be passed to --group-by.
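
As a rough illustration of the schema above, the following sketch writes a small candidate table with the six required columns plus two optional grouping columns, then checks a header for missing required columns. The row values, the `model_a`/`baseline` labels, and the helper names `write_candidates` and `missing_columns` are hypothetical and are not part of this repository; only the column names come from the schema.

```python
import csv
import io

# Column names taken from the schema above.
REQUIRED_COLUMNS = [
    "stability_match",
    "active_family_count",
    "target_gain_pp",
    "off_target_drift_pp",
    "sufficiency_gain_pp",
    "pairwise_synergy_pp",
]
# Optional grouping columns; any subset may be present.
GROUP_COLUMNS = ["model", "regime"]

# Illustrative values only; these are not measurements from the paper.
rows = [
    {
        "model": "model_a",
        "regime": "baseline",
        "stability_match": 0.61,
        "active_family_count": 3,
        "target_gain_pp": 4.2,
        "off_target_drift_pp": 1.1,
        "sufficiency_gain_pp": 5.0,
        "pairwise_synergy_pp": 1.3,
    },
]


def write_candidates(rows, handle):
    """Write one row per candidate with grouping and required columns."""
    writer = csv.DictWriter(handle, fieldnames=GROUP_COLUMNS + REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def missing_columns(header):
    """Return the required columns that are absent from a CSV header."""
    return [name for name in REQUIRED_COLUMNS if name not in header]


buffer = io.StringIO()
write_candidates(rows, buffer)
header = buffer.getvalue().splitlines()[0].split(",")
print(missing_columns(header))
```

Writing to a real file instead of `io.StringIO` (opened with `newline=""`, as the `csv` module recommends) would produce an input suitable for `--input`, assuming the CLI reads a plain comma-separated file with a header row.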

Main Modules

  • discoverability.dcs: DCS component normalization, retention filtering, and retained/all-candidate summaries.
  • discoverability.interface: concept readouts, normalized intervention directions, slot assignments, matched-norm interventions, and simple grouping.
  • discoverability.regime_losses: objective terms for the auxiliary interface.
  • discoverability.policy: metadata-free selection from retained candidates.

Tests

```shell
python -m unittest discover -s tests
```

The tests cover the scoring contract, retained-only summaries, policy behavior, loss functions, and intervention interface shapes.

Paper Alignment

The default DCS configuration mirrors Appendix Table A14:

  • stability range [0.30, 0.82], retention threshold >= 0.50
  • reuse range [1, 4], retention threshold >= 2 task families
  • locality range [0.40, 0.86], retention thresholds target_gain >= 2.0pp and off_target_drift <= 5.0pp
  • sufficiency range [1.0, 12.0], retention threshold >= 2.0pp
  • compositionality range [-0.5, 5.5], retention threshold >= 0.5pp

Validation-selected ranges and thresholds should be fixed before test evaluation. Test metadata should not be used by the selection policy.
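
The default ranges and thresholds listed above can be read as a clipped min-max normalization plus a retention filter. The sketch below assumes that reading: each raw metric is mapped linearly onto [0, 1] using its range and clipped, and a candidate is retained only if it passes every threshold. Both the clipping and the conjunctive rule are assumptions, and `normalize` and `is_retained` are hypothetical names rather than functions from `discoverability.dcs`. Locality is left out of the normalization because the list above does not say how its raw score is derived from `target_gain_pp` and `off_target_drift_pp`.

```python
# Ranges copied from the default configuration listed above.
RANGES = {
    "stability_match": (0.30, 0.82),
    "active_family_count": (1.0, 4.0),
    "sufficiency_gain_pp": (1.0, 12.0),
    "pairwise_synergy_pp": (-0.5, 5.5),
}


def normalize(value, low, high):
    """Map value linearly from [low, high] to [0, 1], clipping outside it."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def is_retained(candidate):
    """Apply every retention threshold; all must hold (assumed conjunctive)."""
    return (
        candidate["stability_match"] >= 0.50
        and candidate["active_family_count"] >= 2
        and candidate["target_gain_pp"] >= 2.0
        and candidate["off_target_drift_pp"] <= 5.0
        and candidate["sufficiency_gain_pp"] >= 2.0
        and candidate["pairwise_synergy_pp"] >= 0.5
    )


# Illustrative candidate; the values are not results from the paper.
candidate = {
    "stability_match": 0.56,
    "active_family_count": 3,
    "target_gain_pp": 4.0,
    "off_target_drift_pp": 1.5,
    "sufficiency_gain_pp": 6.5,
    "pairwise_synergy_pp": 2.5,
}

components = {
    name: normalize(candidate[name], low, high)
    for name, (low, high) in RANGES.items()
}
print(is_retained(candidate), components)
```

Keeping the ranges in a single dictionary makes it straightforward to freeze validation-selected values before test evaluation, in line with the note above.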
