| Field | Value |
|---|---|
| Platform | |
| Platform version | |
| Check list version | |
| Date tested | |
| Model used | |
| Tester | |
Some behaviors may vary by model within the same platform harness. For example, a platform that lets the user choose between Claude Sonnet and Claude Opus may produce different skill-loading behavior depending on which model is active, because the model itself makes activation decisions. If you test with multiple models, either create separate files for each model or note model-specific differences inline.
When recording results, try to distinguish between:
- Platform-level behavior: Enforced by the harness (deterministic). Example: the platform strips frontmatter before passing content to the model. This won't vary by model or across runs.
- Model-level behavior: Determined by the model's interpretation of instructions (probabilistic). Example: the model decides whether to follow a markdown link and read the referenced file. This may vary by model, prompt language, or even across runs with the same model.
This distinction matters because platform-level behaviors are stable and predictable, while model-level behaviors may need multiple test runs to characterize and may change when the user switches models.
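The stable-vs-varied distinction can be checked mechanically by repeating a probe and tallying outcomes. A minimal sketch (the helper and its labels are my own, not part of any platform's tooling):

```python
# Classify repeated runs of a check as stable (consistent with platform-level
# behavior) or varied (consistent with model-level behavior).
# The labels and threshold logic are illustrative assumptions.
from collections import Counter

def classify_runs(outcomes):
    """outcomes: one entry per run, e.g. True if the canary phrase was visible."""
    distinct = Counter(outcomes)
    if len(distinct) == 1:
        return "stable (consistent with platform-level behavior)"
    return "varied (consistent with model-level behavior; run more trials)"

print(classify_runs([True, True, True]))   # stable (...)
print(classify_runs([True, False, True]))  # varied (...)
```

A handful of runs can't prove determinism, so treat "stable" as provisional until you've repeated the check across models.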
For each check, record what you observed. Use one of these statuses:
- Observed: You tested this and have a finding.
- Inconclusive: You tested this but the result was ambiguous or inconsistent across runs.
- Not tested: You haven't tested this check yet.
Each check includes a Fallback behavior field. This captures what happens when the platform's default behavior doesn't surface content to the model. I hypothesize there may be three patterns:
- Agent self-recovers: The agent independently realizes it needs the content and uses a file-read tool or other mechanism to access it without user intervention. This may be hard to distinguish from platform-level loading; note whether you observed the agent making an explicit tool call to read the file.
- User prompt required: The user must explicitly instruct the agent (e.g., "read the file at references/api-overview.md") to access the content. The content is accessible but only with manual intervention.
- No fallback: The content is truly inaccessible through any mechanism. The platform blocks access or the agent has no tool capable of reaching it.
This matters for skill authors writing portable skills. If a skill's references are invisible on a platform due to a closed directory set, but the user can work around it with an explicit prompt, the skill author can include a note like "If your agent doesn't automatically load the reference files, ask it to read references/api-overview.md." If there's no fallback, the skill simply doesn't work on that platform.
- Benchmark skill: probe-loading — Install the skill, start a new session, and ask the model "Do you know the phrase CARDINAL-ZEBRA-7742?" WITHOUT activating the skill. If the model knows it, the platform loaded the full body at discovery time.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-loading — Activate the skill and check steps 3-4 of its instructions. If the model already has the contents of files in references/, scripts/, or assets/ without reading them, the platform loaded them at activation. Look for canary phrases: PELICAN-MANGO-3391, FALCON-QUARTZ-8819, OSPREY-COBALT-5567, HERON-AMBER-2204, CRANE-TOPAZ-6638.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
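Checking which canary phrases surfaced can be done by searching a saved session transcript. A hypothetical helper, assuming the transcript is available as plain text (nothing here is platform API; the phrase list comes from the checks above):

```python
# Scan a transcript for the benchmark canary phrases to see which bundled
# files the platform surfaced to the model.
CANARIES = [
    "PELICAN-MANGO-3391",
    "FALCON-QUARTZ-8819",
    "OSPREY-COBALT-5567",
    "HERON-AMBER-2204",
    "CRANE-TOPAZ-6638",
]

def canaries_seen(transcript, phrases=CANARIES):
    """Return the canary phrases that appear verbatim in the transcript."""
    return sorted(p for p in phrases if p in transcript)

transcript = "The file mentions FALCON-QUARTZ-8819 and nothing else."
print(canaries_seen(transcript))  # ['FALCON-QUARTZ-8819']
```

A phrase in the transcript only shows the content was surfaced somehow; distinguishing pre-loading from an explicit file read still requires inspecting tool calls.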
- Benchmark skill: probe-linked-resources — Activate the skill and check whether the model already has the contents of the linked files (PARROT-SILVER-4412, TOUCAN-BRONZE-9931) without reading them. Also check whether the unlinked file (EAGLE-COPPER-1178) was loaded, which distinguishes link-based pre-fetching from bulk directory loading.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-loading — Activate the skill and check step 3. The skill has all three spec directories (scripts/, references/, assets/). Does the platform enumerate all of them?
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-nonstandard-dirs — Activate the skill and check whether resources/ is treated the same way references/ would be. Is the canary phrase SWIFT-OPAL-8156 visible or enumerated?
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-nonstandard-dirs — Activate the skill and check which of the nonstandard directories (evals/, templates/, resources/) the model is aware of. Look for canary phrases: ROBIN-JADE-3847, WREN-PEARL-6293, SWIFT-OPAL-8156.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-loading — Activate the skill. The references/ directory has 3 files (2 linked from SKILL.md, 1 unreferenced). Check whether all 3 are enumerated, only the linked ones, or none. The unreferenced file's canary phrase is OSPREY-COBALT-5567.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
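The linked-vs-unreferenced split above can be reproduced locally to sanity-check what an enumeration probe should find. A sketch of that layout; the file names other than SKILL.md are illustrative, and only OSPREY-COBALT-5567 is taken from the check itself:

```python
# Build the references/ layout described above: two files linked from
# SKILL.md plus one unreferenced file carrying a canary phrase.
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp()) / "probe-loading"
refs = root / "references"
refs.mkdir(parents=True)

(refs / "linked-one.md").write_text("Linked file one.\n")
(refs / "linked-two.md").write_text("Linked file two.\n")
(refs / "unreferenced.md").write_text("Canary: OSPREY-COBALT-5567\n")
(root / "SKILL.md").write_text(
    "See [one](references/linked-one.md) and [two](references/linked-two.md).\n"
)

# The enumeration question: does the platform surface all three files,
# only the two linked ones, or none?
print(sorted(p.name for p in refs.iterdir()))
```

Comparing the platform's enumeration against this ground truth tells you whether it walks the directory or only follows links.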
- Benchmark skill: probe-linked-resources — Activate the skill and have the model try to read files using the relative paths in SKILL.md. Note which directory the paths resolve against.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
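The resolution question can be framed as: which candidate base directory makes the relative path point at a real file? A sketch with a fabricated layout (directory names are illustrative):

```python
# Given a relative path as written in SKILL.md, report which candidate base
# directories make it resolve to an existing file.
import pathlib
import tempfile

base = pathlib.Path(tempfile.mkdtemp())
skill_dir = base / "skills" / "probe-linked-resources"
(skill_dir / "references").mkdir(parents=True)
(skill_dir / "references" / "doc.md").write_text("hello\n")

rel = "references/doc.md"
candidates = {"skill dir": skill_dir, "workspace root": base}
resolved = [name for name, d in candidates.items() if (d / rel).is_file()]
print(resolved)  # ['skill dir']
```

If the platform's read tool resolves against the workspace root instead of the skill directory, relative links in SKILL.md will miss, which is exactly the failure mode this check is probing.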
- Benchmark skills: probe-shadow-alpha + probe-shadow-beta — Activate both skills. Have each one read references/API.md. Check which canary phrase appears: STORK-CORAL-4471 (alpha) or EGRET-SLATE-8823 (beta).
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-traversal — Activate the skill and follow its instructions to attempt reads outside the skill directory (../probe-loading/SKILL.md, ../README.md, ../../loading-behavior.md).
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skills: probe-loading, probe-compatibility — Activate each skill and check whether the model can see frontmatter fields (allowed-tools, compatibility, metadata). probe-loading has multiple optional fields; probe-compatibility has a compatibility field with meaningful requirements text.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-metadata-values — Activate the skill. If it loads, the platform didn't reject the edge-case metadata values. Check steps 2-3 to see which values the model received and whether any keys were dropped. Look for canary phrase THRUSH-FLINT-8294 to confirm the body loaded.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-loading — Activate the skill and check step 2. Ask the model to describe how the skill content was presented to it (raw markdown, XML tags, JSON, etc.).
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-loading — Activate the skill, have a conversation, then activate it again. Ask the model whether it sees the skill instructions twice in its context.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-loading — Activate the skill, then edit the SKILL.md to change the canary phrase from CARDINAL-ZEBRA-7742 to something else. Activate the skill again in the same session and ask for the canary phrase.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
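The mid-session edit can be done by hand, but a small helper keeps it repeatable across runs. A sketch; the replacement phrase is made up, and any distinct string works:

```python
# Swap the canary phrase in a SKILL.md body so a re-activation in the same
# session reveals whether the platform re-reads the file or serves a cached copy.
def swap_canary(text, old="CARDINAL-ZEBRA-7742", new="CARDINAL-ZEBRA-0000"):
    """Return text with the canary replaced; fail loudly if it's absent."""
    if old not in text:
        raise ValueError("canary not found; is this the right SKILL.md?")
    return text.replace(old, new)

body = "Canary phrase: CARDINAL-ZEBRA-7742"
print(swap_canary(body))  # Canary phrase: CARDINAL-ZEBRA-0000
```

If the model reports the old phrase after re-activation, the platform cached the body; if it reports the new one, activation re-reads from disk.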
- Benchmark skill: probe-loading — Activate the skill, then have a long conversation (enough to trigger context compaction). Ask the model to recall the canary phrase CARDINAL-ZEBRA-7742 and the skill's specific instructions.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: Any benchmark skill — Install at project level in a freshly cloned or untrusted repository. Start a new session and check whether the skill appears in the available skills list, or if the platform prompts for trust approval.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-compatibility — Activate the skill and follow its instructions. The skill's compatibility field says "Designed for Claude Code (or similar products). Requires Python 3.14+ and network access." Test on a non-Claude platform to see how it handles the Claude-specific text.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-deep-nesting — Install the skill and check the available skills list. Does nested-skill appear as a separate skill? Its SKILL.md is at probe-deep-nesting/references/nested-skill/SKILL.md. Canary phrase: HAWK-ONYX-5534.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-deep-nesting — Activate the skill and follow its instructions to read files at 1 level (DOVE-GARNET-1029), 2 levels (LARK-RUBY-4483), and 3 levels (OWL-EMERALD-7756, FINCH-SAPPHIRE-2098) of nesting.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skills: invoke-alpha + invoke-beta — Activate invoke-alpha. Does it successfully activate invoke-beta? Look for canary phrase TERN-MOSS-6647 in the output.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skills: invoke-alpha + invoke-beta + invoke-gamma — Activate invoke-alpha and let the full chain run. Does it reach invoke-gamma? Look for canary phrase JAY-TEAL-9984. If the chain breaks, note which link failed.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skills: probe-circular-alpha + probe-circular-beta — Activate probe-circular-alpha. Does the platform detect the circular reference and stop, or does it loop? Count how many times each canary phrase appears (KITE-ONYX-2251, WREN-SLATE-7738).
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
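Counting the circular-probe canaries in a transcript gives a rough loop measure. A sketch, again assuming a plain-text transcript (the helper is not platform tooling):

```python
# Count occurrences of each circular-probe canary. A platform that detects
# the cycle and stops should yield small counts; a loop inflates them.
def canary_counts(transcript, phrases=("KITE-ONYX-2251", "WREN-SLATE-7738")):
    return {p: transcript.count(p) for p in phrases}

sample = "KITE-ONYX-2251 ... WREN-SLATE-7738 ... KITE-ONYX-2251"
print(canary_counts(sample))  # {'KITE-ONYX-2251': 2, 'WREN-SLATE-7738': 1}
```

Raw counts overstate activations when a phrase is echoed in summaries, so cross-check against the number of actual activation events if the platform exposes them.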
- Benchmark skills: invoke-alpha + invoke-beta + invoke-gamma — Run the invocation chain test in English, then repeat it in another language (e.g., Japanese: "呼び出しチェーンを開始してください"). Compare success rates across multiple runs.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skills: invoke-alpha + invoke-beta — Same test as cross-skill-invocation. The invoke chain uses prose instructions to express dependencies between skills.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-missing-dep — Activate the skill. It references nonexistent-formatter, which doesn't exist. Observe the failure mode: does the model report that the skill doesn't exist, silently skip the step, or attempt to fulfill the task from general knowledge?
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
- Benchmark skill: probe-nonstandard-fields — Activate the skill. It has requires: probe-loading and depends-on: [probe-shadow-alpha, probe-shadow-beta] in its frontmatter. Check whether the platform acted on these fields or ignored them.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior:
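One way to see what a platform might pass through or drop is to list the top-level frontmatter keys yourself before testing. A sketch using a naive line-based parse rather than a full YAML parser (the sample frontmatter mirrors the fields named in the check):

```python
# List top-level keys in a SKILL.md frontmatter block, so observed
# platform behavior can be compared against the keys actually present.
def frontmatter_keys(skill_md):
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if ":" in line and not line.startswith((" ", "\t", "-")):
            keys.append(line.split(":", 1)[0].strip())
    return keys

sample = """---
name: probe-nonstandard-fields
requires: probe-loading
depends-on: [probe-shadow-alpha, probe-shadow-beta]
---
body
"""
print(frontmatter_keys(sample))  # ['name', 'requires', 'depends-on']
```

If the model can only report a subset of these keys after activation, the platform is filtering frontmatter before it reaches the context.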
- Benchmark skills: probe-cross-scope + probe-loading — Install probe-cross-scope at project level and probe-loading at user level. Activate probe-cross-scope and see whether it can invoke probe-loading across scopes. Then remove probe-loading from user level and test again.
- Status: Not tested
- Observation:
- Evidence:
- Platform-level or model-level?:
- Fallback behavior: