This guide describes how to define new slash commands for the DeepExtractIDA
Agent Analysis Runtime. In an installed workspace, commands live under
.claude/commands/ inside a DeepExtractIDA_output_root, while extraction data
remains at workspace root in extracted_code/ and extracted_dbs/.
It covers the structural requirements for command files (sections 1-3) and the engineering practices for building reliable, consistent command behavior (sections 4-18).
Commands are defined as Markdown files in .claude/commands/:
<DeepExtractIDA_output_root>/
├── extracted_code/
├── extracted_dbs/
├── .claude/
│ ├── commands/
│ │ ├── registry.json # Machine-readable command contracts
│ │ ├── README.md # Command catalog
│ │ └── my-command.md # Command definition and workflow
│ ├── helpers/
│ └── tests/
└── hooks.json
In this source repository, those installed files live at repository-root
commands/, helpers/, and tests/. The guidance below uses the installed
workspace paths because that is how the runtime is consumed inside a real
DeepExtractIDA output root.
A command file should follow a standard structure to ensure consistency and clarity.
Use a clear, descriptive title.
Explain what the command does and provide usage examples.
# My Command
## Overview
Brief description of the command's purpose.
Usage:
- `/my-command <module> <target>`Provide a step-by-step guide for the agent to execute the command.
- Resolution: How to find the target module or function.
- Execution: Which skills or scripts to run, including specific CLI flags.
- Synthesis: How to combine results from multiple steps into a final report.
Describe what the final response to the user should look like.
Add an entry to .claude/commands/registry.json with:
purpose: Brief description of the commandfile: The.mdfilename (e.g.,my-command.md)skills_used: Array of skill names this command orchestratesagents_used: Array of agent names this command delegates to (empty array if none)parameters: Human-readable parameter pattern (e.g.,<module> <function>)grind_loop: Boolean --trueif the command creates a session-scoped scratchpadworkspace_protocol: Boolean --trueif the command creates a workspace run directory
The infrastructure test suite validates that every command .md file is
registered and every registered file exists on disk. skills_used and
agents_used are cross-checked against their respective registries.
If the command involves multiple steps or large data payloads, it must follow the Workspace Pattern:
- Create a run directory under
.claude/workspace/. - Pass
--workspace-dirand--workspace-stepto all skill scripts. - Use the run manifest to track progress.
Command definitions live under .claude/commands/, but their inputs and outputs
usually live outside .claude/:
- Inputs:
extracted_dbs/,extracted_code/,extraction_report.json - Runtime artifacts:
.claude/workspace/,.claude/cache/,.claude/hooks/ - Saved reports: typically
extracted_code/<module>/reports/
For commands that process multiple discrete items (e.g., a list of functions), use the Grind Loop Protocol:
- Create a session-scoped scratchpad at
.claude/hooks/scratchpads/{session_id}.md. - List items to be processed.
- The framework will automatically re-invoke the agent until all items are checked.
Every command should validate its arguments before invoking skill scripts. Use
the command_validation helper to catch bad input early with clear error
messages:
from helpers.command_validation import validate_command_args
result = validate_command_args("triage", {"module": "appinfo.dll"})
if not result.ok:
for err in result.errors:
print(err)
# Stop -- do not proceed to skill scriptsFor commands that accept a module and/or function argument, this resolves the
DB path and function ID upfront, making them available in result.resolved.
The _COMMAND_REQUIREMENTS dict in helpers/command_validation.py defines
what each command requires; new commands should add their entry there.
In the command .md file, add this as the first step:
### Step 0: Preflight Validation
Validate arguments using `helpers.command_validation.validate_command_args("<command_name>", {...})`.
If validation fails, report the errors to the user and stop. On success, use
`result.resolved["db_path"]` (and `result.resolved["function_id"]` if applicable)
for all subsequent skill script calls.Both the main agent and any subagents it spawns should always prefer existing skill scripts over ad-hoc inline logic. Skills encode tested workflows, use the helpers library correctly, and produce consistent output formats.
- Use skill scripts for any operation a skill covers. Don't write inline
Python to classify functions when
classify_function.pyexists. Don't hand-roll call graph traversal whenchain_analysis.pydoes it. - Subagents must follow skills too. When delegating to a subagent, instruct it to read the relevant SKILL.md and use the skill's scripts. Don't re-explain the workflow in the subagent prompt when a skill already documents it.
- Read the SKILL.md before invoking scripts. Skills define argument conventions, output formats, and edge-case handling. Skipping the skill and calling scripts directly risks missing flags, misinterpreting output, or hitting undocumented preconditions.
- Fall back to helpers only when no skill covers the operation. For one-off operations not covered by any skill, use the helpers library directly (see section 3.3). Never write raw SQL, manual path resolution, or ad-hoc API classification.
For how skills work and how to author new ones, see skill_authoring_guide.md.
Every new or modified command must be validated by the automated test suite.
Mandatory checks after any command change:
-
Run the full infrastructure suite:
cd <DeepExtractIDA_output_root>/.claude && python -m pytest tests/test_infrastructure_consistency.py -v. This validates thatregistry.jsonentries match.mdfiles on disk, thatskills_usedandagents_usedreference registered entities, and that the command README listings are consistent. -
Add integration tests when a command wires in new skills or agents. Create or update a test file in
.claude/tests/that verifies:- The command's
skills_usedinregistry.jsonlists the expected skills - The command's
.mdfile references the skill scripts by name - Negative checks: unrelated commands do not gain the new dependency
- The command's
-
Run the full test suite before considering the change complete:
cd <DeepExtractIDA_output_root>/.claude && python -m pytest tests/ -v. A green suite across all test files is the minimum bar.
Test file convention: Integration tests that validate how a skill is wired across multiple commands/agents belong in test_<skill>_integration.py. Command-specific functional tests belong in test_<command_name>.py.
The following sections (4-18) are generic engineering practices for building reliable, consistent, and maintainable slash commands. They apply to any multi-step command in the runtime.
The helpers package lives at .claude/helpers/ and is not installed as a system package.
Any Python code that does from helpers.X import Y must be run with .claude/ as the working
directory (or on sys.path).
Correct pattern for inline Python snippets:
cd <DeepExtractIDA_output_root>/.claude && python -c "
from helpers.validation import validate_workspace_data
status = validate_workspace_data('..')
...
"Correct pattern for skill scripts (they manage their own path setup, so run from workspace root):
python .claude/skills/<skill>/scripts/<script>.py --db extracted_dbs/foo.dbRunning inline helpers.* imports from the workspace root will always produce
ModuleNotFoundError: No module named 'helpers'.
When running from .claude/, the DeepExtractIDA output root is one level up:
...
Pass .. to any helper that accepts a workspace root parameter (e.g., validate_workspace_data('..')).
When commands need inline logic, always use the .claude/helpers/ library. Never
write raw SQLite queries, hand-parse function names, roll custom path resolution,
or build ad-hoc API classification. Common violations:
- Raw
SELECT * FROM functionsinstead ofopen_individual_analysis_db()+resolve_function() - Manual string splitting on function names instead of
parse_class_from_mangled() - Custom path joining instead of
resolve_db_path()/resolve_tracking_db()
Using helpers ensures consistency across commands, prevents subtle bugs from divergent implementations, and lets every command benefit when a helper improves.
Create a run directory under .claude/workspace/ for any command that invokes two
or more skill scripts. Write full payloads to disk; pass only compact summaries
through coordinator context.
.claude/workspace/<module>_<goal>_<timestamp>/
├── manifest.json
├── step_a/
│ ├── results.json # full payload
│ └── summary.json # compact summary
└── step_b/
├── results.json
└── summary.json
Why: Large JSON payloads in context degrade agent reasoning quality and hit token limits. Filesystem handoff keeps context lean while preserving full data for on-demand access.
Use manifest.json to track which steps have completed, which failed, and
where their outputs live. Never rely on implicit state or agent memory to
determine pipeline progress.
Always invoke skill scripts with --workspace-dir <run_dir> and
--workspace-step <step_name> so each step knows where to write and how
to label itself in the manifest.
Read results.json only when you need the data for synthesis, ranking, or
a targeted follow-up. Default to reading summary.json for decision-making.
Once a function, module, or entity is resolved, use its unique identifier
(e.g., --id <function_id>) in all subsequent invocations. IDs are unambiguous
and avoid re-resolution edge cases like overloaded names or partial matches.
- Quick lookup first (cheapest, most common case)
- Cross-dimensional search second (when the term might match strings, APIs, or classes)
- Skill-based fallback third (when the above miss)
Don't jump to expensive search when a simple index lookup suffices.
Check whether a resolved entity is library boilerplate (WIL, CRT, STL, WRL, ETW). Surface this to the user or adjust priority accordingly -- library code is almost always lower-priority than application code for analysis.
If a command requires a target (function, module, class) and the user didn't provide one, ask explicitly. Don't guess or pick a default silently.
Don't just check that arguments exist -- validate their format and consistency before invoking skill scripts. Catch bad input at the command level rather than letting it propagate into cryptic script errors downstream.
- Verify IDs are numeric with
validate_function_id() - Verify paths resolve with
resolve_db_path() - Check mutually exclusive options aren't both set
- Reject out-of-range values (e.g., negative depth) early
All skill scripts support --json. When the coordinator or a subagent will
parse the output, always use it. Never parse human-formatted tables in automation.
- stdout: Data only (JSON when
--json, tables/text otherwise) - stderr: Progress messages, warnings, structured errors
This enables reliable piping and programmatic consumption.
If steps have no data dependencies between them, launch them in parallel. Explicitly document which steps are independent in the command definition so future maintainers preserve the parallelism.
Don't run a step whose preconditions aren't met. Check the output of earlier steps and skip gracefully when there's nothing to do. For example, a backward trace is pointless if no dangerous APIs were found.
State which steps depend on which. A simple grouping like "Run A + B + C in parallel; D depends on A; E + F + G run in parallel after the first batch" makes execution order unambiguous.
Skill scripts cache expensive results (TTL 24h, keyed by DB mtime). This means first runs are slower than subsequent ones. Be aware of caching behavior when designing commands:
- Pass
--no-cacheto force fresh analysis when the user suspects stale results or after the underlying DB has been regenerated - If a command pipeline has long cold-start times, mention it in the command documentation so the user knows to expect it
- Don't cache results that depend on parameters beyond the DB path -- the cache key may not capture the full input space
Name subagents after the analysis step and target entity:
- Good:
"Build security dossier for AiCheckLockdown" - Good:
"Trace call chain from AiCheckLockdown" - Bad:
"Raw JSON content retrieval" - Bad:
"File read"
Descriptive names make logs readable and debugging straightforward.
Use the most appropriate subagent_type for the work:
re-analystfor explanation tasks and classification enrichmentsecurity-auditorfor security finding verification and severity validationcode-lifterfor batch lifting with shared contexttriage-coordinatorscripts for triage and multi-skill orchestrationtype-reconstructorscripts for struct/class reconstruction
Don't route everything through generalPurpose when a specialized agent
exists. In particular, use security-auditor (not re-analyst) for
skeptical verification of security findings -- it has adversarial reasoning
guidance, severity criteria, and a "rationalizations to reject" table.
When a subagent's job is to verify, review, or validate work done by an earlier
step, launch it with readonly: true. Validation tasks should never modify
outputs -- they only read data and return a judgment.
Different fields use different scales. Misinterpreting a 0-100 percentage as a 0-1 fraction (or vice versa) produces wrong conclusions. Always state the scale alongside any metric you define or consume.
Examples of common pitfalls:
| Field | Scale | Trap |
|---|---|---|
param_surface |
dict | Structured metadata: has_buffer_size_pair, has_string_pointer, has_com_interface, etc. |
noise_ratio |
0.0-1.0 | 0.48 = 48% library boilerplate |
attack_score |
0.0-1.0 | Higher = more attractive target |
When citing any metric in output, include both the raw number and its
interpretation. Never present a bare 0.2 without clarifying whether it
means 0.2% or 20%.
An empty or negative result from a tool is still information. If a function is not detected as an entry point, that itself is a data point (internal-only function). If a search returns no matches, record that explicitly rather than silently omitting the section. Empty results prevent false conclusions later.
Every command that produces a report should define an exact section structure with heading levels, table schemas, and formatting rules. Don't leave layout to improvisation -- consistency across runs makes reports comparable.
| Data Shape | Format |
|---|---|
| Key-value metadata | Two-column table |
| Hierarchical relationships | ASCII tree |
| Multi-attribute rankings | Multi-column table |
| Ordered findings | Numbered list with prefixes |
| Code flow | Fenced code block |
Use raw signatures, names, and values from the database. Don't reconstruct, rename, or "improve" data in a reporting context -- that belongs in a separate transformation step (like lifting).
Applies to commands that produce scored assessments or rankings (e.g., /audit,
/triage, /full-report). Commands that only retrieve or display data can skip
this section.
Every score or rating must trace back to explicit data sources, named fields, and concrete thresholds. If a human can't reproduce the same score from the same data, the rubric is too vague.
For each scoring dimension, list:
- Data sources: Which fields feed into this dimension
- Inputs: The specific values extracted from those sources
- Thresholds: Concrete cutoffs for each level (e.g.,
>= 500instructions = HIGH)
Automated tools can underreport. When heuristics or manual evidence contradict a tool's output, apply an explicit override:
- For categorical fields (e.g., entity type), set to the heuristic-confirmed value
- For numeric fields (e.g., risk scores), treat the tool's value as unknown -- don't use a wrong zero in scoring formulas as though it were a real measurement
- Document the override and evidence in the confidence section
Never silently accept a tool's zero when independent evidence says otherwise.
When combining dimension scores into an overall assessment:
- Start with the highest individual dimension score as baseline
- Apply escalation rules (e.g., 3+ dimensions above a threshold)
- Apply confidence adjustments
- Cap at the maximum level
Write the formula out so it's reproducible.
Applies to commands that report findings with severity levels (e.g., /audit,
/audit). Each command should define severity criteria appropriate to its domain.
Each severity level needs a concrete, falsifiable definition tied to the command's domain. Define what each level means in terms of observable evidence, not subjective judgment. Example for security-focused commands:
| Level | Definition |
|---|---|
| CRITICAL | Confirmed data flow from untrusted source to dangerous sink, directly achievable |
| HIGH | One additional precondition needed, or confirmed missing check in active path |
| MEDIUM | Multiple preconditions, or defense-in-depth gap without confirmed exploit path |
| LOW | Code quality concern without direct impact |
Other domains will have different criteria (e.g., decompiler verification uses accuracy confidence levels, not security severity).
Every concern, finding, or recommendation must reference the specific data
field it came from (e.g., Source: dossier.dangerous_operations.security_relevant_callees).
Unsourced claims erode trust in the report.
Define a fixed set of concern categories that every run must evaluate. For each, explicitly state APPLIES (with severity and evidence) or DOES NOT APPLY (with brief reason). This prevents inconsistent coverage across runs.
Severity should reflect what the data confirms, not what might theoretically be possible. If compensating controls are unaudited, say "unknown" rather than assuming they're absent.
Applies to commands that synthesize findings or assessments from multiple data sources. Optional for commands that only retrieve or display raw data.
After synthesizing a report, launch a separate subagent to independently validate findings and their assigned levels. The verifier should receive:
- The level/severity criteria (the rubric)
- The raw data (summaries, not the synthesis reasoning)
- The draft findings with assigned levels
The verifier has never seen the synthesis logic, which eliminates confirmation bias.
If the verifier adjusts a finding's level:
- Update the final report
- Note the adjustment in the confidence section (e.g., "Finding #3 adjusted from CRITICAL to HIGH by independent verification")
- Recompute any aggregated scores that depend on the adjusted value
Apply corrections; don't dump the verifier's raw response into the report.
Don't rely on ad-hoc judgment for "what to look at next." Define a ranking:
- Collect candidate entities from the analysis results
- Assign tiers based on category (highest-risk categories first)
- Sort within tiers by a quantitative signal (e.g., reachable dangerous ops)
- Output the top N
Cap recommendations at a fixed number (e.g., 5-7). An unbounded list is not actionable.
End with a specific command the user can run next, not a vague instruction.
/callgraph appinfo.dll AiLaunchProcess --depth 3 is better than
"consider tracing the call graph."
Always write a copy of the report to a predictable location:
extracted_code/<module>/reports/<command>_<target>_<timestamp>.md
Use YYYYMMDD_HHMM for timestamps. Create the reports/ directory if needed.
Reports and saved artifacts should include traceability information: generation date, workspace run directory path, target entity identifiers, and which tool versions produced the data. This enables reproducibility and lets users know exactly what they're looking at.
If the command is marked execute-immediately, run the full pipeline and deliver the completed report. Don't pause for confirmation mid-pipeline.
Every command should specify what to do for each common failure:
| Failure | Recovery |
|---|---|
| Module not found | List available modules, ask user to choose |
| Entity not found | Fuzzy search, suggest close matches |
| DB access failure | Report error with path, suggest /health |
| Missing data | Report what's missing, offer reduced-fidelity fallback |
| Skill script failure | Log error, continue with partial results |
If one step in a multi-step pipeline fails, report what completed successfully and clearly state what's missing. A partial report is almost always more useful than an error message.
Every error path should produce a visible, structured message explaining what went wrong and what the user can do about it.
Don't just say "degrade gracefully" -- document the specific fallback paths for each data source that might be absent. Build a decision tree into the command definition:
- Analysis DB missing but
extracted_code/exists? Fall back tofunction_index.jsonfor function listing andfile_info.jsonfor module identity. Note which DB-dependent features are unavailable. - Tracking DB missing? Scope to single-module analysis. Report that cross-module resolution is unavailable.
- Assembly data absent for a function? Skip assembly-dependent steps (verification, structural metrics) and note the gap.
- Decompiled code absent? Offer assembly-only analysis where supported.
Each branch should produce a clear message about what's reduced and why.
Applies to commands where the agent reads code or data beyond what automated scripts produce -- security audits, code reviews, explanations with deep dives.
Set a hard limit (e.g., max 3). Don't pad with low-value filler.
Every observation must reference a concrete line, variable, or code pattern. Vague statements like "the error handling could be improved" are not useful.
Mark observations as manual findings to distinguish them from automated tool
output. Use a consistent label like "Manual review -- not from automated analysis".
This lets readers know which findings are reproducible by re-running scripts
and which depend on agent judgment.
Manual observations inform the reader but should not alter automated dimension scores or aggregate risk levels.
If nothing noteworthy exists beyond the automated findings, omit the section entirely. An empty section with a "no observations" note adds noise.