Asta flows integration into research step #68

Draft
charliemcgrady wants to merge 6 commits into main from plan-templates

charliemcgrady commented Jun 2, 2026

Contributor

This adds a flows section to the research-step schemas and expands the task
taxonomy behind it. It replaces the markdown plan templates from earlier in
this branch.

Markdown vs YAML

I tried markdown-based workflow definitions first. Rewriting the same
workflows as YAML in schemas.yaml worked much better: the structure holds up
across runs, and the definition doubles as something we can validate against.

Scripts vs Prose

Same lesson with the task graph. Prose descriptions of how beads should be
created and closed produced a slightly different graph shape every run. Task
creation and resolution are now deterministic scripts (create-task.sh,
close-task.sh): hierarchical ids, metadata initialized from the schema,
outputs validated and published on close, parent groups closed automatically
when their last child finishes.
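
A minimal sketch of the closing behavior described above, written in Python rather than the shell of close-task.sh. The `TaskGraph` class, its method names, and the dotted id format (modeled on ids such as `bje.2` quoted later in this conversation) are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskGraph:
    """Hypothetical in-memory stand-in for the task graph."""
    # task id -> "open" or "closed"
    status: dict[str, str] = field(default_factory=dict)

    def children(self, parent: str) -> list[str]:
        """Direct children of `parent`, identified by one extra id segment."""
        depth = parent.count(".") + 1
        return [t for t in self.status
                if t.startswith(parent + ".") and t.count(".") == depth]

    def create(self, parent: Optional[str], root: str = "bje") -> str:
        """Create the root, or the next numbered child under `parent`."""
        if parent is None:
            self.status[root] = "open"
            return root
        task_id = f"{parent}.{len(self.children(parent)) + 1}"
        self.status[task_id] = "open"
        return task_id

    def close(self, task_id: str) -> list[str]:
        """Close a task, then each ancestor group whose children are all closed.

        The root (an id with no dot) is never closed automatically.
        Returns the ids that were closed, in order.
        """
        self.status[task_id] = "closed"
        closed = [task_id]
        parent = task_id.rpartition(".")[0]
        while "." in parent:  # stop before reaching the root
            if any(self.status[c] == "open" for c in self.children(parent)):
                break
            self.status[parent] = "closed"
            closed.append(parent)
            parent = parent.rpartition(".")[0]
        return closed
```

Because ids and cascade rules are computed rather than described, two runs over the same sequence of creates and closes produce the same graph shape.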

Testing

To shake this out I ran two complete workflows end to end (theory generation
grounded in an auto-ds run, then a follow-up discovery run) and published them
with the asta workspace skill — each run page is the report series plus a
browsable task graph:

https://animated-couscous-7pqjqog.pages.github.io/

Common vs Custom taxonomy

I started out trying to reuse the existing schemas and repurpose them for
these workflows, and it was really hard — the generic task types never quite
fit what a step actually produces. I've come around to the opinion that we
should lean toward rich, workflow-specific taxonomies instead of a small
common one. Expanding the taxonomy did two jobs at once: it surfaced gaps in
asta's capabilities (task types we want but have no skill for yet), and it
let the workflows get rich without the long markdown agents drift away from.
The process was repetition: run the workflow, ask the agent to reflect on
where the schema was limiting, expand the schema, run again — until the
reports consistently came out in good shape.

rationale: string

literature_review:
  inputs: [scope, definitions]

Contributor Author

Strict inputs don't really make sense when multiple templates exist for the research step. For example, synthesis can happen after either literature review or analysis steps. Update the schemas to document only output shapes, which LLMs can then use to glue outputs into downstream tasks.
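
To make the "output shapes only" idea concrete, here is a rough Python sketch of a contract check that looks only at what a task produces and says nothing about its inputs. The `OUTPUT_CONTRACTS` table, its key names, and `check_output` are hypothetical; the real keys and type vocabulary live in schemas.yaml and are not reproduced here.

```python
# Hypothetical output contracts: task type -> {output key: expected Python type}.
# Key names are invented for illustration only.
OUTPUT_CONTRACTS = {
    "literature_review": {"summary": str, "papers": list},
    "synthesis": {"rationale": str, "claims": list},
}


def check_output(task_type: str, output: dict) -> list[str]:
    """Return a list of problems; an empty list means the output fits its contract."""
    contract = OUTPUT_CONTRACTS.get(task_type)
    if contract is None:
        return [f"unknown task_type: {task_type}"]
    problems = []
    for key, expected in contract.items():
        if key not in output:
            problems.append(f"missing output.{key}")
        elif not isinstance(output[key], expected):
            problems.append(f"output.{key} should be {expected.__name__}")
    return problems
```

Nothing in the contract names an upstream task, so the same `synthesis` contract holds whether the step follows a literature review or an analysis.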

# ({research_step: {task_type, inputs, output_schema_version, output}})
# 3. has every required `output.<key>` for the given <task_type> per
# assets/schemas.yaml (schema_version: 1)
# If [task-dir] (e.g. .asta/tasks/<id>) is given, also runs document-quality

Contributor Author

I updated research step to require that the LLM write an output.md alongside the output.json. I found this vital for understanding what actually got executed in each step:

Image

# 5 — task_type mismatch with envelope
# 6 — required output.md missing (only when [task-dir] supplied)
# 7 — output.md empty or a stub (only when [task-dir] supplied)
# 8 — output.md has no markdown links (only when [task-dir] supplied)

Contributor Author

I explored many approaches to making the output markdown human-understandable and rich in citations. Prompting alone was insufficient and was often ignored by the LLM. The validate-output "linting" approach seems to do the best job of steering the agent toward the output quality we want:

Image

Member

That's pretty interesting. Effectively, the validation shell script is the documentation for the output format.
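
For readers who want the linting idea in a few lines, a Python sketch of the three output.md checks follows. The exit codes 6, 7, and 8 mirror the comments in the diff excerpt above; `lint_output_md`, the link regex, and the `MIN_CHARS` stub threshold are assumptions, since the real rules in validate-output.sh are not shown in this thread.

```python
import re
from pathlib import Path

# Matches inline markdown links such as [text](target).
MARKDOWN_LINK = re.compile(r"\[[^\]]+\]\([^)\s]+\)")
MIN_CHARS = 200  # assumed threshold for what counts as a "stub"


def lint_output_md(task_dir: Path) -> int:
    """Return 0 if output.md passes, else the exit code of the first failed check."""
    md = task_dir / "output.md"
    if not md.is_file():
        return 6  # required output.md missing
    text = md.read_text(encoding="utf-8").strip()
    if len(text) < MIN_CHARS:
        return 7  # output.md empty or a stub
    if not MARKDOWN_LINK.search(text):
        return 8  # output.md has no markdown links
    return 0
```

Each failure maps to a distinct code, so the agent gets a specific, actionable signal instead of a general instruction it can ignore.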

exit 8
fi
# Strip links, then flag any named entity still bare in output.md / report.tex.
unlinked=$(for f in "$md" "$task_dir/artifacts/report.tex" "$task_dir/report.tex"; do

Contributor Author

LLMs love to refer to entities (e.g. files, literature review results) without actually linking to them. This guard makes sure that known entities are hyperlinked, which really helps a user navigate the output.
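
The guard's "strip links, then look for bare names" approach can be sketched in Python as below. `unlinked_entities` and the idea of passing a list of known entity names are illustrative assumptions; how the real script collects its entity list is not shown here.

```python
import re

# Matches inline markdown links such as [text](target).
LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")


def unlinked_entities(markdown: str, entities: list[str]) -> list[str]:
    """Return the known entity names that still appear once links are removed."""
    bare_text = LINK.sub("", markdown)
    return [name for name in entities if name in bare_text]
```

For example, given the text `See [report.tex](artifacts/report.tex) and output.json.` and the entities `report.tex` and `output.json`, only `output.json` is flagged, because the first mention disappears when its link is stripped.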

exit 9
fi

# The report's basics. Only the report node makes report.tex; when it exists,

Contributor Author

I'm not sure that putting step-specific validations in the validate-output.sh script makes sense. Giving the LLM step-specific scripts to run confuses it and increases the likelihood that it decides not to run them. Some sort of extension system where the template provides step-specific linters might be a good solution.

rodneykinney, Jun 3, 2026

Member

I would have recommended making a different validation script for each output type. But I think you're saying that you tried this and it didn't work? Still, I think it's the only solution that scales. It seems straightforward to associate the script with the task type somewhere, like in the schemas.yaml or the template file.
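
One way to picture associating validators with task types is a registry consulted by a single entry point, so the agent still runs one command. This is a sketch under assumptions: the `LINTERS` mapping, the linter functions, and the output key names are all hypothetical, and in practice the mapping might be declared in schemas.yaml or the template file rather than in code.

```python
from typing import Callable

Linter = Callable[[dict], list[str]]  # takes an output dict, returns problems


def require_keys(*keys: str) -> Linter:
    """Build a linter that reports any missing output key."""
    def lint(output: dict) -> list[str]:
        return [f"missing output.{k}" for k in keys if k not in output]
    return lint


def report_has_title(output: dict) -> list[str]:
    """Example of a step-specific check that only applies to reports."""
    return [] if output.get("title") else ["report has no title"]


# Hypothetical registry: task type -> linters to run for that type.
LINTERS: dict[str, list[Linter]] = {
    "literature_review": [require_keys("papers")],
    "report": [require_keys("report_path"), report_has_title],
}


def run_linters(task_type: str, output: dict) -> list[str]:
    """Run every linter registered for task_type; unknown types have none."""
    problems: list[str] = []
    for lint in LINTERS.get(task_type, []):
        problems.extend(lint(output))
    return problems
```

The step-specific logic lives with the task type, while the caller only ever invokes `run_linters`.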

@@ -0,0 +1,78 @@
# Example theorizer mission statement

Contributor Author

I have found providing examples of inputs to agents really helpful in getting consistency across workflows.

@@ -0,0 +1,118 @@
---
name: data_driven_theory_generation

Contributor Author

This is the core template for the auto-ds => theorizer workflow. I explored expressing these workflows using a strongly-typed pydantic workflow engine. I had good results, and I think a more strictly typed system might be the best long term approach. However, I ended up creating a pretty elaborate DSL, and the engine code itself was over 1000 lines.

Markdown templates are remarkably effective and seem like a good approach for now until we have a good sense of the patterns and number of templates we'd like to support.

Member

Those task types are pretty elaborate! Very cool, actually. I like the pattern of identifying the upstream inputs by just the type names.

I'm not clear on the node id vs the type. Why have both?

Many of these node types are not listed in schemas.yaml. Should they be?

I'm a little unclear about the role of schemas.yaml, actually. I think it's really useful to have an explicit description of the output format of each task, but it also looks like the validation script is serving this purpose?

Off the top of my head, what makes sense to me is for schemas.yaml to be the source of truth for the set of all possible nodes and what they produce. It certainly seems cleaner for the agent to consult schemas.yaml instead of the validation script. The template files can describe ways in which the nodes can be chained together to accomplish the research goal. That means it's unclear where to document the process for executing each of the task types. Many of them should be shared, I think. I'm shifting my view on whether a template should customize the instructions for executing a node. I think it's cleaner to just introduce a new task type if there's a variant (or maybe parameterize the task), instead of overriding the implementation in the template. Maybe task types that are unique to a template can be described local to it, and we have some way of promoting them to a shared space.

|---|---|
| `literature_review`, `hypothesis`, `analysis`, `synthesis` | **plan** (with this issue as the source). `plan` then chains to **update-summary**. Note: `hypothesis` only reaches this branch in the rare case it was left open at creation; the normal path is plan→auto-resolve. |
| `scope`, `definitions`, `experiment_design`, `evidence_gathering` | **update-summary** directly. |
5. **Do the work.** Produce all three task outputs under `.asta/tasks/<id>/` — see the skill's "Task outputs" table for their roles. **All three are mandatory:** `output.json` (matches the schema), `output.md` (the readable result, with links per the template's writing rules), and `artifacts/` (every other file produced). For schema fields ending in `_path`, write the file first and put the relative path in the JSON.

Contributor Author

This is the core update to support templates. TL;DR: instead of hardcoding the steps, point the agent to the template.

I also added instructions to persist the output of each step in a specific task directory, which enables us to build a really useful flow visualization on top of the workflow:

Image

As a human, persisting the outputs into folders that I can navigate myself is invaluable.

Member

Love the workflow visualization. Love the accompanying output.md.

I notice you're not consulting the template for how to execute a task, just for preferences on output.md. That seems like a good trade-off. I'm unclear on how the agent knows which template is being used across sessions. Is that somewhere in the beads DB?

charliemcgrady changed the title from "research-step: markdown plan templates + data-driven theory generation" to "Asta flows integration into research step" on Jun 10, 2026
…hypothesis-driven flow

schemas.yaml v2: tasks are pure output contracts (key -> type maps), one
outcome verdict vocabulary, immutable adjudication records, A2A 1.0
artifact/part types, config block, and the hypothesis_driven_research flow.
Ships the compiled assets (per-task JSON Schemas, flows.json, flow diagrams);
validate-output.sh deep-validates against them. New next-task.sh (single
ordering definition) and task-output-keys.sh (single schema reader);
bd list --limit 0 throughout; close-task.sh never closes the epic root.
Workflows updated to match; execute.md adds report conventions.
assets/compiled/ is generated from schemas.yaml by the schema compiler at
build time; keep the source of truth only.
charliemcgrady force-pushed the plan-templates branch 2 times, most recently from 94fe2c4 to efd94ea, on June 16, 2026 16:48
Comment thread src/asta/utils/asta.conf
flows {
tool_name = "asta-flows"
install_type = "local"
install_source = "~/workspace/asta-flows"

Member

You want to install from git, presumably. For local testing, you can set ASTA_CONFIG_FILE to a file that points to your local directory.

@rodneykinney

Member

Things that are great:

  • Clear enumeration of how steps are linked together
  • Clear declaration of expected inputs and outputs for each step
  • Clear declaration of the commands that a step should execute

More observations:

The opening paragraph of research-step/SKILL.md is jargon-heavy and specific to a particular flow. Pretty confusing. It should start with a clear definition of key terms: "This skill defines and executes a research flow, which is a chain of tasks, each of which produces a set of outputs with a defined type. These are defined in schemas.yaml"

In schemas.yaml, we need clear definitions for type/task/flow. The comments for "config" are helpful; the "types" comments are not. I'm not sure "config" even belongs in schemas.yaml. It seems to contain default input values for individual flow steps.

Actually, "task" is confusing to me, because it just looks like a compound data type. It seems like each "task" could equally well be defined as a "type". The main purpose seems to be to attach a list of artifacts to an existing type, but there's probably a cleaner way to do this. Maybe each flow just implicitly produces an output type plus a list of artifacts.

Are flow step inputs earlier flow steps, or data types? I see that most flow step names have corresponding task names, but some don't (reproduction vs reproduction_synthesis). Maybe this is just an oversight, but if so it should be caught by compile-schemas.py. My best guess is that when a "type" and a "task" share a name (e.g. testability_triage), it's a coincidence, but when a "task" and a "flow" share a name, it's a way of linking them. In that case, I think you might as well get rid of the "task"s and let each flow step declare its output types directly, with the understanding that it also produces a list of artifacts.
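
If flow steps are meant to link to tasks by name, the check that compile-schemas.py could run is small. The sketch below is illustrative only: `unmatched_flow_steps` and the shape of its arguments are assumptions about how the parsed schemas.yaml might be represented, not the compiler's actual API.

```python
def unmatched_flow_steps(
    tasks: set[str], flows: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Map each flow name to the step names that have no task of the same name."""
    missing: dict[str, list[str]] = {}
    for flow_name, steps in flows.items():
        orphans = [step for step in steps if step not in tasks]
        if orphans:
            missing[flow_name] = orphans
    return missing
```

A non-empty result at compile time would surface a mismatch like the reproduction / reproduction_synthesis one as an error rather than leaving it for a reader to spot.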

I don't see any description of what gets written into .asta and what gets attached to a beads task's metadata. Do we need both? I can see an argument for splitting the A2A artifacts from the direct outputs.

I don't see where compile-schemas.py should be run. Is this part of the plugin release process? Then it should be part of the build-plugins make target.

I notice you abandoned beads' native way of tracking task dependencies in favor of an id-based hierarchy. Curious what drove this decision, since I thought agents were able to use it pretty well. Using literal beads task IDs seems like it would make it hard to do dynamic replanning. Do you need to change a task ID to change its order in the flow?

The plan.md doc contains the best description of how to read schemas.yaml. There might be a better place for that, though. Maybe a standalone doc, since it's kind of cross-cutting. Some of the content of plan.md relates more to execution, and is flow-specific (e.g. the Gates section). The execute.md instructions look really good.

@rodneykinney

Member

Notes from a test run:

I gave it the mission of generating theories to explain an AD result from Ai1 behavioral experiments. The agent presented a few different possible flows and recommended a stripped-down theorizer. Very cool to be able to customize a theorizer workflow!

Agent successfully built an extraction schema and constructed novelty and accuracy-focused theories from the literature.

I noticed a lot of one-off code generation, to produce step outputs I guess? I was surprised that it was needed, as I would have expected the schemas.yaml types to conform more tightly to the asta CLI outputs. Seemed to work fine, though.

From Claude's self-reflection:

  1. Plugin shipped without assets/compiled/

scripts/validate-output.sh aborts on every task close because assets/compiled is empty under asta-preview. I guess running compile-schemas.py as part of make build-plugins would fix this.

  2. asta generate-theories build-extraction-schema runs the full pipeline

Help text claims it just builds the schema. The CLI call ran schema build + 56 paper extractions + 8 theory formations + 18 novelty assessments — ~38 min, $39.85. Three downstream tasks now
reduce to "adopt artifacts from disk" rather than independent steps. Either the CLI naming or the server-side behavior is wrong. Workflow assumed step-at-a-time isolation.

  3. Beads' ~64KB metadata cap collides with rich typed outputs

theory_formation close failed on the first try because 8 theories with full content trees (statements, evidence bullets, predictions, unaccounted) overran the cap. The execute workflow says "keep it slim" but
the schema's required fields are themselves nontrivial. Fixed by trimming inline to top-1 supporting/conflicting evidence per statement and ~160-280 char caps, with full content in a referenced artifact file.
Pattern that's going to repeat on theory_synthesis, hypothesis_report, anything synthesis-shaped — this should probably be enforced by the schema or by close-task.sh (size check + clear error) instead of
discovered ad-hoc.

  4. Auto-close cascade fires mid-flow when groups have one open child

After evidence_extraction (the only child of bje.2) closed, close-task.sh cascade-closed bje.2. I had to reopen it manually before laying theory_generation under it. Happened a second time when bje.2.2.1.1
closed and cascade-closed bje.2.2.1. The plan workflow's "lay only the frontier" rule guarantees this whenever a group has a single sequential step. Either lay more eagerly, or have the close-cascade respect
"more flow steps remain under this group."

  5. extracted_data schema mismatched the asta return shape

The schema is shaped for a single paper (paper_id: string), but find-and-extract returns extractions from many. Coerced with paper_id: "multi" and per-paper provenance via citation_title in each row. Validator
accepted it but it's a schema-as-documentation problem — a fresh reader of the JSON would assume single-paper.
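
Two of the fixes suggested in the list above (a size check with a clear error before close, and a cascade that respects remaining flow steps) are small enough to sketch. This is Python for illustration only; the function names are hypothetical, the cap is the approximate figure reported above rather than a documented limit, and `remaining_steps` assumes the closer can ask the flow definition how many steps are still to be laid under a group.

```python
import json
from typing import Optional

METADATA_CAP_BYTES = 64 * 1024  # approximate; the report above says "~64KB"


def metadata_size_error(output: dict, cap: int = METADATA_CAP_BYTES) -> Optional[str]:
    """Return a clear error message if the serialized output exceeds the cap, else None."""
    size = len(json.dumps(output, ensure_ascii=False).encode("utf-8"))
    if size <= cap:
        return None
    return (
        f"output is {size} bytes, over the {cap}-byte metadata cap; "
        "move large fields into an artifact file and reference it by path"
    )


def should_cascade_close(children_status: list[str], remaining_steps: int) -> bool:
    """Close a group only when every child is closed and no flow steps remain under it."""
    return remaining_steps == 0 and all(s == "closed" for s in children_status)
```

With the second check in place, a group whose only laid child has closed stays open while later steps of the flow are still pending under it.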
