Add Devin plugin (plugins/devin): MCP server + ATIF-v1.7 harvest#88

Open
xerxes-y wants to merge 4 commits into
microsoft:mainfrom
xerxes-y:add-devin-plugin

Conversation

@xerxes-y


Wires the skillopt_sleep engine into Devin (Cognition) via an MCP server, following the same thin-shell pattern as plugins/copilot.

  • mcp_server.py: stdlib-only stdio MCP server exposing the standard sleep_* tools (status, dry-run, run, adopt, harvest). REPO_ROOT defaults to ../.. so it finds skillopt_sleep automatically when run from plugins/devin/.
  • harvest_devin.py: converts Devin ATIF-v1.7 transcripts, agentmemory, and .devin/skills/*/SKILL.md into the Claude Code-compatible JSONL the engine consumes; enriches with taskKey + outcome envelopes (hard test/build signal or judge rubric). Workspace auto-detection; cross-platform paths.
  • judge.py, mcp-config.example.json, devin-rules.snippet.md, README.md.
  • plugins/README.md: add Devin to the platform + install tables.

No changes to skillopt_sleep; shells out to python -m skillopt_sleep like the other plugins. Pure stdlib; default backend mock (no API spend).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Yif-Yang

Contributor

Thanks for the Devin integration @xerxes-y — the thin-shell MCP pattern mirroring plugins/copilot is the right shape. 🙏

Two things before we can evaluate this for merge:

1. Could you share some test/validation results? Right now the PR adds no tests, and the harvester + MCP server are non-trivial. A short demonstration would go a long way — e.g. running harvest_devin.py against a sample ATIF-v1.7 transcript and showing the resulting JSONL, plus a sleep_status/sleep_dry-run round-trip through the MCP server. Even a couple of small unit tests (schema of the exposed tools, harvester output shape) in line with tests/test_mcp_schema.py would help us trust the integration.

2. Likely path bug: SKILLOPT_DEVIN_CLAUDE_HOME isn't expanded when read from the env. In mcp_server.py:

CLAUDE_HOME = os.environ.get(
    "SKILLOPT_DEVIN_CLAUDE_HOME",
    os.path.expanduser("~/.skillopt-sleep-devin"),
)

Only the fallback default gets expanduser. Your mcp-config.example.json sets "SKILLOPT_DEVIN_CLAUDE_HOME": "~/.skillopt-sleep-devin", so when that env var is present the literal ~/... is passed straight to --claude-home (line ~147) without expansion. The harvester then writes under a literal ~ directory while the engine reads elsewhere, which yields zero mined sessions for the documented config. Wrapping the whole lookup in os.path.expanduser(...) should fix it.
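
A minimal sketch of that suggested fix, assuming the variable name and default from the snippet quoted above. `resolve_claude_home` and its `env` parameter are hypothetical names used only for illustration; they are not taken from mcp_server.py.

```python
import os


def resolve_claude_home(env=None):
    """Return the Claude home directory with a leading ~ expanded.

    Hypothetical helper: expanduser wraps the whole lookup, so a value
    read from the environment is expanded the same way as the default.
    """
    env = os.environ if env is None else env
    raw = env.get("SKILLOPT_DEVIN_CLAUDE_HOME", "~/.skillopt-sleep-devin")
    # expanduser leaves paths that do not start with ~ unchanged,
    # so absolute values pass through untouched.
    return os.path.expanduser(raw)
```

Passing the mapping in explicitly keeps the lookup easy to exercise in a regression test without mutating the real environment.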

Also note the CLA check is still pending — that'll need to pass before merge.

Appreciate the contribution; with some validation output and that path fix this'll be in good shape.

xerxes-y and others added 3 commits June 25, 2026 21:49
…ture

Review fixes:
- Path bug: SKILLOPT_DEVIN_CLAUDE_HOME (and SKILLOPT_SLEEP_REPO) read from the
  env are now wrapped in os.path.expanduser, so the documented "~/..." config
  no longer passes a literal ~ to --claude-home (which yielded zero mined
  sessions). expanduser on an absolute default is a no-op.
- tests/test_devin_plugin.py: tool-schema completeness, action→subcommand map,
  backend enum, the CLAUDE_HOME expansion regression, and an ATIF-v1.7 harvest
  shape test against a bundled fixture.
- plugins/devin/fixtures/devin_sample.json: sample ATIF-v1.7 transcript.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirror the copilot MCP server: same rich _TOOL_SCHEMA (source, model,
tasks_file, target_skill_path, max_sessions, max_tasks, lookback_hours,
auto_adopt, json, edit_budget, hour, minute) and generic flag forwarding, plus
sleep_schedule / sleep_unschedule. Devin specifics retained: the ATIF-v1.7
harvest step (run before data-reading actions, engine pointed at it via
--claude-home, default --source claude) and post-adopt sync into .devin/skills/.
Tests + README + rules snippet updated for the 7-tool interface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A harvested single-turn Devin session spanned only 1s (reply written 1000ms
after the prompt), which the engine's harvest filter conservatively classifies
as a <3s headless replay (skillopt_sleep Issue microsoft#62) and skips — so a real
single-turn session mined 0 tasks. Widen the prompt->reply gap to 5s. With this,
an end-to-end dry-run mines the task: "night 1: 1 sessions -> 1 tasks".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xerxes-y

Author

Thanks for the careful review! Both points addressed in
devin plugin: expand ~ in CLAUDE_HOME from env + add tests & ATIF fixture
(pushed to the branch).

1. Path bug — fixed

Good catch. SKILLOPT_DEVIN_CLAUDE_HOME (and SKILLOPT_SLEEP_REPO) read from the
env are now wrapped in os.path.expanduser, so the documented "~/..." config
no longer passes a literal ~ to --claude-home. expanduser on an absolute
default is a no-op. There's a regression test for exactly this
(TestClaudeHomeExpansion).

2. Tests + validation

Added tests/test_devin_plugin.py (mirrors tests/test_mcp_schema.py) and a
bundled plugins/devin/fixtures/devin_sample.json (ATIF-v1.7):

$ python3 -m unittest tests.test_devin_plugin -v
test_env_tilde_is_expanded ... ok
test_atif_fixture_yields_gradeable_task ... ok
test_actions_map_to_engine_subcommands ... ok
test_backends_in_enum ... ok
test_tools_are_the_sleep_interface ... ok
Ran 5 tests in 0.005s — OK

Harvest a sample ATIF-v1.7 transcript → outcomes.jsonl:

$ python3 plugins/devin/harvest_devin.py \
    --devin-transcripts plugins/devin/fixtures --out-dir /tmp/out
[harvest_devin] devin        : 1 sessions
[harvest_devin] total        : 1 synthetic sessions → /tmp/out

$ cat /tmp/out/outcomes.jsonl
{"type":"outcome","sessionId":"devin_demo-001",
 "taskKey":"general:fix:nullpointerexception","success":true,
 "verifier":"tests","evidence":"BUILD SUCCESS",
 "reference":{"repro":"rtk mvn test -Dtest=OrderServiceTest"}}

The converted transcript carries the grouping key on the user turn:
{"type":"user","taskKey":"general:fix:nullpointerexception", ...}.
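
For readers who want to check the harvester output shape themselves, here is a rough sketch that parses the outcome record shown above. The set of required keys is an assumption inferred from that one sample record, not a schema published by the engine, and `parse_outcome` is a hypothetical helper name.

```python
import json

# Assumed required keys, inferred from the single sample record above.
REQUIRED_KEYS = {"type", "sessionId", "taskKey", "success", "verifier"}


def parse_outcome(text):
    """Parse one outcome record and check the assumed minimal shape."""
    record = json.loads(text)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"outcome record missing keys: {sorted(missing)}")
    if record["type"] != "outcome":
        raise ValueError(f"unexpected record type: {record['type']!r}")
    return record


sample = (
    '{"type":"outcome","sessionId":"devin_demo-001",'
    '"taskKey":"general:fix:nullpointerexception","success":true,'
    '"verifier":"tests","evidence":"BUILD SUCCESS",'
    '"reference":{"repro":"rtk mvn test -Dtest=OrderServiceTest"}}'
)
record = parse_outcome(sample)
print(record["taskKey"], record["success"])
# prints: general:fix:nullpointerexception True
```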

sleep_status round-trip through the MCP server (engine, mock backend):

$ printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"sleep_status","arguments":{"project":"/tmp/demo","backend":"mock"}}}' \
  | python3 plugins/devin/mcp_server.py
# → [harvest] ... synthetic sessions
#   [engine]  [sleep] nights so far: 0
#             [sleep] no staged proposals yet.
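
The same two requests can be built from Python instead of printf. This is only a sketch: the message contents are copied from the command above, while `build_requests` is a hypothetical helper name.

```python
import json


def build_requests(project, backend="mock"):
    """Return newline-delimited JSON-RPC requests for a sleep_status call."""
    initialize = {"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}}
    call = {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "tools/call",
        "params": {
            "name": "sleep_status",
            "arguments": {"project": project, "backend": backend},
        },
    }
    # One JSON document per line, matching the printf '%s\n' form.
    return "".join(json.dumps(message) + "\n" for message in (initialize, call))


payload = build_requests("/tmp/demo")
```

The resulting `payload` string could then be fed to the server's stdin, for example with `subprocess.run(["python3", "plugins/devin/mcp_server.py"], input=payload, text=True, capture_output=True)`.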

tools/list exposes the standard interface (now 7 tools incl. schedule).

Full sleep_dry_run end-to-end (mock backend):

$ # ...sleep_dry_run tools/call through mcp_server.py
[harvest]  [harvest_devin] devin: 1 sessions → 1 synthetic session
[engine]   [sleep] night 1: 1 sessions -> 1 tasks
           [sleep] held-out 0.000 -> 0.000 => reject (accepted=False)

i.e. harvest → mine → replay → held-out gate all run; the mock backend
correctly rejects (no real improvement).

While validating this I found and fixed a real integration bug: a harvested
single-turn Devin session spanned only 1s, which the engine's harvest filter
classifies as a <3s headless replay (Issue #62) and skips — so it mined 0
tasks. Widening the prompt→reply gap to 5s fixes it (the run above mines the task
correctly).
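
To illustrate why the 1s span was dropped and the 5s span is kept, here is a sketch of a span-based filter of the kind described. It is not the engine's actual code: the function name, the constant, and the exact comparison are assumptions based only on the "<3s headless replay" description above.

```python
from datetime import datetime, timedelta

# Assumed threshold, taken from the "<3s" wording above.
MIN_SESSION_SPAN = timedelta(seconds=3)


def looks_like_headless_replay(timestamps):
    """Return True when the first-to-last span is under the threshold."""
    if not timestamps:
        return True  # nothing to measure; treat as not a real session
    span = max(timestamps) - min(timestamps)
    return span < MIN_SESSION_SPAN


prompt_time = datetime(2026, 6, 25, 21, 49, 0)
one_second = [prompt_time, prompt_time + timedelta(seconds=1)]
five_seconds = [prompt_time, prompt_time + timedelta(seconds=5)]
print(looks_like_headless_replay(one_second), looks_like_headless_replay(five_seconds))
# prints: True False
```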

3. Schema / tool parity with copilot

Also went ahead and brought the server to full parity with plugins/copilot:
the same rich _TOOL_SCHEMA (source, model, tasks_file,
target_skill_path, max_sessions, max_tasks, lookback_hours,
auto_adopt, json, edit_budget, hour, minute) and generic flag
forwarding, plus sleep_schedule / sleep_unschedule. The Devin specifics
are retained: the ATIF harvest runs before data-reading actions (engine pointed
at it via --claude-home, default --source claude) and the post-adopt sync
into .devin/skills/. tools/list now exposes all 7 sleep_* tools; tests
updated accordingly.
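
As an illustration of what generic flag forwarding can look like, the sketch below turns a tool-call arguments dict into CLI flags. It is an assumption about the approach, not the code in this PR: `args_to_flags` is a hypothetical name, and the underscore-to-dash and boolean conventions are guesses modelled on flags such as --claude-home.

```python
def args_to_flags(arguments):
    """Translate tool-call arguments into a flat list of CLI flags.

    Assumed conventions: snake_case keys become --kebab-case flags,
    True becomes a bare flag, and False or None values are omitted.
    """
    flags = []
    for key, value in arguments.items():
        if value is None or value is False:
            continue
        flag = "--" + key.replace("_", "-")
        if value is True:
            flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags


print(args_to_flags({"max_sessions": 5, "auto_adopt": True, "json": False}))
# prints: ['--max-sessions', '5', '--auto-adopt']
```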

@xerxes-y

Author

@microsoft-github-policy-service agree
