Blackbox CTF Agent + Knowledge Base

This repo is a local-first blackbox CTF workspace with two core layers:

retrieval (rag/)
execution agent (web_agent/)

Layout

repos/: upstream knowledge sources cloned from public repositories
notes/: retrieval strategy and source tagging notes
scripts/: sync/build/run entry scripts
rag/: retrieval and index utilities
web_agent/: interpreter + planner + deliberation + capability manager + logistics layer + shared runtime state
docs/: architecture and iteration notes (init, pentagi, codex, runtime_codex_contract)

File map:

mako/
  README.md
  SKILL.md
  .env.example
  docs/
    init.md
    pentagi.md
    codex.md
    runtime_codex_contract.md
  rag/
    common.py
    agent.py
    index.py
    query.py
  web_agent/
    cmd_agent.py
    solver_shared.py
    planner.py
    deliberation.py
    reflector.py
    capability.py
    logistics.py
    task_interpreter.py
  scripts/
    run_web_agent.sh
    build_rag_index.sh
    ask_rag.sh
  tests/
    test_deliberation.py
    test_policy_control.py
    test_structured_actions.py

Design Docs

docs/init.md: current baseline thinking after removing legacy sync/
docs/pentagi.md: PentAGI-driven integration notes and follow-up decisions
- includes NYU Web smoke rerun notes (2026-04-11) with intended-exploit verification and reliability-evaluation boundaries
docs/codex.md: research notes on moving this repo to a Codex-backed foundation via Responses API and function tools

Current Sources

PayloadsAllTheThings: payload patterns and bypass tricks
hacktricks: practical attack checklists and methodology
nuclei-templates: vulnerability detection templates
OWASP-CheatSheetSeries: defensive/offensive best practices and protocol references
fuzzdb: fuzz payloads and probe dictionaries
SecLists (sparse): targeted wordlists for web content discovery, fuzzing, payloads, credentials, usernames

Suggested RAG Use in Blackbox CTF

Intent routing:
- recon -> SecLists, HackTricks, OWASP
- payload generation -> PayloadsAllTheThings, fuzzdb, SecLists/Payloads
- vuln validation -> nuclei-templates, HackTricks, OWASP
- bypass tuning -> PayloadsAllTheThings, HackTricks
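The routing list above can be sketched as a small lookup. This is an illustration only: the keyword hints, the intent labels, and the function names are assumptions and are not taken from rag/. The repo names follow the Current Sources list.

```python
# Hypothetical routing table mirroring the intent list above.
INTENT_SOURCES = {
    "recon": ["SecLists", "hacktricks", "OWASP-CheatSheetSeries"],
    "payload": ["PayloadsAllTheThings", "fuzzdb", "SecLists"],
    "validate": ["nuclei-templates", "hacktricks", "OWASP-CheatSheetSeries"],
    "bypass": ["PayloadsAllTheThings", "hacktricks"],
}

# Illustrative keyword hints (assumed); checked in insertion order,
# so more specific intents are listed first.
INTENT_KEYWORDS = {
    "bypass": ("bypass", "waf", "filter", "evasion"),
    "payload": ("payload", "inject", "polyglot"),
    "validate": ("verify", "validate", "confirm", "detect"),
    "recon": ("enumerate", "discover", "wordlist", "recon"),
}


def route_intent(query: str) -> str:
    """Return the first intent whose keyword appears in the query."""
    q = query.lower()
    for intent, words in INTENT_KEYWORDS.items():
        if any(w in q for w in words):
            return intent
    return "recon"  # default to the broadest intent


def sources_for(query: str) -> list:
    """Map a query to the repos that should be searched first."""
    return INTENT_SOURCES[route_intent(query)]
```

A keyword table keeps routing deterministic and cheap; a classifier could replace route_intent without changing the source table.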
Chunking:
- markdown/yaml/txt chunk size: 500-1200 tokens
- keep section title + file path as metadata
- preserve code blocks as independent chunks
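A minimal sketch of the chunking rules above for markdown input. It is not the logic in rag/index.py: the Chunk type and chunk_markdown are hypothetical names, tokens are approximated by whitespace-separated words, and only the upper size bound is enforced.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # markdown code-fence marker


@dataclass
class Chunk:
    path: str   # source file path, kept as metadata
    title: str  # nearest section title, kept as metadata
    kind: str   # "text" or "code"
    text: str


def chunk_markdown(path: str, text: str, max_tokens: int = 1200) -> list:
    """Split markdown into text chunks and standalone code-block chunks."""
    chunks, buf = [], []
    title, in_code = "", False

    def flush(kind: str) -> None:
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(path, title, kind, body))
        buf.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            if in_code:
                buf.append(line)
                flush("code")   # closing fence: emit the code block alone
            else:
                flush("text")   # opening fence: emit pending prose first
                buf.append(line)
            in_code = not in_code
            continue
        if not in_code and line.startswith("#"):
            flush("text")       # a new section starts a new chunk
            title = line.lstrip("#").strip()
            continue
        buf.append(line)
        # Word count stands in for a real tokenizer (assumption).
        if not in_code and sum(len(l.split()) for l in buf) >= max_tokens:
            flush("text")
    flush("code" if in_code else "text")
    return chunks
```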
Retrieval:
- hybrid retrieval (BM25 + embedding)
- rerank with query-aware rules (payload-heavy query => prioritize payload repos)
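One common way to combine the two signals is weighted score fusion after min-max normalization, followed by a per-repo boost. The sketch below assumes that alpha weights the embedding score, that document ids look like "repo/path", and that boosts are plain multipliers; none of this is confirmed by rag/query.py.

```python
def min_max(scores: dict) -> dict:
    """Scale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(bm25: dict, dense: dict, alpha: float = 0.7,
                top_k: int = 8, repo_boost: dict = None) -> list:
    """Fuse BM25 and embedding scores, then apply query-aware repo boosts."""
    b, d = min_max(bm25), min_max(dense)
    repo_boost = repo_boost or {}
    fused = {}
    for doc_id in set(b) | set(d):
        # alpha weights the embedding side (assumed convention).
        score = alpha * d.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        repo = doc_id.split("/", 1)[0]
        fused[doc_id] = score * repo_boost.get(repo, 1.0)
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

For a payload-heavy query, passing something like repo_boost={"PayloadsAllTheThings": 1.2} moves payload repos up without changing the underlying scores.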
Feedback loop:
- store execution result as memory (success/failure, status code, response signature)
- use self-reflection to avoid repeating failed payload families

Sync

cd mako
./scripts/sync_sources.sh

Minimal RAG Quick Start

Create env file:

cd mako
cp .env.example .env

Edit .env and set at least:

OPENAI_API_KEY=your_key

Build index:

./scripts/build_rag_index.sh

Ask:

./scripts/ask_rag.sh "For a suspected SSTI input point, which black-box checks should be done first?"

Hybrid retrieval is enabled by default. You can tune the retrieval mode, top-k, and alpha:

./scripts/ask_rag.sh "How to verify SSRF and distinguish reflected internal-network probe responses?" --mode hybrid --top-k 8 --alpha 0.7
./scripts/ask_rag.sh "xss filter bypass payload" --mode bm25 --top-k 10

Web Agent (Interpreter + Solver)

Build index first, then run:

./scripts/build_rag_index.sh
./scripts/run_web_agent.sh "http://127.0.0.1:8080/" "Find SQL injection and retrieve flag"

Architecture:

task_interpreter reads objective + hint + observed signals + RAG context
interpreter writes task_prior.* into shared sqlite memory
planner turns priors and runtime evidence into explicit subtasks
solver deliberation reads:
- task priors
- active subtask
- persistent facts
- hypotheses
- reflection constraints
controller reflection runs in parallel and emits policy constraints
validator layer blocks phase drift, low-gain repeats, and semantic-recovery violations
in codex / codex_collab mode, a single Codex tactical solver proposes the next executable step
in codex_collab mode, a counter-solver / falsifier attacks the current route and proposes a cheap discriminator experiment before the tactical step
capability manager evaluates execution gaps (reuse existing action vs write helper vs install dependency vs replan)
logistics layer executes environment/setup work requested by capability resolution and records it outside challenge-step counting
command output updates:

facts
reflection state
hypothesis lifecycle
plan patches for follow-up subtasks

codex_collab runs also write artifacts/.../<run_id>/codex_dialogue.jsonl, which records the counter-solver and tactical solver prompts/replies for post-run inspection.

interpreter/planner are refreshed periodically or after drift / repeated low-gain failure

Architecture diagram:

flowchart TD
    U[User / Challenge<br/>target + objective + hint] --> I[Task Interpreter<br/>task_interpreter.py]
    I -->|write task_prior.*| M[(Shared Memory<br/>SQLite)]
    M --> P[Planner<br/>planner.py]

    P -->|active subtask| S[Deliberation Layer<br/>recommender -> corrector -> judge]
    D[RAG Index<br/>rag_data/index.jsonl] --> I
    D --> P
    D --> S

    S --> C[Capability Manager<br/>reuse -> helper -> install -> replan]
    C --> L[Logistics Layer<br/>env setup / helper generation / tool supplementation]
    L -->|final action/command| E[Executor / Local Environment<br/>curl / sqlmap / bash / ffuf]
    E --> O[Observed Output<br/>stdout / stderr / timing]

    O --> F[Fact Extractor<br/>extract_facts]
    O --> R1[Execution Reflector<br/>reflect_step in solver_shared.py]
    O --> R2[Policy Reflector<br/>reflector.py]
    O --> H[Hypothesis Manager<br/>update_hypotheses]
    O --> PP[Plan Patch<br/>build_plan_patch]

    F -->|facts| M
    R1 -->|reflect.*| M
    R2 -->|controller.*| M
    H -->|hypothesis.*| M
    PP -->|plan.current| M
    M -->|task_prior + plan + facts + hypotheses + reflection| P
    M -->|active subtask + facts + reflection| S

Current runtime stack:

task_interpreter -> planner -> recommender -> corrector -> judge
                   -> capability -> logistics -> executor
executor -> fact extraction / hypothesis updates / execution reflection / policy reflection
          -> plan patch -> planner

Shared-memory data model:

task_prior.*   -> interpreter-produced priors
facts          -> runtime observations and extracted signals
reflect.*      -> failure reason, strategy update, next-step constraints
plan.*         -> current plan, active subtask title/phase/action hint
controller.*   -> policy reflection outputs (must_do / must_avoid / clusters)
hypothesis.*   -> candidate / confirmed / weak_candidate / rejected
run.*          -> challenge-step and capability-step counters
events         -> compact execution trace
flows          -> run-level status
tasks_state    -> objective-level execution status
subtasks_state -> step-level execution status and outcome
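To make the namespaced keys concrete, here is a sketch of a key-value table scoped by run_id. The table name, column layout, helper names, and the example key task_prior.family are assumptions; the real schema in logs/agent_memory.sqlite may differ.

```python
import json
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a key-value store scoped by run_id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv ("
        " run_id TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,"
        " PRIMARY KEY (run_id, key))"
    )
    return conn


def put(conn: sqlite3.Connection, run_id: str, key: str, value) -> None:
    """Write one namespaced key, e.g. 'task_prior.family' (hypothetical)."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (run_id, key, value) VALUES (?, ?, ?)",
        (run_id, key, json.dumps(value)),
    )
    conn.commit()


def get_prefix(conn: sqlite3.Connection, run_id: str, prefix: str) -> dict:
    """Read every key in one namespace, e.g. all 'reflect.' entries."""
    rows = conn.execute(
        "SELECT key, value FROM kv"
        " WHERE run_id = ? AND substr(key, 1, ?) = ? ORDER BY key",
        (run_id, len(prefix), prefix),
    )
    return {k: json.loads(v) for k, v in rows}
```

Scoping every row by run_id is what lets the interpreter and solver share one database file without reading each other's older runs.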

Optional in .env:

OPENAI_AGENT_MODEL=gpt-5.2

Main modules:

rag/index.py, rag/query.py, rag/agent.py, rag/common.py
web_agent/task_interpreter.py
web_agent/planner.py
web_agent/deliberation.py
web_agent/capability.py
web_agent/logistics.py
web_agent/reflector.py
web_agent/solver_shared.py
web_agent/cmd_agent.py

Execution workflow is phase-based:

recon
probe
exploit
extract
verify
summarize and save run log to logs/cmd_agent_last_run.json
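The phase order lends itself to a small state machine. In this sketch, "drift" is taken to mean skipping more than one phase ahead; that definition, like the function names, is an assumption and not the validator's actual rule set.

```python
PHASES = ["recon", "probe", "exploit", "extract", "verify"]


def advance(current: str) -> str:
    """Move to the next phase, staying on 'verify' once it is reached."""
    i = PHASES.index(current)
    return PHASES[min(i + 1, len(PHASES) - 1)]


def is_phase_drift(current: str, proposed: str,
                   allow_backtrack: bool = True) -> bool:
    """Flag a proposed step that jumps ahead past the next phase."""
    cur, nxt = PHASES.index(current), PHASES.index(proposed)
    if nxt > cur + 1:
        return True   # e.g. recon -> exploit skips probe
    if nxt < cur and not allow_backtrack:
        return True   # going back is optional policy (assumed)
    return False
```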

Memory database:

default path: logs/agent_memory.sqlite
shared by interpreter and solver under one run_id
stores:
- task_prior.*
- extracted facts
- reflect.*
- hypothesis.*
designed to be vuln-agnostic (not SQL-only)

Control policy:

phase state machine: recon -> probe -> exploit -> extract -> verify
interpreter priors strongly constrain solver drift
planner owns an explicit ordered subtask list and exposes one active subtask at a time
reflector/controller outputs are used to patch the current plan
action validator uses modular rule registries (semantic rules + controller rules)
failure reasons are normalized and mapped to canonical failure clusters
capability acquisition is a separate loop and does not consume challenge-step budget
each step records an info_gain score from newly discovered facts
each step generates a reflection
hypotheses are explicitly tracked as:

candidate
confirmed
weak_candidate
rejected
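The four hypothesis states and the info_gain score can be sketched as follows. The allowed transitions (confirmed and rejected treated as terminal) and the "count of new facts" scoring are assumptions made for illustration.

```python
from enum import Enum


class Status(str, Enum):
    CANDIDATE = "candidate"
    CONFIRMED = "confirmed"
    WEAK_CANDIDATE = "weak_candidate"
    REJECTED = "rejected"


# Assumed lifecycle: confirmed and rejected are terminal states.
ALLOWED = {
    Status.CANDIDATE: {Status.WEAK_CANDIDATE, Status.CONFIRMED, Status.REJECTED},
    Status.WEAK_CANDIDATE: {Status.CANDIDATE, Status.CONFIRMED, Status.REJECTED},
    Status.CONFIRMED: set(),
    Status.REJECTED: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if new is current:
        return current
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move {current.value} -> {new.value}")
    return new


def info_gain(known_facts, step_facts) -> int:
    """Score a step by how many facts it added that were not known before."""
    return len(set(step_facts) - set(known_facts))
```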

Interpreter behavior:

converts description + shown information into task_prior
identifies likely challenge family and tech stack
proposes:
- primary hypotheses
- secondary hypotheses
- deprioritized routes
- exploit chain candidates
prevents the solver from drifting too early into unrelated routes

Deliberation behavior:

recommender proposes one executable step for the active subtask
corrector aggressively finds weaknesses and may replace that step with a corrected executable proposal
judge selects the final proposal before execution
execution results are written back into memory
reflection and planner patches update subsequent subtasks
supports structured actions for brittle execution chains:
- http_probe_with_baseline
- extract_html_attack_surface
- cookiejar_flow_fetch
- service_recovery_probe
- multipart_upload_with_known_action
- build_jsp_war
- tomcat_manager_read_file

Capability and logistics behavior:

capability resolution detects whether the chosen proposal lacks a required tool or dependency
capability scoring prefers:
- existing structured action
- generated helper script
- controlled install
- replan
logistics strategy selection is model-driven (pip vs system_package_manager vs skip_install) with deterministic safety fallback
logistics executes support work for the chosen option
logistics work is tracked as capability_steps and does not consume challenge_steps
install targets are dynamic (not a fixed allowlist), but install timing is constrained by capability-policy and step accounting
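A simplified sketch of the preference order and the separate step accounting described above. It treats the preference as a strict ordering, whereas the text describes a scoring step; the option labels and class names are assumptions.

```python
from dataclasses import dataclass

# Preference order from the list above, most preferred first.
PREFERENCE = ["structured_action", "helper_script", "controlled_install", "replan"]


def resolve_capability(available: set) -> str:
    """Pick the most preferred option that is available; replan is the fallback."""
    for option in PREFERENCE:
        if option in available:
            return option
    return "replan"


@dataclass
class StepCounters:
    """Keep challenge steps and capability (logistics) steps apart."""
    challenge_steps: int = 0
    capability_steps: int = 0

    def record_challenge(self) -> None:
        self.challenge_steps += 1

    def record_capability(self) -> None:
        # Logistics work never touches the challenge-step budget.
        self.capability_steps += 1
```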

Quick Fuzz (No LLM)

./scripts/run_quick_fuzz.sh "http://127.0.0.1:8080/" path path-small
./scripts/run_quick_fuzz.sh "http://127.0.0.1:8080/search" param-value ssti

Supported default wordlists:

path-small
param-names
ssti
xss
cmdi

FAQ

Q: What does "mako" mean in this project?
A: mako refers to Hitachi Mako (常陆茉子) and has no other meaning.
