Skip to content

Test overhaul + 2 security bug fixes#140

Open
yyy900 wants to merge 8 commits into
mainfrom
cleanup/test-overhaul
Open

Test overhaul + 2 security bug fixes#140
yyy900 wants to merge 8 commits into
mainfrom
cleanup/test-overhaul

Conversation

@yyy900
Copy link
Copy Markdown
Collaborator

@yyy900 yyy900 commented May 14, 2026

Summary

Test infrastructure overhaul + two security bug fixes uncovered while writing tests.

Two bugs found and fixed during testing (both flagged in commit messages):

  1. bash_parser.py permission bypass — the AST walker stopped at the first word of each command and never descended into commandsubstitution nodes. echo $(rm -rf /) reported only ["echo"], so a user with Bash(echo *) approved could chain in any command via $() or backticks. Now recurses into substitutions.
  2. host/auth.py 500 on malformed timestampabs(now - timestamp) raised TypeError when timestamp was a string, surfacing as 500 instead of 401. Now guarded with isinstance and returns clean unauthorized.

Other changes:

  • Test layout collapsed from 5 dirs (unit/integration/cli/e2e/real_api) to 2: unit/ and e2e/ (with e2e/cli/, e2e/real_api/, e2e/manual/ subdirs). Markers auto-applied by folder.
  • Dead tests removed (1044 LOC): test_smoke, test_imports, test_env_autoload, test_bash_chain_permissions, test_benchmarks, test_announce_only — all integers' worth of assert True / module-level skips / "needs rewrite" markers.
  • Metadata-only tests removed (21 tests across 11 files): assert handler._event_type == 'X', assert len(plugin) == N, plugin == [...] — these break only on cosmetic refactors, not on real bugs. Kept test_events.py (where the decorator IS the unit under test).
  • 70+ new unit tests covering 10 previously-uncovered modules: runtime_input, ulw, bash_parser, fuzzy, element_finder format, session/merge, session/storage, ws_router/ping, agent error paths, auto_compact, scroll, exceptions.
  • pyproject.toml: dropped 4 unused dependencies (litellm, click, toml, PyJWT) — verified zero imports across the repo. click is still installed transitively via typer.
  • Env-dep failures fixed: test_plugin_system.py and test_config_permissions.py were silently failing locally for anyone without OPENAI_API_KEY — now use MockLLM.
  • Docs updated: CLAUDE.md, tests/README.md, TEST_ORGANIZATION.md, e2e/cli READMEs, agent.md; 13 source-file LLM-Note path refs updated.

Numbers

Before After
Unit tests ~1830 1870 (+40 net, more delta below)
Dead test LOC 1044 lines 0
Metadata-only tests 21 0 (outside test_events.py)
0-coverage modules 10+ 0 of the audited list
Unused declared deps 4 0
Test directories 5 2

Test plan

  • pytest tests/unit — 1870 pass, 5 skip (PIL/Windows/optional TUI), 0 fail
  • pytest --collect-only — 2147 collected, 139 deselected (default not real_api and not network)
  • No stale path refs (grep "tested by \[tests/cli\|tests/integration\|tests/real_api" returns 0)
  • Real-API tests not run here (require keys; verify in CI with secrets)
  • CI green on all matrix Python versions

Commits

  1. chore(deps) — drop 4 unused deps
  2. test: overhaul test layout — reorg + dead/metadata cleanup + MockLLM fixes
  3. fix(security) — bash_parser $() recursion
  4. fix(auth) — non-numeric timestamp → 401
  5. test: add unit tests for 10 previously-uncovered modules
  6. docs — test layout docs + LLM-Note paths

@yyy900 yyy900 force-pushed the cleanup/test-overhaul branch from 50c9b70 to 731f702 Compare May 14, 2026 08:54
yf added 8 commits May 14, 2026 19:24
Verified zero imports of these packages across the repo:
- litellm: never imported (project has its own multi-provider LLM abstraction)
- click: never imported directly (transitive via typer, will still install)
- toml: never imported (no tomli/tomllib/tomlkit either)
- PyJWT: never imported (jwt source code removed earlier, declaration left behind)
Layout changes:
- tests/integration/ removed (8 files redistributed: 5 to unit/, 3 to e2e/)
- tests/cli/ → tests/e2e/cli/
- tests/real_api/ → tests/e2e/real_api/
- tests/integration/manual/ → tests/e2e/manual/
- pytest.ini and tests/conftest.py drop the integration marker

Dead-code removal (1044 LOC):
- test_smoke.py (assert True placeholder)
- test_imports.py (zero-assert importlib smoke)
- test_env_autoload.py (module-level skip — never runs in suite)
- test_bash_chain_permissions.py (all @Skip with "needs rewrite")
- test_benchmarks.py (MockLLM "benchmarks" + History API removed)
- test_announce_only.py (skipif CI + if __name__ → demo, not test)

Partial-skip cleanup (5 dead tests removed):
- test_real_gemini_co.py::test_gemini_flash_via_co (permanent skip)
- test_cli_deploy.py: test_deploy_requires_api_key, TestDeployCleanup,
  test_deploy_loads_api_key_from_env_file (skip + assert True + no asserts)
- test_cli_auth_google.py::TestAuthGoogleIntegration (class-level skip)
- test_cli_auth_microsoft.py::TestAuthMicrosoftIntegration (class-level skip)

Metadata-only test removal (21 tests across 11 files):
- All `assert handler._event_type == 'X'` style assertions
- All `assert len(plugin) == N` / `plugin == [...]` literal export checks
- Whole TestXxxPlugin classes when both methods were metadata-only
- Kept test_events.py decorator tests (those test the decorator itself)

Env-dependency fixes:
- test_plugin_system.py: pass llm=MockLLM() instead of model="gpt-4o-mini"
- test_config_permissions.py: same — avoids OPENAI_API_KEY requirement
The AST walker in extract_commands_from_bash and _extract_subcommands stopped
at the first word of a command node and never visited child commandsubstitution
nodes. Result: `echo $(rm -rf /)` reported only ["echo"], so an attacker who
had `Bash(echo *)` approved could chain in any command via $() or `...` and
the chain check would pass.

Fix: after capturing the first word of a command node, recurse into each
part's sub-AST (parts / list / command attrs) so substitutions are visited.

Adds 25 unit tests covering:
- single command / && / || / | / ; chains
- $() and backtick substitution extraction
- nested $(echo $(date))
- permission chain rejection when $() contains unauthorized command
- wildcard / when / source / disallowed-permission paths
extract_and_authenticate did `abs(now - timestamp)` without type-checking the
payload field. A signed request with timestamp="not-a-number" crashed the
handler with TypeError, which surfaces as 500 to the caller instead of a
clean unauthorized response.

Fix: guard with `isinstance(timestamp, (int, float))` and reject as
"unauthorized: timestamp must be numeric" before the math.

Adds 10 attack-surface tests:
- tamper-after-signing (prompt / to / extra field) all rejected
- identity-swap (sig made with key A, from=key B) rejected
- boundary timestamp (exactly at expiry, one past expiry)
- timestamp=0 treated as missing
- string timestamp returns clean error (regression for this fix)
- replay-within-window documented as currently allowed
- hex case-insensitivity verified
70+ new functional tests, no metadata-only assertions:

- test_runtime_input.py (9): IO no-op gates, frame prefix injection,
  trace event fields, empty-prompt skip, missing-id defaults
- test_ulw.py (16): mode_change state transitions, max-turns IO actions
  (continue/switch_mode/unknown/no-IO), prompt poll + injection
- test_fuzzy.py (14): subsequence matching, case-insensitivity, score
  ordering (consecutive > scattered), word-boundary bonuses for / _ - . space
- test_element_finder_format.py (13): LLM-facing format contract — empty
  list, text/placeholder/aria/role/input_type rendering, href query-stripping
  + 30-char truncation, frame=main implicit
- test_session_merge.py (7): iteration tie-breaker, updated-timestamp
  fallback, missing-field defaults
- test_session_storage.py (15): save/get round trip, last-write-wins, TTL
  expiry (running exempt), list dedup + sort, checkpoint 24h TTL
- test_ws_ping.py (3): 30s interval send, multi-period loop, send-failure
  propagation
- test_agent_error_paths.py (7): max iterations message, per-call override,
  stop_signal short-circuit, InsufficientCredits + generic LLM exception
  propagation, tool exception captured in trace
- test_auto_compact.py (15): threshold + message-count gates, io event
  contract (compacting / done / error), _do_compact structure preservation,
  _format_messages_for_summary truncation rules
- test_scroll.py (8 + 2 PIL-skip): strategy fallback chain, exception in
  one strategy doesn't short-circuit, screenshot diff defensiveness
- test_exceptions.py (17): InsufficientCredits / LLMConnectionError /
  ProviderServiceError / ToolRejectedError attribute extraction, message
  formatting, __cause__ preservation, defensive defaults
User-facing docs:
- CLAUDE.md: Test Organization section rewritten for unit + e2e two-layer
  structure; add Test Quality Rules section warning against metadata-only
  assertions
- tests/README.md: directory tree + Run by Category + Test Categories
  sections rewritten
- tests/TEST_ORGANIZATION.md: full rewrite — Strategy / Folder Structure /
  Markers / Decision Tree / Naming / Notes
- tests/e2e/cli/README.md + aws/README.md: sh script paths updated to
  tests/e2e/cli/ prefix
- docs/concepts/agent.md: "pytest -m real_api = integration tests" comment
  corrected

Source LLM-Note docstring refs (13 files):
- tests/cli/test_X.py → tests/e2e/cli/test_X.py
- tests/real_api/test_X.py → tests/e2e/real_api/test_X.py
- bash_parser.py: tests/integration/test_bash_chain_permissions.py (deleted)
  → tests/unit/test_bash_parser.py
… shadow

`useful_plugins/__init__.py` does `from .auto_compact import auto_compact`,
which rebinds the package attribute `connectonion.useful_plugins.auto_compact`
from the *module* to the plugin-export *list* of the same name. As a result
`import connectonion.useful_plugins.auto_compact as ac_module` actually binds
`ac_module` to the list, and on Python 3.10 mock.patch's string-path
resolution also lands on the list, breaking attribute lookup with:

    AttributeError: [<function check_and_compact at 0x...>] does not have
                    the attribute '_do_compact'

Fix the test file to grab the real module from `sys.modules` and use
`patch.object(module, attr)` throughout. Same workaround applied for
`connectonion.llm_do` since `__init__.py` shadows it the same way.

Tests pass on Python 3.13 locally; CI will verify 3.10/3.11/3.12.
…_commands

handle_sub_sync_one / handle_sub_list / handle_sub_remove use Rich console
with inline markup (e.g. `0x` printed as `[bold red]0x[/bold red]`). Under
capsys the colored escape codes show up in the captured stdout and split
the literal substrings the tests look for:

    'alice' is not a \x1b[1;31m0x\x1b[0m address...
                                                 ↑ "not a 0x address" never matches

Fix on the test side: add a tiny `_plain()` helper that strips
`\x1b\[[0-9;]*m` and route every `readouterr().out` assertion through it.

The test failure was a pre-existing upstream issue (fails on plain main),
surfaced again on this PR's rebase target.
@yyy900 yyy900 force-pushed the cleanup/test-overhaul branch from e55ca6b to 1aaa406 Compare May 14, 2026 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant