vedanthvdev · vedanthvdev · Apr 22, 2026 · Apr 22, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,113 @@ adheres to [Semantic Versioning](https://semver.org/).
 
 ## [Unreleased]
 
+### Added — discovery-incomplete signal + nested `gradlew` timeout (v1.9.22)
+
+The two deferred findings from the v1.9.21 reliability batch: both close
+long-tail silent-failure modes that could leave an operator convinced a
+merge-gate run was green when it was in fact half-blind or stuck.
+
+- **`Situation.DISCOVERY_INCOMPLETE` now fires whenever any scanned Java
+  file fails to parse.** Before this release, `JavaParsers.parseOrWarn`
+  emitted a `WARN` and returned `null`, and the strategy call site
+  silently continued — the dispatch map was built from whatever happened
+  to parse. For a branch with one mid-refactor syntax error that meant
+  `DISCOVERY_SUCCESS` with a quietly-shrunk selection, which is exactly
+  the scenario where the merge-gate safety net needs to pay out.
+  `ProjectIndex#compilationUnit` now increments a per-run
+  `parseFailureCount` at the cache boundary (de-duplicated across
+  strategies — a file that fails to parse once counts once, no matter
+  how many of naming / usage / implementation / transitive ask for it).
+  `AffectedTestsEngine#run` consults that count after discovery and
+  routes through the new situation before the old `DISCOVERY_EMPTY` /
+  `DISCOVERY_SUCCESS` branches. The mode-default resolution matches the
+  existing asymmetry — `CI` and `STRICT` escalate to `FULL_SUITE`
+  (safety wins over speed on merge-gate runs), `LOCAL` stays on
+  `SELECTED` (iteration speed wins for the developer actively editing
+  the broken file, who already has the `WARN` in their terminal). The
+  new knob is configurable via the Gradle DSL:
+  ```groovy
+  affectedTests {
+      onDiscoveryIncomplete = "FULL_SUITE" // or "SELECTED" / "SKIPPED"
+  }
+  ```
+  A matching `EscalationReason.RUN_ALL_ON_DISCOVERY_INCOMPLETE` feeds
+  `--explain` and the `affectedTest` task's skip-reason / escalation
+  copy so operators can tell at a glance whether a full-suite run was
+  triggered by an empty diff, an unmapped file, or a parse failure.
+  Regression coverage:
+    - `ProjectIndexTest.parseFailureCountIncrementsOnUnparseableFile`
+      pins the cache-boundary counting contract and the de-duplication
+      behaviour (three calls on the same broken file still count as
+      one failure).
+    - `AffectedTestsEngineTest.discoveryIncompleteEscalatesToFullSuiteInCI`
+      and `discoveryIncompleteStillSelectsInLocalMode` pin the
+      CI-vs-LOCAL asymmetry.
+    - `discoveryIncompleteRespectsExplicitOnDiscoveryIncompleteOverride`
+      pins the explicit-wins-over-mode tier.
+    - Three `AffectedTestsConfigTest` cases pin the mode defaults for
+      `CI`, `LOCAL`, and `STRICT` so a future refactor of the resolver
+      can't silently bump zero-config users.
+- **New `affectedTests.gradlewTimeoutSeconds` knob deadlines the nested
+  `./gradlew` invocation.** The task spawns a child `./gradlew :module:test
+  --tests …` for each affected module. Before this release the child
+  was launched via Gradle's `ExecOperations`, which has no timeout
+  surface at all — a hung test kept the CI worker busy until the outer
+  CI-system deadline killed the whole job, which typically surfaces as
+  "pipeline failed with no useful logs" hours later. The new knob is a
+  wall-clock deadline in seconds; `0` (the default) preserves the
+  pre-v1.9.22 "wait forever" behaviour for zero-config callers so the
+  upgrade is safe. Positive values branch the task onto a `ProcessBuilder`
+  watchdog path: the child still streams its output to the operator's
+  terminal via `inheritIO()`, and on timeout the task runs a polite
+  `destroy()` on the wrapper → 10-second grace → `destroyForcibly()`
+  on both the wrapper and every snapshotted descendant (shared Gradle
+  daemons, test JVMs) → 5-second reap ladder before failing the build
+  with a message that names the setting so the operator knows what to
+  raise. Because `./gradlew` connects to a shared daemon JVM by
+  default, killing only the wrapper would leave the actually-hung
+  test JVM running as a grandchild; the watchdog snapshots
+  `ProcessHandle#descendants()` *before* signalling (once the wrapper
+  exits, re-parented daemons become unreachable from
+  `process.descendants()`) and forcibly destroys every live descendant
+  after the wrapper is reaped, regardless of whether the wrapper
+  itself exited gracefully on `SIGTERM` or had to be `destroyForcibly`
+  escalated. This matters because a SIGTERM-responsive wrapper
+  (plain `sh`, most test runners) exits on `destroy()` but leaves its
+  children re-parented to pid 1 — those grandchildren are the real
+  hung workload and would keep holding the CI worker hostage if the
+  descendants-kill only ran on the forcible leg. Operators who need a
+  hard cutoff with no daemon reuse can pass `--no-daemon` at the
+  outer build level.
+  Outer-build cancellation (Ctrl-C, Gradle daemon shutdown) propagates
+  via the same snapshot-then-destroyForcibly path and re-asserts the
+  interrupt. The trade-off for opting in:
+  the watchdog path uses `inheritIO` instead of `ExecOperations`, so
+  Develocity build-scan stream capture of the child's output is not
+  available on timed runs — callers who rely on scan ingestion of the
+  nested output should leave the knob at `0` and enforce a deadline at
+  the CI-job level instead.   Validation lives at the builder gate:
+  `AffectedTestsConfigTest.gradlewTimeoutRejectsNegativeValues` fails
+  when a user sets `-1` (typoed "no timeout"), and
+  `gradlewTimeoutDefaultsToZero` pins the zero-config default so any
+  future refactor that flips the default breaks loudly instead of
+  retroactively deadline-killing every long-running CI on upgrade.
+  Runtime coverage lives in `AffectedTestTaskTimeoutTest`
+  (POSIX-only): `hungChildIsKilledWithinLadderBudget` pins the ladder
+  budget, `sigtermSwallowingChildEscalatesToForcibleKill` exercises
+  the forcible leg against a `trap '' TERM; sleep 60` wrapper,
+  `successfulChildReturnsExitCodeBeforeTimeout` pins both the happy
+  path and non-zero exit propagation, and
+  `descendantsAreTerminatedNotOnlyTheWrapper` writes the grandchild's
+  PID to a pidfile and asserts via `ProcessHandle.of(pid).isAlive()`
+  that the re-parented `sleep` is reaped — the exact grandchild-
+  survival bug the descendants-kill loop closes. The last test was
+  the one that surfaced the "only the forcible leg kills descendants"
+  regression in the first cut of the fix: if the wrapper exited
+  gracefully on SIGTERM the descendants loop never ran and the
+  orphaned sleep survived. The fix now reaps descendants on both
+  legs.
+
 ### Fixed — tier-1 reliability / safety batch (v1.9.21)
 
 The v1.9.20 batch closed every tier-1 *correctness* bug surfaced by the

diff --git a/README.md b/README.md
@@ -222,6 +222,27 @@ affectedTests {
     onAllFilesOutOfScope   = "skipped"
     onUnmappedFile         = "full_suite"  // the key MR-safety knob
     onDiscoveryEmpty       = "full_suite"  // belt-and-braces for CI
+    // Fires when any scanned Java file failed to parse (malformed source,
+    // half-committed refactor, encoding glitch). Mode defaults: CI / STRICT
+    // escalate to full_suite (selection is known to be under-reported so
+    // safety wins); LOCAL stays on selected (the developer sees the WARN
+    // and values iteration speed on a branch they're actively editing).
+    onDiscoveryIncomplete  = "full_suite"
+
+    // ---------------- Child-process deadline (v1.9.22) ----------------
+
+    // Wall-clock timeout in seconds for the nested `./gradlew :module:test`
+    // invocation. 0 (the default) disables the watchdog and matches
+    // pre-v1.9.22 behaviour — the task waits forever. Positive values
+    // spawn the child under a ProcessBuilder watchdog: on timeout the
+    // task runs destroy() → 10s grace → destroyForcibly() → 5s reap and
+    // then fails the build so a hung test never holds the CI worker
+    // hostage. Note: the watchdog path uses inheritIO for the child's
+    // stdio, so Develocity build-scan stream capture of the nested
+    // output is not available when a timeout is set — leave at 0 and
+    // enforce the deadline at the CI-job level if you rely on scan
+    // ingestion of child-process output.
+    gradlewTimeoutSeconds  = 1800  // 30 min; use 3600 for suites with integration tests
 
     // ---------------- Discovery tuning ----------------
 
@@ -309,6 +330,7 @@ Every row below shows the situation the engine resolved, and the action applied
 | Any YAML / Gradle / Liquibase / `.java` outside configured dirs | `UNMAPPED_FILE` | `FULL_SUITE` (via `onUnmappedFile = "full_suite"`) | `onUnmappedFile = "selected"` |
 | No changed files at all | `EMPTY_DIFF` | `SKIPPED` | `onEmptyDiff = "full_suite"` / `mode = strict` |
 | Mapping succeeds but discovery returns zero tests | `DISCOVERY_EMPTY` | `SKIPPED` — or `FULL_SUITE` if `mode=ci`/`strict` | `onDiscoveryEmpty` / `mode` |
+| At least one scanned `.java` file failed to parse (malformed / half-committed / encoding glitch) | `DISCOVERY_INCOMPLETE` (takes precedence over both `DISCOVERY_EMPTY` and `DISCOVERY_SUCCESS`) | `SELECTED` — or `FULL_SUITE` if `mode=ci`/`strict` | `onDiscoveryIncomplete` / `mode` |
 | Mixed diff: Java + unmapped file | `UNMAPPED_FILE` (takes precedence) | `FULL_SUITE` | `onUnmappedFile` — set to `"selected"` to fall through to discovery |
 | `baseRef` not resolvable | `FAILED` | Hard error (prevents silent test skipping in CI) | — |
 | Not a git work tree / JGit I/O error | `FAILED` | Hard error | — |

diff --git a/affected-tests-core/src/main/java/io/affectedtests/core/AffectedTestsEngine.java b/affected-tests-core/src/main/java/io/affectedtests/core/AffectedTestsEngine.java
@@ -91,7 +91,17 @@ public enum EscalationReason {
          * action for {@link Situation#ALL_FILES_OUT_OF_SCOPE} resolved to
          * {@link Action#FULL_SUITE}. v2-only.
          */
-        RUN_ALL_ON_ALL_FILES_OUT_OF_SCOPE
+        RUN_ALL_ON_ALL_FILES_OUT_OF_SCOPE,
+        /**
+         * One or more scanned Java files failed to parse (malformed
+         * source, or a language level beyond {@link io.affectedtests.core.discovery.JavaParsers#LANGUAGE_LEVEL}),
+         * so discovery may have under-reported its selection. The action
+         * for {@link Situation#DISCOVERY_INCOMPLETE} resolved to
+         * {@link Action#FULL_SUITE}, typically via {@link io.affectedtests.core.config.Mode#CI}
+         * or {@link io.affectedtests.core.config.Mode#STRICT} defaults
+         * or an explicit {@code runAllIfNoMatches=true}. v2-only.
+         */
+        RUN_ALL_ON_DISCOVERY_INCOMPLETE
     }
 
     /**
@@ -320,14 +330,57 @@ public AffectedTestsResult run() {
             }
         }
 
+        // Parse failures during index / strategy walks mean the Usage /
+        // Implementation / Transitive tiers may have silently dropped
+        // tests that actually do reference the changed production code.
+        // Surface that as its own situation so callers can pick a policy
+        // (escalate to FULL_SUITE on CI, keep the partial selection on
+        // LOCAL) instead of routing through DISCOVERY_EMPTY /
+        // DISCOVERY_SUCCESS as if the selection were trustworthy. The
+        // count is de-duplicated inside ProjectIndex so a single
+        // unparseable file is not multi-counted across strategies.
+        int parseFailures = index.parseFailureCount();
+        if (parseFailures > 0) {
+            Action action = config.actionFor(Situation.DISCOVERY_INCOMPLETE);
+            log.warn("Affected Tests: discovery observed {} unparseable Java file(s); "
+                            + "selection may be incomplete. Action: {}. See WARN lines above "
+                            + "for the specific files that failed to parse.",
+                    parseFailures, action);
+            if (action == Action.FULL_SUITE || action == Action.SKIPPED) {
+                return emptyResult(Situation.DISCOVERY_INCOMPLETE, action, changedFiles,
+                        mapping.productionClasses(), mapping.testClasses(), buckets);
+            }
+            // SELECTED: honour the situation but fall through to the
+            // shared empty/selected tail below. The tail now picks the
+            // returned situation based on parseFailures so both the
+            // happy path and this one share one log line ({@code "==="
+            // Result: N affected test classes (SITUATION)}) and one
+            // record construction, instead of two near-identical copies
+            // that will drift.
+        }
+
+        Situation finalSituation = parseFailures > 0
+                ? Situation.DISCOVERY_INCOMPLETE
+                : Situation.DISCOVERY_SUCCESS;
+
         if (allTestsToRun.isEmpty()) {
-            Action action = config.actionFor(Situation.DISCOVERY_EMPTY);
-            log.info("Discovery produced no affected tests. Action: {}.", action);
-            return emptyResult(Situation.DISCOVERY_EMPTY, action, changedFiles,
+            // Shared empty tail. For parseFailures == 0 we report
+            // DISCOVERY_EMPTY (the pre-v1.9.22 branch); for
+            // parseFailures > 0 + SELECTED we report DISCOVERY_INCOMPLETE
+            // so --explain, the skip-reason phrase, and the CI log
+            // grep all see a single truthful situation.
+            Situation emptySituation = parseFailures > 0
+                    ? Situation.DISCOVERY_INCOMPLETE
+                    : Situation.DISCOVERY_EMPTY;
+            Action action = config.actionFor(emptySituation);
+            log.info("Discovery produced no affected tests. Situation: {}, Action: {}.",
+                    emptySituation, action);
+            return emptyResult(emptySituation, action, changedFiles,
                     mapping.productionClasses(), mapping.testClasses(), buckets);
         }
 
-        log.info("=== Result: {} affected test classes ===", allTestsToRun.size());
+        log.info("=== Result: {} affected test classes ({}) ===",
+                allTestsToRun.size(), finalSituation);
         // FQNs are derived from diff filenames by the discovery strategies
         // and have not yet been through the AffectedTestTask#isValidFqn
         // gate; sanitise here so an attacker-planted filename like
@@ -344,7 +397,7 @@ public AffectedTestsResult run() {
                 buckets,
                 false,
                 false,
-                Situation.DISCOVERY_SUCCESS,
+                finalSituation,
                 Action.SELECTED,
                 EscalationReason.NONE
         );
@@ -411,6 +464,7 @@ private static EscalationReason legacyReason(Situation situation, Action action)
         return switch (situation) {
             case EMPTY_DIFF -> EscalationReason.RUN_ALL_ON_EMPTY_CHANGESET;
             case DISCOVERY_EMPTY -> EscalationReason.RUN_ALL_IF_NO_MATCHES;
+            case DISCOVERY_INCOMPLETE -> EscalationReason.RUN_ALL_ON_DISCOVERY_INCOMPLETE;
             case UNMAPPED_FILE -> EscalationReason.RUN_ALL_ON_NON_JAVA_CHANGE;
             case ALL_FILES_IGNORED -> EscalationReason.RUN_ALL_ON_ALL_FILES_IGNORED;
             case ALL_FILES_OUT_OF_SCOPE -> EscalationReason.RUN_ALL_ON_ALL_FILES_OUT_OF_SCOPE;