From 9b9c145f3b6fdee9dba74db22522601b05a22d48 Mon Sep 17 00:00:00 2001 From: Dannon Baker Date: Thu, 25 Jun 2026 14:25:08 +0200 Subject: [PATCH] Warn the brain off coercing NA/missing padj when filtering stats tables DESeq2 writes NA in the padj column for independent-filtered and outlier genes, and a bare awk filter like `$7+0 < 0.05` coerces those NA rows to 0, so every one slips through as "significant" with no error to catch -- in #355 that turned a ~1,560-gene contrast into a reported 8,738 that then propagated downstream. Added a subsection to the verification discipline prompt block: when thresholding a p-value/padj/FDR column, drop non-numeric/missing rows explicitly (prefer Python/R with real NA handling, or guard the column in awk) and sanity-check the surviving count. Scoped to significance filtering with an escape hatch for a deliberate zero-imputation convention, and the awk example flags its own limitation (it only excludes literal NA -- header, blanks, and `.` coerce the same way) so it doesn't get cargo-culted. Mechanical test asserts the load-bearing strings ship in the prompt. --- extensions/loom/context.ts | 31 ++++++++++++++++++++++++++++++ tests/verification-context.test.ts | 18 +++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/extensions/loom/context.ts b/extensions/loom/context.ts index 6f000d45..2fced49a 100644 --- a/extensions/loom/context.ts +++ b/extensions/loom/context.ts @@ -726,6 +726,37 @@ unverified. Do **not** mark the step \`- [x]\` and do **not** say "done" or "complete" for that artifact. Say "created but not verified" and ask for the missing input or approval to change scope. +### Filtering significance tables — never coerce missing values + +When you derive a filtered subset or a count from a statistics table by +thresholding a p-value / \`padj\` / FDR column (or any column where a +missing value is not the same as \`0\`), do not let a bare numeric +coercion turn a missing or non-numeric value into a number. In awk, +\`"NA"+0\` is \`0\`, so \`$7+0 < 0.05\` counts every \`NA\` row as +passing — and the header row, blank fields, and \`.\` coerce to \`0\` the +same way. A DESeq2 results table writes \`NA\` in \`padj\` for +independent-filtered and outlier genes, so this trap silently inflates +the "significant" count with no error to catch; the wrong number looks +completely plausible and flows into downstream figures and tables. + +Exclude non-numeric/missing rows explicitly before comparing: + +- Prefer Python/R with real NA handling (pandas + \`dropna(subset=["padj"])\`, R \`!is.na()\` / \`na.rm\`) — the reliable + default for anything past a trivial one-column filter. +- In awk, guard the column so non-numeric values can't reach the + comparison. \`awk '$7 != "NA" && $7+0 < 0.05'\` only excludes the + literal \`NA\`; the header, blanks, \`.\`, and \`NaN\` still coerce to + \`0\`, so validate the field is actually numeric (or skip the header + and the known sentinels) rather than trusting \`$7+0\` alone. + +Legitimate numeric filtering is fine, and a column with a deliberate +zero-imputation convention is the user's call — the rule is only that a +*missing* value must be dropped, not silently coerced to \`0\`, unless +the user has defined how to impute it. After filtering, sanity-check the +surviving row count against expectation (the row/object-count check +above); an implausible jump usually means a sentinel slipped through. + ### Notebook requirement Every new plan step should include a concrete \`Verification:\` sub-bullet diff --git a/tests/verification-context.test.ts b/tests/verification-context.test.ts index df86dd6f..18021843 100644 --- a/tests/verification-context.test.ts +++ b/tests/verification-context.test.ts @@ -27,6 +27,24 @@ describe("buildVerificationDisciplineBlock", () => { expect(ctx).toContain("element count"); }); + it("warns against coercing missing values when filtering significance tables", () => { + const ctx = buildVerificationDisciplineBlock(); + // Normalize whitespace so these load-bearing phrases survive prose reflow. + const flat = ctx.replace(/\s+/g, " "); + + expect(flat).toContain("Filtering significance tables — never coerce missing values"); + expect(flat).toContain('`"NA"+0` is `0`'); + expect(flat).toContain("$7+0 < 0.05"); + expect(flat).toContain('writes `NA` in `padj`'); + expect(flat).toContain('awk \'$7 != "NA" && $7+0 < 0.05\''); + // The awk example must flag its own limitation (it only catches literal NA). + expect(flat).toContain("only excludes the literal `NA`"); + expect(flat).toContain('dropna(subset=["padj"])'); + expect(flat).toContain("Legitimate numeric filtering is fine"); + expect(flat).toContain("zero-imputation convention is the user's call"); + expect(flat).toContain("sanity-check the surviving row count"); + }); + it("is wired into the assembled system prompt", async () => { const handlers = new Map Promise>(); const pi = {