Skip to content

Add 'Size code thresholds from in-codebase distributions' to base/core/quality.md #469

@braboj

Description

@braboj

Pattern from wuseria S147 (#1122) — small but valuable addition to quality conventions.

Pattern

When introducing a numeric threshold to a filter (cutoff, bound, tolerance), measure the actual distribution of the filtered quantity in the working data set (p99, p95, max-legitimate) before picking the threshold. Comment the data source in the constant's docstring so future maintainers can re-validate the threshold against fresh data without re-running the analysis from scratch.

The alternative — picking thresholds from first principles or "feels right" — produces thresholds that mask future bugs by being too loose or break edge cases by being too tight, with no record of how the number was chosen.

Concrete example

wuseria/me-fuji#1122 added _MAX_RUN_LENGTH = 20 to filter vertical chrome out of ridge candidates. The threshold was sized from the reference-set column-run histogram:

median=2 px, p95=3 px, p99=13 px, max legitimate ~17 px
chrome runs measured: 50 px and 265 px

20 admits any real curve ridge (slack above p99=13) but drops chrome (the smallest chrome run is 2.5× the threshold). Calibration validated: only one chart changed, by losing a previously out-of-tolerance reading.

The docstring on the constant records the data source so a future maintainer can rerun the histogram against new reference data and re-validate:

# Real MTF curve ridges measured across the reference set fit within ~3 px
# at p95 and ~13 px at p99; runs >20 px are chrome.
_MAX_RUN_LENGTH: int = 20

Why this belongs upstream

templates/base/core/quality.md §"Magic numbers and magic strings must be named constants" is one half — the other half is how the number gets picked. The named-constant rule prevents scattered literals; the distribution-sized rule prevents arbitrary thresholds masquerading as principled ones.

Proposed wording

Add to templates/base/core/quality.md under the existing "Magic numbers" item or as a sibling item:

  • When introducing a numeric threshold to a filter, cutoff, or bound, size it from the actual distribution of the relevant quantity in the working data set (p95, p99, max-legitimate) — not from first principles or intuition. Document the data source in the constant's docstring so a future maintainer can rerun the analysis against fresh data and re-validate the threshold.

Related to upstream-flag in #468

Companion to #468 ("probe before trusting issue body's mechanism"). Both came from the same session's mtfdigitizer work; both are about grounding decisions in measured runtime data instead of inferred premises.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P3Low — nice to havespikeResearch or exploration

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions