feat(WIP): Add Chebyshev distance ($L_∞$ norm) support by cbueth · Pull Request #286 · sdd/kiddo

cbueth · 2026-02-07T22:49:21Z

This PR adds support for Chebyshev distance to kiddo's distance metrics, for both the float and fixed variants. This required revisiting two fundamental assumptions in the code:

1. The leaf-node distance calculations had hard-coded sum-based aggregation
2. The pruning logic's distance estimation had a hard-coded sum-based lower-bound computation

Initially, when plugging in my Chebyshev struct, I noticed discrepancies in the k-th nearest distance when using kdtree::nearest_n(). Working through (1.) and (2.) brought Chebyshev on par with brute-force results (computed with the same dist that did not work with nearest_n before) and with the scipy Python implementation, which uses a cKDTree under the hood that has been validated extensively (https://github.com/scipy/scipy/tree/v1.17.0/scipy/spatial/ckdtree/src). With the changes, all existing tests for the Manhattan and SquaredEuclidean metrics still pass. However, because of performance considerations around the already optimised code and pruning, I decided to add a distinction for max-based metrics.

1. Adapt Leaf-Node Distance Calculations

The existing code hard-coded sum-based distance aggregation (+=) for all metrics, which works for $\mathcal{L}_1$ (Manhattan) and $\mathcal{L}_2$ (SquaredEuclidean), but is incorrect for $\mathcal{L}_∞$ (Chebyshev).
The leaf-node remainder loops (i.e. nearest_n_within, best_n_within and dists_for_chunk) now use distance = D::accumulate(distance, D::dist1(...)) for Manhattan, SquaredEuclidean and Chebyshev alike. Secondly, because Chebyshev uses the actual distance (not squared), points exactly at the radius must be included, i.e. the check becomes if distance <= radius. This has been changed in multiple places and in the test helpers; please let me know if there is a good reason to keep if distance < radius. I do not expect this latter change to affect performance noticeably, but it does change how points exactly on the border are treated.
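To illustrate the two leaf-node changes described above, here is a minimal sketch of a remainder-style loop for Chebyshev with an inclusive radius check. It is not the code from the PR: the function name `within_chebyshev` and the plain-array layout are assumptions made for the example, and the `dist1` and `accumulate` steps are written inline.

```rust
/// Hypothetical stand-in for a leaf-node remainder loop: returns the indices
/// of all items whose Chebyshev distance to `query` is at most `radius`.
pub fn within_chebyshev<const K: usize>(
    items: &[[f64; K]],
    query: &[f64; K],
    radius: f64,
) -> Vec<usize> {
    let mut hits = Vec::new();
    for (idx, item) in items.iter().enumerate() {
        let mut distance = 0.0_f64;
        for (a, b) in item.iter().zip(query.iter()) {
            let delta = (a - b).abs(); // per-axis contribution (the `dist1` step)
            distance = distance.max(delta); // max-based aggregation (the `accumulate` step)
        }
        // Inclusive comparison: points exactly on the border are kept.
        if distance <= radius {
            hits.push(idx);
        }
    }
    hits
}
```

For the items `[1.0, 0.5]`, `[1.5, 0.0]`, `[0.25, -1.0]` and `[0.0, 0.0]`, a query at the origin with radius 1.0 returns indices 0, 2 and 3. With a strict `<` comparison only index 3 would remain, since the first and third items lie exactly on the border.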

Tests

The changes have been tested with several tests:

Specific metric tests directly testing dist and dist1
Parametrised nearest_n tests (2 tree types × 2 scenarios × 6 n values × 4 dimensions = 96)
within query tests
Checked that the 3rd-closest distance of nearest_n matches scipy when querying once from each point in the input data:
- Test scenario data (NoTies/Ties with 1-4 dimensions)
- Gaussian data: 200 points, dimensions 1-5, Chebyshev, SquaredEuclidean and Manhattan metrics
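A brute-force reference of the kind used for such comparisons can be sketched as follows. This is an illustrative helper, not the test code from the PR; the names `chebyshev` and `kth_nearest_chebyshev` are hypothetical. Note that when the query is itself one of the data points, the nearest distance is 0.

```rust
/// Chebyshev (L∞) distance between two K-dimensional points.
pub fn chebyshev<const K: usize>(a: &[f64; K], b: &[f64; K]) -> f64 {
    a.iter()
        .zip(b.iter())
        .fold(0.0, |acc, (x, y)| acc.max((x - y).abs()))
}

/// Brute-force k-th nearest distance (1-based `k`), for cross-checking
/// tree results. Returns `None` if `k` is 0 or exceeds the number of points.
pub fn kth_nearest_chebyshev<const K: usize>(
    points: &[[f64; K]],
    query: &[f64; K],
    k: usize,
) -> Option<f64> {
    if k == 0 {
        return None;
    }
    let mut dists: Vec<f64> = points.iter().map(|p| chebyshev(p, query)).collect();
    // `total_cmp` gives a total order on f64, so the sort cannot panic on NaN.
    dists.sort_by(|a, b| a.total_cmp(b));
    dists.get(k - 1).copied()
}
```

A tree query is then considered correct for a given point if its k-th distance agrees with this value up to a small tolerance.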

2. Adapt Pruning Logic for $\mathcal{L}_∞$ Distance

Testing with the previous changes showed that queries still failed for Chebyshev when scaling to 2000 Gaussian points. When querying from most points of the dataset, the 3rd-nearest distance was correct to within 1e-10, but individual queries were off by ±0.03, which is significant. The problem grew with the number of dimensions. With the original rd += delta pruning on 2000 Gaussian points, the Chebyshev errors were as follows (a query counts as affected if the difference to the scipy implementation is >1e-10):

2D: 0.009 max error, 0.2% affected queries (4/2000)
3D: 0.084 max error, 1.2% affected queries (24/2000)
4D: 0.084 max error, 2.9% affected queries (58/2000)
5D: 0.213 max error, 5.6% affected queries (112/2000)
6D: 0.373 max error, 8.5% affected queries (170/2000)
7D: 0.250 max error, 11.05% affected queries (221/2000)
8D: 0.240 max error, 15.05% affected queries (301/2000)

The issue was in the pruning logic, which uses a distance estimate rd to decide whether to explore a subtree. Previously, this always used + aggregation: rd = Axis::rd_update(rd, D::dist1(new_off, old_off));, where rd_update always returned rd + delta. This is mathematically incorrect for Chebyshev, because the sum over axes can exceed the true maximum, so subtrees that still contain true neighbours get pruned:

For $\mathcal{L}_1$ / $\mathcal{L}_2$: rd = $\sum$|axis_diff| (or $\sum$ axis_diff² for SquaredEuclidean) is a correct lower bound
For $\mathcal{L}_∞$: rd should be max(|axis_diff|)
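The over-pruning can be reproduced with a small worked example. The sketch below is purely illustrative: the pruning rule `should_prune` and the numbers are assumptions chosen for the example, not taken from kiddo. A candidate at (0.5, 0.5) is 0.5 away from the origin under $\mathcal{L}_∞$, yet a sum-based estimate of 0.5 + 0.5 = 1.0 exceeds a current k-th best of 0.75 and discards its subtree.

```rust
/// Chebyshev (L∞) distance between two 2-D points.
fn chebyshev(a: [f64; 2], b: [f64; 2]) -> f64 {
    (a[0] - b[0]).abs().max((a[1] - b[1]).abs())
}

/// Hypothetical pruning rule: skip a subtree when its lower-bound
/// estimate `rd` already exceeds the current k-th best distance.
fn should_prune(rd: f64, kth_best: f64) -> bool {
    rd > kth_best
}

/// Sum-based update, as in the original `rd + delta`.
fn rd_sum(rd: f64, delta: f64) -> f64 {
    rd + delta
}

/// Max-based update, as needed for L∞.
fn rd_max(rd: f64, delta: f64) -> f64 {
    rd.max(delta)
}

/// Returns (pruned with sum, pruned with max, true distance to the candidate).
pub fn over_pruning_demo() -> (bool, bool, f64) {
    let query = [0.0, 0.0];
    let candidate = [0.5, 0.5];
    let kth_best = 0.75; // assumed current k-th best distance
    let per_axis = [0.5, 0.5]; // assumed per-axis offsets to the candidate's subtree
    let sum = per_axis.iter().fold(0.0_f64, |rd, &d| rd_sum(rd, d));
    let max = per_axis.iter().fold(0.0_f64, |rd, &d| rd_max(rd, d));
    (
        should_prune(sum, kth_best),
        should_prune(max, kth_best),
        chebyshev(query, candidate),
    )
}
```

`over_pruning_demo()` returns `(true, false, 0.5)`: the sum-based estimate prunes a subtree holding a point that is closer than the current k-th best, while the max-based estimate keeps it.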

This means the pruning needs to be metric aware. The simplest solution I see is extending the DistanceMetric trait with metric-specific aggregation:

pub trait DistanceMetric<A, const K: usize> {
    fn dist(a: &[A; K], b: &[A; K]) -> A;
    fn dist1(a: A, b: A) -> A;
    fn accumulate(rd: A, delta: A) -> A;
}

Manhattan and SquaredEuclidean keep fn accumulate(rd: A, delta: A) -> A { rd + delta }. Chebyshev now has fn accumulate(rd: A, delta: A) -> A { rd.max(delta) }. At the same time, this deprecates the rd_update(rd, D::dist1(...)) call pattern.
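As a self-contained sketch of how the three metrics differ only in dist1 and accumulate, the following simplified version of the trait is fixed to `f64` and derives `dist` from the other two methods. Both simplifications are assumptions made for the example; in the PR the trait is generic over `A` and `dist` is a required method.

```rust
/// Simplified stand-in for the trait above, fixed to `f64` for brevity.
pub trait DistanceMetric<const K: usize> {
    /// Per-axis contribution to the distance.
    fn dist1(a: f64, b: f64) -> f64;
    /// Combines a running value with one more per-axis contribution.
    fn accumulate(rd: f64, delta: f64) -> f64;
    /// Full distance, folded from `dist1` and `accumulate` (sketch only).
    fn dist(a: &[f64; K], b: &[f64; K]) -> f64 {
        a.iter().zip(b.iter()).fold(0.0, |acc, (&x, &y)| {
            <Self as DistanceMetric<K>>::accumulate(acc, <Self as DistanceMetric<K>>::dist1(x, y))
        })
    }
}

pub struct Manhattan;
pub struct SquaredEuclidean;
pub struct Chebyshev;

impl<const K: usize> DistanceMetric<K> for Manhattan {
    fn dist1(a: f64, b: f64) -> f64 { (a - b).abs() }
    fn accumulate(rd: f64, delta: f64) -> f64 { rd + delta } // sum-based
}

impl<const K: usize> DistanceMetric<K> for SquaredEuclidean {
    fn dist1(a: f64, b: f64) -> f64 { (a - b) * (a - b) }
    fn accumulate(rd: f64, delta: f64) -> f64 { rd + delta } // sum-based
}

impl<const K: usize> DistanceMetric<K> for Chebyshev {
    fn dist1(a: f64, b: f64) -> f64 { (a - b).abs() }
    fn accumulate(rd: f64, delta: f64) -> f64 { rd.max(delta) } // max-based
}
```

For the points (0, 0) and (3, -4), this yields 7 for Manhattan, 25 for SquaredEuclidean and 4 for Chebyshev, for example via `<Chebyshev as DistanceMetric<2>>::dist(&[0.0, 0.0], &[3.0, -4.0])`.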

After this change, all previous tests still pass, and Manhattan and SquaredEuclidean have 0% errors across all dimensions as before. Chebyshev now does too, meaning the pruning logic is correctly adapted to $\mathcal{L}_∞$ as well.

I am happy to receive feedback on the changes so I can incorporate it (for example, moving the tests to a preferred place). I am not too happy about extending the DistanceMetric trait, which breaks existing implementations, but since it adds new functionality I think this is a reasonable price to pay. I have double-checked the pruning strictness and tried to pin down where accumulate is needed, and I think most of the code stays untouched. (I cannot remove the (WIP) from the title.)

fixes bug where nearest_n_within accessed self.content_items instead of remainder_items for remainder elements, causing incorrect results when dataset size % CHUNK_SIZE != 0. Also removed unnecessary unsafe code in best_n_within. Signed-off-by: Markus Zoppelt <markus.zoppelt@helsing.ai>

tests nearest_n_within with size-33 dataset to verify items in remainder region are found correctly. Before the fix, this would access self.content_items[0] instead of remainder_items[0], returning wrong items. Signed-off-by: Markus Zoppelt <markus.zoppelt@helsing.ai>

sdd

Carlson, I really appreciate that you've taken the time to craft such a thoughtful PR. It's not often that I get contributions that are as thorough and considered as this one is - thanks very much!

I have a small number of points:

The switch from Axis::rd_update to DistanceMetric::accumulate makes total sense. I think that if we add a default implementation for DistanceMetric::accumulate, containing the saturating_add / addition-based method that the SquaredEuclidean and Manhattan metrics implement, then this would make the change non-breaking for anyone who depends on the DistanceMetric trait as-is.
I do place a lot of weight on strictly conforming to semantic versioning, and so I prefer the rd_update deprecation approach that you went with in the PR rather than fully getting rid of it as suggested in the comment, whilst we're on the 5.x.x branch. v6 is an almost complete rewrite that I've been working on since September, and I'll be incorporating DistanceMetric::accumulate into there so we can carry over Chebyshev, and eliminating rd_update in v6.
Changing dist < radius to dist <= radius could be considered a breaking change, but after reading up on it, <= is probably what most users would be expecting already anyway, as that aligns with mathematical / geometric definitions and with what most other libraries seem to be doing, even if the word "within" may imply <. So I'm happy to treat this point as a bug-fix rather than a breaking functional change. Adding a note to the doc-comments of the radius query methods themselves would help make this explicit.
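The default-implementation idea from the first point above can be sketched as follows. This is a hypothetical, simplified trait over `u16` values, chosen only to show the saturating_add default. It is not kiddo's actual fixed-point trait, and `dist` is omitted.

```rust
/// Hypothetical, simplified metric trait over `u16` values.
pub trait DistanceMetric {
    /// Per-axis contribution to the distance.
    fn dist1(a: u16, b: u16) -> u16;

    /// Default: sum-based aggregation that saturates instead of overflowing.
    fn accumulate(rd: u16, delta: u16) -> u16 {
        rd.saturating_add(delta)
    }
}

/// A metric written before `accumulate` existed: it compiles unchanged
/// because it inherits the default.
pub struct Manhattan;

impl DistanceMetric for Manhattan {
    fn dist1(a: u16, b: u16) -> u16 {
        a.abs_diff(b)
    }
}

/// A max-based metric only needs to override `accumulate`.
pub struct Chebyshev;

impl DistanceMetric for Chebyshev {
    fn dist1(a: u16, b: u16) -> u16 {
        a.abs_diff(b)
    }

    fn accumulate(rd: u16, delta: u16) -> u16 {
        rd.max(delta)
    }
}
```

Because the new method has a default body, existing implementors of the trait do not need to change, which is what keeps the addition within a minor release under semantic versioning.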

codecov · 2026-02-08T18:23:39Z

Codecov Report

❌ Patch coverage is 95.79125% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.99%. Comparing base (2056051) to head (df79ff7).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/fixed/distance.rs	84.68%	15 Missing and 2 partials ⚠️
src/float/distance.rs	98.24%	3 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #286      +/-   ##
==========================================
+ Coverage   94.89%   94.99%   +0.09%     
==========================================
  Files          54       54              
  Lines        5705     6273     +568     
  Branches     5705     6273     +568     
==========================================
+ Hits         5414     5959     +545     
- Misses        273      289      +16     
- Partials       18       25       +7

sdd · 2026-02-08T19:03:42Z

One more thing - would you be able to re-raise this PR with your changes but with the target as v5.x.x rather than master please? There are some changes on master that were merged in but I wanted to hold back from releasing on v5 as they are breaking. Hopefully I'll finish the v6 changes soon and I can return to the more sensible state of master being the target branch for PRs.

cbueth · 2026-02-08T20:03:46Z

Thanks for the appreciative response and the suggestions on this PR and #287, Scott! I agree with all of them and will open two rebased PRs with the integrated changes, certainly sometime this week.

I tried to keep + as the default behaviour, but somehow did not think of simply using saturating_add; will do.
Finally, I was not able to find a test where < vs <= makes a difference, but in principle I agree that <= could strictly be the right choice. For now I have changed some of them back to <. You can see the two places where it was added in commit 5b02ece (which removed D::IS_MAX_BASED; NON_STRICT_PRUNING would be a better name). I'd suggest you treat this as a separate issue.

Great news that the rewrite is nearly done, good work. When choosing the branch to build on, I saw there were quite a few branches with restructured and rewritten code. Feel free to let me know if there are any isolated issues to work on or changes to review.

sdd and others added 21 commits December 8, 2025 07:07

ci: Update CI workflow triggers to include PR and workflow_dispatch

aac7f18

style: remove unnecessary parentheses

46b0c56

ci: permit coverage to run for PRs as well

809d457

deps: remove doc-comment dependency and use doc attribute that was ad…

ed3b7d7

…ded in Rust 1.54 instead

chore: use doc attribute instead of doc_comment!

18c8bb3

style: fix formatting

b4a40ce

fix: use try_from() with error for leaf_items.len()

6c5bcbf

If leaf_items.len() exceeds u32::MAX (~4.3 billion), this silently truncates. For datasets with billions of points, this is realistic and causes severe corruption.

chore(deps): update actions/checkout action to v6

f3f2ec3

chore(deps): update codspeedhq/action action to v4

5e3ee0f

chore(deps): update ad-m/github-push-action action to v1

695c97c

chore(deps): update rust crate rstest to 0.26

9a1e218

chore(deps): update rust crate codspeed-criterion-compat to v4

db1fbe9

ci: fix release-plz and add commitlint

aa7b565

* release-plz checkout depth fixed so that full changelogs are generated * add commitlint with conventional commits config

Added WithinUnsortedIterOwned

6c2940b

fix: update to use transform function

ba36d8d

refactor: remove within_unsorted_iter_owned in favour of modifying wi…

7b88de8

…thin_unsorted_iter within_unsorted_iter is modified to decouple the lifetime of the iterator from that of the query by performing a generally very cheap copy just once at the start of the query

docs: update Cargo.toml, changelog and docs for 5.2.3

6e1afdd

deps: bump cmov to 0.4 as all other versions were yanked

e746edf

see RustCrypto/utils#1304

docs: update changelog, readme, and Cargo.toml for 5.2.4 release

bc89bc5

cbueth marked this pull request as draft February 7, 2026 22:49

cbueth marked this pull request as ready for review February 8, 2026 00:44

cbueth mentioned this pull request Feb 8, 2026

feat: Generalised Minkowski Metric (L_p norm) #287

Open

1 task

sdd requested changes Feb 8, 2026

chore: update changelog

f87b965

test: Add coverage for Manhattan and Squared Euclidean distance metrics

47db176

cbueth added 11 commits February 12, 2026 22:56

feat: Add Chebyshev distance metric and test coverage

8178908

test: add integration tests for Chebyshev, Manhattan, and Squared Euc…

57dff79

…lidean distance metrics

inclusive radius matching and leaf note remainder loops

b1fe052

fix over-pruning for L_inf

4a59cb8

Add D::accumulate and D::IS_MAX_BASED

28859a9

- deprecate `rd_update` with `D::accumulate` for consistent handling of sum-based and max-based metrics - conditional logic for SIMD (L1/L2) and general L∞ - differentiate distance accumulation behaviour

feat: add fixed Chebyshev distance metric

c8dcb3e

- integration `nearest_n` tests (Chebyshev, Manhattan, SquaredEuclidean).

refactor: in-loop accumulation for max-based metrics

3f32f6b

unify distance accumulation logic with D::accumulate

93b98c1

remove D::IS_MAX_BASED, unify heap logic, improve DistanceMetric …

c934656

…doc, add Gaussian scenario to tests

change test comment & lint

fa3361f

refactor: make metric property tests reusable

7ebb80b

cbueth force-pushed the feature/chebychev branch from df79ff7 to 7ebb80b Compare February 12, 2026 22:01

chore: add default implementation of accumulate to DistanceMetric…

44a0f1e

… trait - improve test coverage

cbueth mentioned this pull request Feb 12, 2026

feat: Add Chebyshev distance (L_∞ norm) support #290

Open

chore: saturating add for fixed metrics

7adffe8

cbueth mentioned this pull request Feb 13, 2026

feat: Generalised Minkowski Metric (L_p norm) #291

Open

cbueth wants to merge 36 commits into sdd:master from cbueth:feature/chebychev

5 participants
