
feat(WIP): Add Chebyshev distance ($L_\infty$ norm) support #286

Open
cbueth wants to merge 36 commits into sdd:master from cbueth:feature/chebychev

Conversation


@cbueth cbueth commented Feb 7, 2026

This PR adds support for Chebyshev distance to kiddo's distance metrics, both float and fixed. This required revisiting two fundamental assumptions in the code:

  1. The leaf-node distance calculations had hard-coded sum-based aggregation
  2. The pruning logic's distance estimation had hard-coded sum-based lower-bound computation

Initially, when plugging in my Chebyshev struct, I noticed that the k-th nearest distance returned by kdtree::nearest_n() was wrong. Working on (1.) and (2.) brought Chebyshev on par with brute-force results (computed directly via dist, which nearest_n did not agree with before) and with the scipy Python implementation, which uses a cKDTree under the hood that has been validated extensively (https://github.com/scipy/scipy/tree/v1.17.0/scipy/spatial/ckdtree/src). With the changes, all existing tests for the Manhattan and SquaredEuclidean metrics still pass, none failed; but for performance reasons (the existing code and pruning are already optimised) I decided to add a distinction for max-based metrics.

1. Adapt Leaf-Node Distance Calculations

The existing code hard-coded sum-based distance aggregation (+=) for all metrics, which works for $\mathcal{L}_1$ (Manhattan) and $\mathcal{L}_2$ (SquaredEuclidean), but is incorrect for $\mathcal{L}_\infty$ (Chebyshev).
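For reference, the three metrics aggregate per-axis differences as $\mathcal{L}_1(a, b) = \sum_{i=1}^{K} |a_i - b_i|$, $\mathcal{L}_2^2(a, b) = \sum_{i=1}^{K} (a_i - b_i)^2$ (SquaredEuclidean keeps the square, omitting the root), and $\mathcal{L}_\infty(a, b) = \max_{1 \le i \le K} |a_i - b_i|$; only the last one is max-based.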
The leaf-node remainder loops (i.e. nearest_n_within, best_n_within and dists_for_chunk) now use distance = D::accumulate(distance, D::dist1()), which works for Manhattan, SquaredEuclidean and Chebyshev alike. Secondly, because Chebyshev uses the actual distance (not a squared one), points lying exactly at the radius must be included, i.e. distance <= radius. This has been changed in multiple places and test helpers; please let me know if there is a good reason to keep if distance < radius. I think this latter change should not noticeably impair performance, but it does change how points exactly on the border are treated.
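As a minimal sketch of the aggregation change (illustrative only, not kiddo's actual loop; assumes the extended DistanceMetric trait shown further below):

// Illustrative sketch: fold per-axis contributions through the metric's
// accumulate instead of a hard-coded `+=`. `D` is a DistanceMetric, and
// A::default() is assumed to be zero (true for kiddo's float and fixed axes).
fn leaf_distance<A: Copy + Default, D: DistanceMetric<A, K>, const K: usize>(
    query: &[A; K],
    item: &[A; K],
) -> A {
    let mut distance = A::default();
    for axis in 0..K {
        // previously: distance += D::dist1(query[axis], item[axis]);
        distance = D::accumulate(distance, D::dist1(query[axis], item[axis]));
    }
    distance
}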

Tests

The changes have been verified with several tests:

  • Specific metric tests directly testing dist and dist1 (a sketch of this kind of check follows the list)
  • Parametrised nearest_n tests (2 tree types × 2 scenarios × 6 n values × 4 dimensions = 96)
  • within query tests
  • Checked that the 3rd-closest distance from nearest_n matches scipy when querying once from each point in the input data:
    • Test scenario data (NoTies/Ties with 1-4 dimensions)
    • Gaussian data: 200 points, dimensions 1-5, Chebyshev, SquaredEuclidean and Manhattan metrics
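A hypothetical sketch of the direct metric checks (illustrative values and assertions, not verbatim from the test suite; assumes the new float Chebyshev metric):

#[test]
fn chebyshev_dist_is_max_axis_difference() {
    // Chebyshev distance is the largest per-axis absolute difference.
    let a = [1.0_f64, 4.0, -2.0];
    let b = [3.0_f64, 1.0, -2.5];
    // per-axis absolute differences: 2.0, 3.0, 0.5 -> max is 3.0
    let d = Chebyshev::dist(&a, &b);
    assert!((d - 3.0).abs() < 1e-12);
}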

2. Adapt Pruning Logic for $\mathcal{L}_\infty$ Distance

Testing with the previous changes showed that queries still failed for Chebyshev when scaling to 2000 Gaussian points. When querying from most points of the dataset, the 3rd-nearest distance was correct to better than 1e-10, but individual queries were off by up to ±0.03, which is quite significant, and the problem grew with dimension. With the original rd += delta pruning on 2000 Gaussian points, the Chebyshev errors were (a query counts as affected if its difference to the scipy implementation exceeds 1e-10):

  • 2D: 0.009 max error, 0.2% affected queries (4/2000)
  • 3D: 0.084 max error, 1.2% affected queries (24/2000)
  • 4D: 0.084 max error, 2.9% affected queries (58/2000)
  • 5D: 0.213 max error, 5.6% affected queries (112/2000)
  • 6D: 0.373 max error, 8.5% affected queries (170/2000)
  • 7D: 0.250 max error, 11.05% affected queries (221/2000)
  • 8D: 0.240 max error, 15.05% affected queries (301/2000)

The issue was in the pruning logic, which uses a distance estimate rd to decide whether to explore subtrees. Previously, this always used + aggregation, rd = Axis::rd_update(rd, D::dist1(new_off, old_off));, where rd_update always returned rd + delta. This is mathematically incorrect for Chebyshev (a worked example follows the list):

  • For $\mathcal{L}_1$ / $\mathcal{L}_2$: rd = $\sum$ |axis_diff| is a correct lower bound
  • For $\mathcal{L}_\infty$: rd should be $\max$ |axis_diff| instead
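Worked example: suppose the query's offsets from the splitting planes of a subtree are 0.2 and 0.3 on two axes. Sum-based accumulation gives rd = 0.5, but the true Chebyshev lower bound is max(0.2, 0.3) = 0.3. If the current k-th best distance is 0.4, the subtree is wrongly pruned (0.5 > 0.4) even though it may contain a point at Chebyshev distance as small as 0.3. This over-pruning produces exactly the sporadic, dimension-dependent errors listed above.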

This means the pruning needs to be metric-aware. The simplest solution I see is extending the DistanceMetric trait with metric-specific aggregation:

pub trait DistanceMetric<A, const K: usize> {
    /// Full distance between two K-dimensional points.
    fn dist(a: &[A; K], b: &[A; K]) -> A;
    /// Distance contribution of a single axis pair.
    fn dist1(a: A, b: A) -> A;
    /// Metric-specific aggregation of per-axis contributions (new in this PR).
    fn accumulate(rd: A, delta: A) -> A;
}

Manhattan and SquaredEuclidean keep fn accumulate(rd: A, delta: A) -> A { rd + delta }, while Chebyshev uses fn accumulate(rd: A, delta: A) -> A { rd.max(delta) }. At the same time, this deprecates rd_update(rd, D::dist1(...));.
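For illustration, a minimal float-side Chebyshev implementation under the extended trait could look like this (a sketch assuming A: num_traits::Float; kiddo's actual bounds and module layout may differ):

use num_traits::Float;

pub struct Chebyshev;

impl<A: Float, const K: usize> DistanceMetric<A, K> for Chebyshev {
    fn dist(a: &[A; K], b: &[A; K]) -> A {
        // L-infinity distance: the largest per-axis absolute difference.
        a.iter()
            .zip(b.iter())
            .map(|(&x, &y)| (x - y).abs())
            .fold(A::zero(), |acc, d| acc.max(d))
    }

    fn dist1(a: A, b: A) -> A {
        (a - b).abs()
    }

    fn accumulate(rd: A, delta: A) -> A {
        // Max-based aggregation: keep a running maximum, not a running sum.
        rd.max(delta)
    }
}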

After this change, all previous tests still pass, Manhattan and SquaredEuclidean have 0% errors across all dimensions as before, and now Chebyshev does too, meaning the pruning logic is correctly adapted to $\mathcal{L}_\infty$ as well.


I am happy to receive feedback on the changes so I can incorporate it (e.g. moving tests to a preferred place). I am not too happy about extending the DistanceMetric trait, since it breaks existing implementations, but as it adds new functionality this seems a reasonable price to pay. I have double-checked the pruning strictness and tried to nail down where to use accumulate; I think most code stays untouched. (I cannot remove the (WIP) from the title.)

sdd and others added 21 commits December 8, 2025 07:07
fixes bug where nearest_n_within accessed self.content_items instead of
remainder_items for remainder elements, causing incorrect results when
dataset size % CHUNK_SIZE != 0. Also removed unnecessary unsafe code in
best_n_within.

Signed-off-by: Markus Zoppelt <markus.zoppelt@helsing.ai>
tests nearest_n_within with size-33 dataset to verify items in remainder
region are found correctly. Before the fix, this would access
self.content_items[0] instead of remainder_items[0], returning wrong items.

Signed-off-by: Markus Zoppelt <markus.zoppelt@helsing.ai>
If leaf_items.len() exceeds u32::MAX (~4.3 billion), this silently
truncates. For datasets with billions of points, this is realistic and
causes severe corruption.
* release-plz checkout depth fixed so that full changelogs are generated
* add commitlint with conventional commits config
…thin_unsorted_iter

within_unsorted_iter is modified to decouple the lifetime
of the iterator from that of the query by performing a
generally very cheap copy just once at the start of the query
@cbueth cbueth marked this pull request as draft February 7, 2026 22:49
@cbueth cbueth marked this pull request as ready for review February 8, 2026 00:44
@sdd sdd (Owner) left a comment

Carlson, I really appreciate that you've taken the time to craft such a thoughtful PR. It's not often that I get contributions that are as thorough and considered as this one is - thanks very much!

I have a small number of points:

  • The switch from Axis::rd_update to DistanceMetric::accumulate makes total sense. I think that if we add a default implementation for DistanceMetric::accumulate containing the saturating_add / addition-based method that the SquaredEuclidean and Manhattan metrics implement, this would make the change non-breaking for anyone who depends on the DistanceMetric trait as-is (see the sketch after this list).
  • I place a lot of weight on strictly conforming to semantic versioning, so whilst we're on the 5.x.x branch I prefer the rd_update deprecation approach that you went with in the PR over fully getting rid of it as suggested in the comment. v6 is an almost complete rewrite that I've been working on since September; I'll be incorporating DistanceMetric::accumulate into it so we can carry Chebyshev over, and eliminating rd_update in v6.
  • Changing dist < radius to dist <= radius could be considered a breaking change, but after reading up on it, <= is probably what most users expect already: it aligns with the mathematical / geometric definition and with what most other libraries seem to do, even if the word "within" may imply <. So I'm happy to treat this point as a bug-fix rather than a breaking functional change. Adding a clarification to the doc-comments of the radius query methods themselves would help make this explicit.
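A sketch of that non-breaking default (illustrative; assumes an Add bound on the axis type, which kiddo's float axes satisfy, while the fixed variant would use saturating_add instead):

// Illustrative sketch: give accumulate a sum-based default body so existing
// DistanceMetric implementors keep compiling; only max-based metrics override it.
pub trait DistanceMetric<A: core::ops::Add<Output = A>, const K: usize> {
    fn dist(a: &[A; K], b: &[A; K]) -> A;
    fn dist1(a: A, b: A) -> A;

    fn accumulate(rd: A, delta: A) -> A {
        rd + delta
    }
}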


codecov bot commented Feb 8, 2026

Codecov Report

❌ Patch coverage is 95.79125% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.99%. Comparing base (2056051) to head (df79ff7).
⚠️ Report is 1 commit behind head on master.

Files with missing lines    Patch %    Lines
src/fixed/distance.rs       84.68%     15 Missing and 2 partials ⚠️
src/float/distance.rs       98.24%     3 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #286      +/-   ##
==========================================
+ Coverage   94.89%   94.99%   +0.09%     
==========================================
  Files          54       54              
  Lines        5705     6273     +568     
  Branches     5705     6273     +568     
==========================================
+ Hits         5414     5959     +545     
- Misses        273      289      +16     
- Partials       18       25       +7     


sdd (Owner) commented Feb 8, 2026

One more thing: would you be able to re-raise this PR with your changes, but targeting v5.x.x rather than master, please? There are some changes merged into master that I want to hold back from releasing on v5, as they are breaking. Hopefully I'll finish the v6 changes soon and can return to the more sensible state of master being the target branch for PRs.

cbueth (Author) commented Feb 8, 2026

Thanks for the appreciative response and the suggestions on this and #287, Scott! I agree with all of them and will open two rebased PRs with the changes integrated, sometime this week.

  • I tried to implement the default behaviour with +, but somehow did not figure out how to just use saturating_add; will do.
  • Finally, I was not able to construct a test where < vs <= makes a difference, but mathematically I agree <= is strictly the right choice. For now I have changed some back to <. You can see the two places it was added (D::IS_MAX_BASED was removed in commit 5b02ece; NON_STRICT_PRUNING would be a better name). I'd suggest treating this as a separate issue.

Great news that the rewrite is nearly done, good work! When choosing the branch to build on, I saw quite a few branches with restructured and rewritten code. Feel free to let me know if there is an isolated issue to work on or changes to review.
