feat(WIP): Add Chebyshev distance ($L_∞$ norm) support#286
feat(WIP): Add Chebyshev distance ($L_∞$ norm) support#286cbueth wants to merge 36 commits intosdd:masterfrom
Conversation
…ded in Rust 1.54 instead
fixes bug where nearest_n_within accessed self.content_items instead of remainder_items for remainder elements, causing incorrect results when dataset size % CHUNK_SIZE != 0. Also removed unnecessary unsafe code in best_n_within. Signed-off-by: Markus Zoppelt <markus.zoppelt@helsing.ai>
tests nearest_n_within with size-33 dataset to verify items in remainder region are found correctly. Before the fix, this would access self.content_items[0] instead of remainder_items[0], returning wrong items. Signed-off-by: Markus Zoppelt <markus.zoppelt@helsing.ai>
If leaf_items.len() exceeds u32::MAX (~4.3 billion), this silently truncates. For datasets with billions of points, this is realistic and causes severe corruption.
* release-plz checkout depth fixed so that full changelogs are generated * add commitlint with conventional commits config
…thin_unsorted_iter within_unsorted_iter is modified to decouple the lifetime of the iterator from that of the query by performing a generally very cheap copy just once at the start of the query
sdd
left a comment
There was a problem hiding this comment.
Carlson, I really appreciate that you've taken the time to craft such a thoughtful PR. It's not often that I get contributions that are as thorough and considered as this one is - thanks very much!
I have a small number of points:
- The switch from
Axis::rd_updateto DistanceMetric::accumulatemakes total sense. I think, if we add a default implementation forDistanceMetric::accumulate, which contains thesaturating_add/ addition-based method that theSquaredEuclideanandManhattanmetrics implement, then this would make the change non-breaking for anyone who depends on theDistanceMetrictrait as-is. - I do place a lot of weight in strictly conforming to semantic versioning, and so I prefer the rd_update deprecation approach that you went with in the PR rather than fully getting rid of it as suggested in the comment, whilst we're on the 5.x.x branch. v6 is an almost complete rewrite that I've been working on since September, and I'll be incorporating
DistanceMetric::accumulateinto there so we can carry over Chebyshev, and eliminatingrd_updatein v6. - Changing
dist < radiustodist <= radiuscould be considered a breaking change, but after reading up on it,<=is probably what most users would be expecting already anyway as that aligns with mathematical / geometric definitions, and what most other libraries seem to be doing, even if the word "within" may imply<. So I'm happy to treat this point as a bug-fix rather than a breaking functional change. Adding a clarification to the doc-comments of the radius query methods themselves would help with this clarification.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #286 +/- ##
==========================================
+ Coverage 94.89% 94.99% +0.09%
==========================================
Files 54 54
Lines 5705 6273 +568
Branches 5705 6273 +568
==========================================
+ Hits 5414 5959 +545
- Misses 273 289 +16
- Partials 18 25 +7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
One more thing - would you be able to re-raise this PR with your changes but with the target as |
|
Thanks for the appreciative response and the suggestions to this and #287, Scott! I agree with all of them and will open two rebased PRs with the integrated changes, surely sometime this week.
Great news about the rewrite nearly done, good work. When choosing the branch to build on, I saw there were quite some branches with restructured and rewritten code. Feel free to let me know if there is any isolated issue to work on or changes to be reviewed. |
…lidean distance metrics
- deprecate `rd_update` with `D::accumulate` for consistent handling of sum-based and max-based metrics - conditional logic for SIMD (L1/L2) and general L∞ - differentiate distance accumulation behaviour
- integration `nearest_n` tests (Chebyshev, Manhattan, SquaredEuclidean).
…doc, add Gaussian scenario to tests
df79ff7 to
7ebb80b
Compare
… trait - improve test coverage
This PR adds support for Chebyshev distance to kiddo's distance metrics, float and fixed. This required shaking two fundamental assumptions in the code:
Initially, when plugging my
Chebyshevstruct I noticed different distances of the k-th distance when usingkdtree::nearest_n(). Working on (1.) and (2.), the two points brought Chebychev on pair with brute force results (using thedistwhat did not work withnearest_nbefore) and the scipy python implementation, which uses a cKDTree under the hood which has been validated extensively (https://github.com/scipy/scipy/tree/v1.17.0/scipy/spatial/ckdtree/src). With the changes, all other tests for the existingManhattanandSquaredEuclideanmetric still worked, none failed, but due to performance considerations, already optimised code and pruning, I decided to add a distinction for max-based metrics.1. Adapt Leaf-Node Distance Calculations
The existing code hard-coded sum-based distance aggregation$\mathcal{L}_1$ (Manhattan) and $\mathcal{L}_2$ (SquaredEuclidean), but is incorrect for $\mathcal{L}_∞$ (Chebyshev).
+=for all metrics, which works forThe leaf node remainder loops (i.e.
nearest_n_within,best_n_withinanddists_for_chunk) now usedistance = D::accumulate(distance, D::dist1())for Manhattan/SquaredEuclidean/Chebyshev.Secondly, because Chebyshev uses the actual distance (not squared), points exactly at radius must be includedif distance <= radius. This has been changed at multiple places and test helpers, please let me know if there is a good reason to keepif distance < radius. I think this latter change should not impair performance largely, but changes how points exactly on the border are treated.Tests
The changes have been tested with several tests:
distanddist1nearest_ntests (2 tree types × 2 scenarios × 6 n values × 4 dimensions = 96)withinquery testsnearest_nmatches with scipy for querying once from each point in the input data:2. Adapt Pruning Logic for$\mathcal{L}_∞$ Distance
Testing with the previous changes showed, queries still failed for Chebyshev when scaling to 2000 Gaussian points. Querying from most points of the dataset the 3-th nearest distance was correct with high >1e-10 precision, but single ones were off by ±0.03 which is quite relevant. With higher dimensions this problem increased. With original
rd += deltapruning on 2000 Gaussian points, Chebyshev errors (considered affected if diff to scipy implementation is >1e10):The issue was in the pruning logic. The pruning logic uses distance estimate
rdto decide whether to explore subtrees. Previously, this always used+aggregationrd = Axis::rd_update(rd, D::dist1(new_off, old_off));whererd_updatealways returnedrd + delta. This is mathematically incorrect for Chebyshev:This means the pruning needs to be metric aware. The simplest solution I see is extending the
DistanceMetrictrait with metric-specific aggregation:Manhattan/SquaredEuclidean keep
fn accumulate(rd: A, delta: A) -> A { rd + delta }. Chebyshev now hasfn accumulate(rd: A, delta: A) -> A { rd.max(delta) }. At the same time this deprecatesrd_update(rd, D::dist1(...));.Now after this, all tests from before are still successful and Manhattan and SquaredEuclidean have 0% errors across all dimensions as before, but now Chebychev too, meaning the pruning logic is adapted to$\mathcal{L}_∞$ , too.
I am happy to receive feedback on the changes so I can possibly add some of them (like moving tests to a preferred place). I am not too happy with extending the
DistanceMetrictrait, breaking existing ones, but as it will add new functionality this is a reasonable price to pay. I have double checked the pruning strictness and tried to nail down where to use accumulate and think most code stays untouched. (Cannot remove the (WIP) from title).