
draft: RFC for Horizontal scalability for ahnlich#276

Draft
jimezesinachi wants to merge 7 commits into main from rfc/replication

Conversation

@jimezesinachi (Collaborator) commented Nov 17, 2025

Signed-off-by: Jim Ezesinachi <ezesinachijim@gmail.com>
@jimezesinachi changed the title to "draft: RFC for Horizontal scalability for ahnlich" (issue: https://github.com/deven96/ahnlich/issues/271) on Nov 17, 2025
github-actions bot commented Nov 17, 2025

Test Results

0 tests (-264)    0 ✅ (-264)    0s ⏱️ (-11m 8s)
0 suites (-35)    0 💤 (±0)
0 files (-4)      0 ❌ (±0)

Results for commit 7f393b2. ± Comparison against base commit 7fdb808.

♻️ This comment has been updated with latest results.


github-actions bot commented Nov 17, 2025

Benchmark Results

group                                                        main                                   pr
-----                                                        ----                                   --
predicate_query_with_index/size_100                          1.03      3.1±0.01µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec
predicate_query_with_index/size_1000                         1.03     33.8±0.08µs        ? ?/sec    1.00     32.8±0.02µs        ? ?/sec
predicate_query_with_index/size_10000                        1.02    393.1±0.16µs        ? ?/sec    1.00    386.0±0.55µs        ? ?/sec
predicate_query_with_index/size_100000                       1.01      4.9±0.00ms        ? ?/sec    1.00      4.8±0.02ms        ? ?/sec
predicate_query_without_index/size_100                       1.00      7.2±0.00µs        ? ?/sec    1.00      7.2±0.03µs        ? ?/sec
predicate_query_without_index/size_1000                      1.00     99.3±0.09µs        ? ?/sec    1.01    100.3±0.26µs        ? ?/sec
predicate_query_without_index/size_10000                     1.07    836.3±3.19µs        ? ?/sec    1.00    779.2±3.22µs        ? ?/sec
predicate_query_without_index/size_100000                    1.11     13.6±0.10ms        ? ?/sec    1.00     12.3±0.06ms        ? ?/sec
store_batch_insertion_without_predicates/size_100            1.02    229.6±1.39µs        ? ?/sec    1.00    226.2±1.71µs        ? ?/sec
store_batch_insertion_without_predicates/size_1000           1.02   1234.6±8.07µs        ? ?/sec    1.00   1213.0±9.82µs        ? ?/sec
store_batch_insertion_without_predicates/size_10000          1.00     13.0±0.08ms        ? ?/sec    1.01     13.1±0.05ms        ? ?/sec
store_batch_insertion_without_predicates/size_100000         1.00    129.3±0.43ms        ? ?/sec    1.01    131.2±0.27ms        ? ?/sec
store_retrieval_no_condition/size_100                        1.03    112.7±0.57µs        ? ?/sec    1.00    109.5±1.04µs        ? ?/sec
store_retrieval_no_condition/size_1000                       1.01    779.5±9.55µs        ? ?/sec    1.00   773.5±12.64µs        ? ?/sec
store_retrieval_no_condition/size_10000                      1.01      7.1±0.04ms        ? ?/sec    1.00      7.0±0.02ms        ? ?/sec
store_retrieval_no_condition/size_100000                     1.00     78.3±0.22ms        ? ?/sec    1.00     78.3±0.17ms        ? ?/sec
store_retrieval_non_linear_kdtree/size_100                   1.00    193.5±0.42µs        ? ?/sec    1.00    193.2±0.29µs        ? ?/sec
store_retrieval_non_linear_kdtree/size_1000                  1.00   1147.0±2.80µs        ? ?/sec    1.00   1142.8±1.75µs        ? ?/sec
store_retrieval_non_linear_kdtree/size_10000                 1.00     11.3±0.01ms        ? ?/sec    1.02     11.6±0.15ms        ? ?/sec
store_retrieval_non_linear_kdtree/size_100000                1.00    134.9±0.12ms        ? ?/sec    1.00    135.6±0.07ms        ? ?/sec
store_sequential_insertion_without_predicates/size_100       1.00    273.7±0.54µs        ? ?/sec    1.00    272.3±0.20µs        ? ?/sec
store_sequential_insertion_without_predicates/size_1000      1.00      2.7±0.01ms        ? ?/sec    1.00      2.7±0.00ms        ? ?/sec
store_sequential_insertion_without_predicates/size_10000     1.00     26.9±0.18ms        ? ?/sec    1.00     26.9±0.12ms        ? ?/sec
store_sequential_insertion_without_predicates/size_100000    1.00    268.5±0.30ms        ? ?/sec    1.01    270.5±0.91ms        ? ?/sec

@jimezesinachi jimezesinachi marked this pull request as draft November 18, 2025 07:59
@deven96 (Owner) left a comment:

So far so good I think this is huge progress already

Particularly liking the test setup to ensure quorum... I don't think those got exercised to the point where a snapshot was triggered though, i.e. we've had so many writes that we trigger an async snapshot savepoint.

Both DB and AI CLIs now include cluster flags:

- `--cluster-enabled`
- `--raft-node-id`
@deven96 (Owner):

Curious why this is surfaced... Aren't these supposed to be implicitly assigned? i.e. when you join a cluster you get a node ID?

@jimezesinachi (Collaborator, Author) commented Feb 24, 2026:

In theory, yes. But we would have to set up some kind of Raft config persistence that keeps track of which ID we gave to an ahnlich node in a cluster, so that the node can rejoin the cluster if it ever gets disconnected, and I was thinking maybe that's out of scope for ahnlich.

- `--admin-addr`
- `--raft-storage` (`memory` or `rocksdb`)
- `--raft-data-dir` (required when `--raft-storage rocksdb`)
- `--raft-snapshot-logs`
@deven96 (Owner):

Could we perhaps get short descriptions for these CLI arguments? I personally would like to know snapshot logs vs. storage.

@jimezesinachi (Collaborator, Author):

Makes sense. I think some of the names can be better; after I add the descriptions, you can look and see if we should change any of them.


### Read semantics

In cluster mode, read RPCs call `ensure_linearizable()`. Reads are leader-gated and followers can return `UNAVAILABLE`/forwarding-related errors.
@deven96 (Owner):

Ah interesting... so even reads come through the leader?

@jimezesinachi (Collaborator, Author):

At the moment, yes, as a way to ensure consistency and avoid "stale" reads, but what do you think?
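The leader-gated read semantics discussed above can be sketched as follows. This is a minimal illustration, not ahnlich's actual code: `Node`, `Role`, and `linearizable_read` are hypothetical stand-ins for the RPC handler that calls `ensure_linearizable()`, showing only the observable behavior (leader serves the read, follower returns an `Unavailable`-style error with a leader hint for forwarding).

```rust
#[derive(Debug, PartialEq)]
enum Role {
    Leader,
    Follower,
}

#[derive(Debug, PartialEq)]
enum ReadError {
    /// Mirrors the UNAVAILABLE / forwarding-related errors a follower returns.
    Unavailable { leader_hint: Option<u64> },
}

struct Node {
    role: Role,
    known_leader: Option<u64>,
    value: String,
}

impl Node {
    /// Leader-gated read: only the leader serves reads (roughly what
    /// `ensure_linearizable()` enforces via a read index in Raft);
    /// followers reject so the client can retry against the leader.
    fn linearizable_read(&self) -> Result<&str, ReadError> {
        match self.role {
            Role::Leader => Ok(&self.value),
            Role::Follower => Err(ReadError::Unavailable {
                leader_hint: self.known_leader,
            }),
        }
    }
}

fn main() {
    let leader = Node { role: Role::Leader, known_leader: Some(1), value: "v1".into() };
    let follower = Node { role: Role::Follower, known_leader: Some(1), value: "v1".into() };

    assert_eq!(leader.linearizable_read(), Ok("v1"));
    assert_eq!(
        follower.linearizable_read(),
        Err(ReadError::Unavailable { leader_hint: Some(1) })
    );
}
```

The trade-off being debated in the thread is exactly this gate: routing all reads through the leader gives linearizability at the cost of read scalability on followers.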

.expect("no workspace root");
let proto_dir = workspace_root.join("protos");

let cluster_admin = proto_dir.join("services/cluster_admin.proto");
@deven96 (Owner):

Ensured that these don't get copied with the public services when we export SDKs for other languages?

@jimezesinachi (Collaborator, Author):

Not yet, will do so now. Is it okay if I put the new Raft protos in a new location (inside the new replication crate), so they are localised to where they're used?

}

#[derive(Debug)]
struct MemLogStoreInner<C: RaftTypeConfig> {
@deven96 (Owner):

I think it's fine to not offer this at all... clusters should inherently mean you want more reliability, so in-memory feels counter-intuitive for this.

@jimezesinachi (Collaborator, Author):

Makes sense. I was adding it as a test option more than anything else, but we can remove it completely if necessary. Is removing it from the public selection/CLI API and just leaving it in for tests/mocking okay?

#[arg(long, requires = "cluster_enabled")]
pub raft_node_id: u64,
/// Raft internal address host:port
#[arg(long, requires = "cluster_enabled")]
@deven96 (Owner):

I think both addresses should be optional

@jimezesinachi (Collaborator, Author):

See here: #276 (comment)

/// Snapshot after N logs
#[arg(long, default_value_t = 1000, requires = "cluster_enabled")]
pub raft_snapshot_logs: u64,
/// Join existing cluster via admin addr host:port
@deven96 (Owner):

Would this be necessary if we already have a specified admin_addr default where we try to first join or then create a cluster?

@jimezesinachi (Collaborator, Author):

Hmmm, it won't, but at the moment, the reasoning is that we don't specify admin_addr

directml = ["ort/directml"]

[dev-dependencies]
bincode.workspace = true
@deven96 (Owner):

We should replace bincode from the workspace entirely as it isn't used elsewhere

@jimezesinachi (Collaborator, Author) commented Feb 24, 2026:

Makes sense, I'll look into it

ClusterAdminServiceClient::connect(format!("http://{join}")).await
{
let _ = client
.add_learner(tonic::Request::new(AddLearnerRequest { node: Some(node) }))
@deven96 (Owner):

Is joining distinct from all the other tasks we are spawning here? And if so, what happens if the join fails? Seems we are ignoring it.

@jimezesinachi (Collaborator, Author) commented Feb 24, 2026:

Good catch, the order is wrong. Joining is supposed to be tried first if passed, and if it fails, none of the others should spawn
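The intended "join first, then spawn" ordering could be sketched like this. It is a hedged illustration only: `try_join`, `start_node`, and the task names are hypothetical stand-ins for the real `add_learner` call over the cluster admin service and the node's startup tasks; the point is that a failed join propagates an error instead of being discarded with `let _ =`, so no other task is spawned after it.

```rust
#[derive(Debug, PartialEq)]
enum StartupError {
    JoinFailed(String),
}

// Hypothetical stand-in for the real join RPC (`add_learner` against the
// admin address); here it simply succeeds or fails based on a flag.
fn try_join(join_addr: &str, reachable: bool) -> Result<(), StartupError> {
    if reachable {
        Ok(())
    } else {
        Err(StartupError::JoinFailed(format!("could not reach {join_addr}")))
    }
}

// Join-first startup: if a join address was passed, the join must succeed
// before any of the other node tasks are spawned; a failed join aborts
// startup instead of being silently ignored.
fn start_node(join_addr: Option<&str>, join_ok: bool) -> Result<Vec<&'static str>, StartupError> {
    if let Some(addr) = join_addr {
        try_join(addr, join_ok)?; // propagate instead of `let _ =`
    }
    // Only reached once joining (if requested) has succeeded.
    Ok(vec!["raft-loop", "admin-server", "grpc-server"])
}

fn main() {
    assert!(start_node(Some("10.0.0.1:7100"), true).is_ok());
    assert!(start_node(Some("10.0.0.1:7100"), false).is_err());
    assert!(start_node(None, false).is_ok()); // no join requested: start normally
}
```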

}
let snap_guard = stores_snapshot.guard();
for (key, value) in stores_snapshot.iter(&snap_guard) {
let _ = self.stores.try_insert(key.clone(), value.clone(), &guard);
@deven96 (Owner):

We should probably not ignore these errors... i.e. at least aggressively error-log them.

@jimezesinachi (Collaborator, Author):

Makes sense, I'll get on it
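A sketch of what "don't silently ignore the insert result" could look like, under stated assumptions: `apply_snapshot` is hypothetical, a plain `HashMap` stands in for ahnlich's concurrent store map, and `eprintln!` stands in for a real `tracing::error!` call. It mirrors `try_insert`'s fail-if-present semantics but counts and logs every skipped insert instead of discarding the result with `let _ =`.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Restore stores from a snapshot, mirroring `try_insert` semantics
/// (the insert fails if the key is already present) but surfacing every
/// failed insert instead of discarding it.
fn apply_snapshot(stores: &mut HashMap<String, String>, snapshot: &[(String, String)]) -> usize {
    let mut skipped = 0;
    for (key, value) in snapshot {
        match stores.entry(key.clone()) {
            Entry::Vacant(slot) => {
                slot.insert(value.clone());
            }
            Entry::Occupied(_) => {
                // Stand-in for `tracing::error!`: log the collision loudly.
                skipped += 1;
                eprintln!("snapshot restore: store '{key}' already present, insert skipped");
            }
        }
    }
    skipped
}

fn main() {
    let mut stores = HashMap::from([("a".to_string(), "kept".to_string())]);
    let snapshot = vec![
        ("a".to_string(), "ignored".to_string()),
        ("b".to_string(), "added".to_string()),
    ];
    let skipped = apply_snapshot(&mut stores, &snapshot);
    assert_eq!(skipped, 1);
    assert_eq!(stores.get("a").map(String::as_str), Some("kept"));
    assert_eq!(stores.get("b").map(String::as_str), Some("added"));
}
```

Returning the skip count (or the colliding keys) also gives the caller a hook to decide whether a collision during restore should be fatal rather than merely logged.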

