draft: RFC for Horizontal scalability for ahnlich #276

jimezesinachi wants to merge 7 commits into main

Conversation
Horizontal scalability for ahnlich
Benchmark Results
Signed-off-by: Jim Ezesinachi <ezesinachijim@gmail.com>
deven96 left a comment:
So far so good; I think this is huge progress already.
I particularly like the test setup to ensure quorum... I don't think those tests got exercised to the point where a snapshot was triggered, though, i.e. we've never had so many writes that an async snapshot savepoint gets triggered.
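For reference, the trigger condition such a test would need to exercise can be sketched as below. This is a std-only stand-in assuming a "logs since last snapshot" policy like `--raft-snapshot-logs`; `should_snapshot` is a hypothetical helper for illustration, not ahnlich's actual code.

```rust
/// Decide whether an async snapshot savepoint should be taken,
/// assuming a "logs since last snapshot" threshold policy.
/// `last_applied` and `last_snapshot` are raft log indexes.
fn should_snapshot(last_applied: u64, last_snapshot: u64, threshold: u64) -> bool {
    last_applied.saturating_sub(last_snapshot) >= threshold
}
```

With the default threshold of 1000, a test would need at least 1000 writes after the last snapshot before the savepoint fires, which is likely why it never triggered.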
> Both DB and AI CLIs now include cluster flags:
>
> - `--cluster-enabled`
> - `--raft-node-id`
Curious why this is surfaced... Aren't these supposed to be implicitly assigned? i.e when you join a cluster you get a node ID?
In theory, yes. But we would have to set up some kind of Raft config persistence that keeps track of which ID we gave to each ahnlich node in a cluster, so that a node can rejoin the cluster if it ever gets disconnected, and I was thinking that might be out of scope for ahnlich.
> - `--admin-addr`
> - `--raft-storage` (`memory` or `rocksdb`)
> - `--raft-data-dir` (required when `--raft-storage rocksdb`)
> - `--raft-snapshot-logs`
Could we perhaps get short descriptions for these CLI arguments? I personally would like to know snapshot logs vs storage.
Makes sense. I think some of the names can be better; after I add the descriptions, you can look and see if we should change any of the names.
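As a starting point, descriptions along these lines would cover the snapshot-logs vs storage distinction: storage is where the raft log and state machine live, while snapshot-logs is a compaction threshold. The wording below is a suggestion, not the final help text, and the lookup helper is purely illustrative.

```rust
/// Suggested short descriptions for the cluster CLI flags
/// (illustrative helper; the real flags are clap arguments).
fn flag_help(flag: &str) -> Option<&'static str> {
    match flag {
        "--cluster-enabled" => Some("run this node as part of a raft cluster"),
        "--raft-node-id" => Some("unique u64 identity of this node within the cluster"),
        "--admin-addr" => Some("host:port serving cluster admin RPCs (join, add-learner)"),
        "--raft-storage" => Some("backend for the raft log/state: `memory` or `rocksdb`"),
        "--raft-data-dir" => Some("on-disk directory used by the `rocksdb` backend"),
        "--raft-snapshot-logs" => Some("take a snapshot after this many applied log entries"),
        _ => None,
    }
}
```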
> ### Read semantics
>
> In cluster mode, read RPCs call `ensure_linearizable()`. Reads are leader-gated and followers can return `UNAVAILABLE`/forwarding-related errors.
Ah interesting... so even reads come through the leader?
At the moment, yes, as a way to ensure consistency and avoid "stale" reads, but what do you think?
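The leader-gated read behaviour described above can be sketched with a std-only stand-in: a follower rejects the read (surfaced as gRPC `UNAVAILABLE` in the real server) so the client retries against the leader. Names and types here are illustrative, not ahnlich's actual API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Role {
    Leader,
    Follower,
}

/// Leader-gated read: only the leader serves reads, so every read is
/// linearizable with respect to committed writes. A follower returns
/// an error instead of a possibly stale value.
fn read(
    role: &Role,
    store: &HashMap<String, String>,
    key: &str,
) -> Result<Option<String>, &'static str> {
    if *role != Role::Leader {
        // In the real server this surfaces as a gRPC UNAVAILABLE status.
        return Err("UNAVAILABLE: not leader");
    }
    Ok(store.get(key).cloned())
}
```

The trade-off is that followers cannot serve reads at all, so read throughput does not scale with cluster size; that is the consistency-first choice being discussed here.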
```rust
        .expect("no workspace root");
    let proto_dir = workspace_root.join("protos");

    let cluster_admin = proto_dir.join("services/cluster_admin.proto");
```
Have we ensured that these don't get copied with the public services when we export SDKs for other languages?
Not yet, will do so now. Is it okay if I put the new raft protos in a new location (inside the new replication crate), so they are localised to where they're used?
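Whichever location the protos end up in, the export step could filter cluster-internal service definitions out of the public set. A minimal sketch of that filtering, with hypothetical names (the actual SDK export tooling may work differently):

```rust
/// Keep only protos meant for public SDK export, dropping
/// cluster-internal services such as `cluster_admin.proto`.
/// (Hypothetical helper for illustration.)
fn public_protos<'a>(all: &[&'a str], internal: &[&str]) -> Vec<&'a str> {
    all.iter()
        .filter(|path| !internal.iter().any(|name| path.ends_with(name)))
        .copied()
        .collect()
}
```

Localising the raft protos inside the replication crate would make this split structural rather than name-based, since the public proto directory would simply never contain them.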
```rust
}

#[derive(Debug)]
struct MemLogStoreInner<C: RaftTypeConfig> {
```
I think it's fine to not offer this at all... clusters should inherently mean you want more reliability, so in-memory feels counter-intuitive here.
Makes sense. I was adding it as a test option more than anything else, but we can remove it completely if necessary. Is removing it from the public selection/CLI API and just leaving it in for tests/mocking okay?
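If the in-memory backend is removed from the public selection, the `--raft-storage` parser collapses to a single accepted value and rejects `memory` outright, while the mem store stays available internally for tests. A sketch with illustrative types (not ahnlich's actual CLI code):

```rust
/// Storage selector once the in-memory option is dropped from the
/// public CLI; only rocksdb remains selectable.
#[derive(Debug, PartialEq)]
enum RaftStorage {
    RocksDb,
}

/// Parse the `--raft-storage` value, rejecting the removed backend
/// with an explicit error instead of silently accepting it.
fn parse_storage(value: &str) -> Result<RaftStorage, String> {
    match value {
        "rocksdb" => Ok(RaftStorage::RocksDb),
        other => Err(format!("unsupported raft storage backend: {other}")),
    }
}
```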
```rust
#[arg(long, requires = "cluster_enabled")]
pub raft_node_id: u64,
/// Raft internal address host:port
#[arg(long, requires = "cluster_enabled")]
```
I think both addresses should be optional
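One way to make the addresses optional is to derive a default when the flag is omitted, e.g. from a base port plus a fixed offset per service. A std-only sketch; the ports and offsets here are made up for illustration, not ahnlich's actual defaults.

```rust
use std::net::{Ipv4Addr, SocketAddr};

/// Resolve an address flag: use the explicitly passed value when
/// present, otherwise fall back to localhost at base_port + offset.
fn resolve_addr(explicit: Option<SocketAddr>, base_port: u16, offset: u16) -> SocketAddr {
    explicit.unwrap_or_else(|| SocketAddr::from((Ipv4Addr::LOCALHOST, base_port + offset)))
}
```

In clap terms this would mean changing the fields to `Option<SocketAddr>` and dropping `requires = "cluster_enabled"`-mandated values, with resolution happening at startup.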
```rust
/// Snapshot after N logs
#[arg(long, default_value_t = 1000, requires = "cluster_enabled")]
pub raft_snapshot_logs: u64,
/// Join existing cluster via admin addr host:port
```
Would this be necessary if we already had a specified admin_addr default, where we first try to join and then fall back to creating a cluster?
Hmm, it won't, but at the moment the reasoning is that we don't specify a default admin_addr.
```toml
directml = ["ort/directml"]

[dev-dependencies]
bincode.workspace = true
```
We should remove bincode from the workspace entirely, as it isn't used anywhere else.
Makes sense, I'll look into it
```rust
ClusterAdminServiceClient::connect(format!("http://{join}")).await
{
    let _ = client
        .add_learner(tonic::Request::new(AddLearnerRequest { node: Some(node) }))
```
Is joining distinct from all the other tasks we are spawning here? And if so, what happens if the join fails? It seems we are ignoring the error.
Good catch, the order is wrong. Joining is supposed to be tried first if passed, and if it fails, none of the others should spawn
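The intended control flow can be sketched std-only: attempt the join first when a target was passed, propagate the error instead of swallowing it with `let _ =`, and only spawn the remaining tasks on success. `try_join` stands in for the tonic `add_learner` call; names are illustrative.

```rust
/// Start-up ordering sketch: join (if requested) must succeed before
/// anything else is spawned; a failed join aborts start-up.
fn start_node(
    join_target: Option<&str>,
    try_join: impl Fn(&str) -> Result<(), String>,
) -> Result<Vec<&'static str>, String> {
    if let Some(addr) = join_target {
        // Propagate the failure rather than discarding it.
        try_join(addr).map_err(|e| format!("failed to join cluster at {addr}: {e}"))?;
    }
    // Only spawn the remaining tasks once membership is settled.
    Ok(vec!["raft_server", "grpc_server", "metrics"])
}
```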
```rust
}
let snap_guard = stores_snapshot.guard();
for (key, value) in stores_snapshot.iter(&snap_guard) {
    let _ = self.stores.try_insert(key.clone(), value.clone(), &guard);
```
We should probably not ignore these errors... i.e. at least log them aggressively at error level.
Makes sense, I'll get on it
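A sketch of the fix: surface each failed insert during snapshot install instead of discarding it. A plain `HashMap` with a reject-duplicates rule stands in for the concurrent store's `try_insert`, and `eprintln!` stands in for a real `tracing::error!` call; all names here are illustrative.

```rust
use std::collections::HashMap;

/// Install a snapshot into the store, logging every failed insert and
/// returning the failure count instead of swallowing errors.
fn install_snapshot(
    stores: &mut HashMap<String, String>,
    snapshot: &[(String, String)],
) -> usize {
    let mut failures = 0;
    for (key, value) in snapshot {
        if stores.contains_key(key) {
            // In the real code this would be an aggressive error log,
            // e.g. tracing::error!, including the failing key.
            eprintln!("snapshot install: failed to insert key {key:?}");
            failures += 1;
        } else {
            stores.insert(key.clone(), value.clone());
        }
    }
    failures
}
```

Returning the count also gives the caller a hook to decide whether a partially applied snapshot should abort the install entirely.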
Issue: Horizontal scalability for ahnlich