From e4f10fb2216955f0183ff706331b3e4016da5630 Mon Sep 17 00:00:00 2001
From: ieQu1 <99872536+ieQu1@users.noreply.github.com>
Date: Wed, 18 Feb 2026 17:37:16 +0100
Subject: [PATCH] Add EIP-0034: Raft metadata layer

---
 active/0034-raft-metadata-layer.md | 174 +++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 active/0034-raft-metadata-layer.md

diff --git a/active/0034-raft-metadata-layer.md b/active/0034-raft-metadata-layer.md
new file mode 100644
index 0000000..481ee6b
--- /dev/null
+++ b/active/0034-raft-metadata-layer.md
@@ -0,0 +1,174 @@
+# An Example of EMQX Improvement Proposal
+
+## Changelog
+
+* 2026-02-18: @ieQu1 Initial version
+
+## Abstract
+
+Currently, meta-information about shards, sites and the Raft cluster is stored in Mnesia.
+This was a temporary design decision made at the early development stages of DS.
+
+This EIP describes a new metadata layer for the builtin Raft durable storage backend, that is meant to replace Mnesia.
+
+## Motivation
+
+Mnesia and DS use very different strategy in regard to clustering and data persistence.
+With rare exception of `local_data` tables, Mnesia replicas contain full copy of all data.
+This makes each individual node disposable: many cluster management and recovery operations involve dropping data on individual node(s), under the assumption that this operation doesn't compromise the global state.
+Generally speaking, Mnesia is much more forgiving, but also less reliable as a long-term storage.
+
+On the contrary, the durable storage
+1. Is meant as a long-term reliable storage for valuable customers' data.
+2. It is sharded, so each node in the cluster may store a portion of valuable data
+
+Because of such incompatible designs, Mnesia can't be used as a foundation for the durable storage.
+However, it doesn't exclude its use in the secondary roles, for example as a distributed cache.
+
+## Design
+
+The general idea of the EIP is about splitting Ra metadata management into tree logical components:
+
+1. Local metadata inventory based on `emqx_dsch` module.
+   Each node fully owns information about its shard replicas:
+   - The node is able to read and modify its metadata inventory even in case of complete network isolation
+   - These changes aren't lost in case of cluster joining or leaving
+   - Other nodes are unable to make direct changes to the nodes' inventories.
+
+2. Backplane protocol letting nodes make coordinated changes to their inventories.
+
+3. Change planners.
+   Coordinating major changes involving multiple shards or DBs, such as original shard allocation, cluster re-balancing or disaster recovery should be done via "planners".
+
+### Goals and constraints
+
+We expect that shard rebalancing and other complex metadata changes may be required as part of disaster recovery,
+in order to heal the cluster that had lost some nodes.
+As such, we cannot assume that all nodes will be healthy and operational during such operations.
+This adds a great deal of complexity to the design.
+
+### Background: plan-execute pattern
+
+I suggest to embrace "plan-execute" pattern for this task.
+It implies splitting the problem into three steps:
+
+1. Gathering the data (I/O is required, subject to network errors)
+2. Planning the changes (pure functional code, no I/O).
+   Planning stage produces a list of operations that each node should perform to converge to the desired end state.
+3. Execution stage where each node independently executes the actions planned at stage 2. (I/O is required)
+
+This approach is beneficial for several reasons:
+
+- The most complex parts of the algorithm can be moved to the planning stage, which is a pure function that can be tested extensively.
+
+- Flexibility.
+  There could be multiple planning functions for different situations, be it load rebalancing during normal operation or disaster recovery.
+
+- Fault tolerance.
+  With properly designed plan primitives, partial loss of nodes or communication between them doesn't threaten eventual convergence to the expected outcome.
+  Since "plan" is a materialized list of CRUD actions stored on the node, it is possible to track its execution or safely cancel it in its entirety.
+
+Constraints:
+
+1. Only one plan involving the DB can be executed at the same time on the site.
+2. Plans for different DBs can run in parallel
+
+### Plan primitives
+
+1. `{schedule, DB, PID, Prio, [plan_primitive()]}`.
+   Accept a new plan with identifier `PID` for `DB`.
+
+   If there's already a plan in execution, and its priority is greater or equal to `Prio` then the schedule command is ignored.
+   If `Prio` is greater then priority of the existing plan, then the command overrides it.
+
+   Plans created by the operator, as a result of CLI command or REST API call should take higher priority than any automation.
+
+2. `{add_replica, DB, Shard, Sites}`.
+   Add a local replica of the shard, using `Sites` as the upstream.
+
+3. `{handover_replica, DB, Shard, Site}`.
+   Delete a local replica of the shard as soon as `Site` has the replica.
+
+   This operation is meant to safely delete a replica without compromising the replication factor.
+
+4. `{remove_replica, DB, Shard}`.
+   Unconditionally remove a local replica.
+
+### Planners
+
+This section lists possible scenarios where creation and execution of the plan can be triggered.
+
+1. Opening of the new DB.
+   This use case is possibly the most complex scenario, since in a new cluster nodes may attempt to execute this function simultaneously.
+
+2. Nodes joining or leaving the cluster.
+
+3. Operator commands
+
+Planner functions can take any information from the nodes into account.
+They are free to use any method of synchronization.
+
+### Plan execution
+
+`emqx_dsch` pending mechanism will be used for plan execution:
+
+```erlang
+Plan = [Cmd1, Cmd2, ...],
+emqx_dsch:add_pending({db, DB}, execute_plan, Plan)
+```
+
+The entire scheduled plan should be mapped to a single pending action.
+Advancing steps of the plan involve removing the existing action and inserting back its tail (if present).
+
+`emqx_dsch` module should support the following new operations for managing the plans:
+
+1. Atomic swap of the existing pending plan with the new one, depending on the priorities.
+2. Atomic advancement of plan steps (swapping the list of planned actions with its tail upon completion of a step)
+
+Internally, these actions will be mapped to the existing schema operation primitives,
+but their execution will be serialized by the gen server.
+
+Proposed new `emqx_dsch` APIs:
+
+```erlang
+emqx_dsch:maybe_execute_plan(DB, Prio, [Action1, Action2, ...]).
+```
+
+```erlang
+emqx_dsch:replace_pending(pending_id(), Command, Data).
+```
+
+### Cache
+
+Since in the new design every node owns its own schema,
+CLI and REST API (as well as, possibly, planner functions) need a new way of getting the full cluster view.
+For this purpose it is acceptable to use Mnesia.
+
+## Configuration Changes
+
+TODO: There may be CLI and Rest API changes.
+
+## Backwards Compatibility
+
+There should be a migration procedure for importing old data from Mnesia to `dsch`.
+
+## Document Changes
+
+If there is any document change, give a brief description of it here.
+
+## Testing Suggestions
+
+TODO:
+The final implementation must include unit test or common test code. If some
+more tests such as integration test or benchmarking test that need to be done
+manually, list them here.
+
+## Declined Alternatives
+
+### Use Raft as a storage for metadata
+
+That design implies creation of a replicated state machine to hold the metadata of DBs and shards.
+While this design brings a commonly used replication algorithm to the table and reuses the components that are already in place,
+in other aspects (such as disaster recovery and cluster membership changes) it creates more problems than it solves.
+
+### CRDTs