fix(framework): Make RUNNING transition atomic across replicas#6792
fix(framework): Make RUNNING transition atomic across replicas#6792
Conversation
Signed-off-by: Patrick Foley <patrick@flower.ai>
There was a problem hiding this comment.
Pull request overview
This PR aims to make the STARTING -> RUNNING run-status transition safe under multi-replica contention in the SQL-based LinkState implementation, and adds concurrent tests to validate uniqueness/claim behavior across SQLite-backed replicas.
Changes:
- Add an atomic
UPDATE ... WHERE ... RETURNINGclaim path forStatus.RUNNINGtransitions inSqlLinkState.update_run_status. - Improve/standardize an error log message related to invalid
sub_statusvalues. - Add multi-threaded, multi-replica SQLite tests to ensure unique claiming of messages and the RUNNING transition.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
framework/py/flwr/server/superlink/linkstate/sql_linkstate.py |
Introduces an atomic update path for claiming STARTING -> RUNNING across replicas and adjusts an error message. |
framework/py/flwr/server/superlink/linkstate/linkstate_test.py |
Adds file-based SQLite concurrency tests to validate unique claiming under contention and atomic RUNNING transitions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
…n accidentally included from that specific branch.
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
| threads = [ | ||
| threading.Thread(target=claim_running, args=(0, state_0)), | ||
| threading.Thread(target=claim_running, args=(1, state_1)), | ||
| ] |
There was a problem hiding this comment.
Just curious, does this unit test really demonstrate horizontal scalability? It seems to verify thread safety (i.e., multiple threads with separate DB connections can query concurrently), but not necessarily how the system behaves under real load.
Since each RPC already runs in its own thread with its own DB connection (that was the case when we still use sqlite3, this test might not add much beyond that.
Would it make more sense to test this using multi-processing to better approximate real-world scaling?
There was a problem hiding this comment.
Btw, this is a very valuable unit test! We should definitely have it, but prolly not use multi-threading.
Signed-off-by: Patrick Foley <patrick@flower.ai>
Signed-off-by: Patrick Foley <patrick@flower.ai>
|
Resolved by #6877 |
Issue
Description
Related issues/PRs
Proposal
Explanation
Checklist
#contributions)Any other comments?