Add daemon heartbeat capability registry#77
Conversation
Co-authored-by: multica-agent <github@multica.ai>
There was a problem hiding this comment.
Code Review
This pull request introduces daemon capabilities and a heartbeat mechanism to the Mizan API and daemon. It adds new database columns and indexes for tracking daemon node capabilities (such as provider family, model IDs, max concurrency, region, and health status), exposes a new /daemon/heartbeat endpoint, and updates the daemon to periodically report its status. Feedback on the changes highlights critical Postgres compatibility issues where INTEGER database columns are mapped to i64 in Rust, which will cause runtime decoding errors. Additionally, it is recommended to optimize the daemon node selection query by filtering out stale nodes directly in SQL using the last_seen_at index instead of filtering them in memory.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| public_key: Option<String>, | ||
| status: String, | ||
| revoked: i64, | ||
| disabled: i64, |
There was a problem hiding this comment.
| last_seen_at: Option<String>, | ||
| provider_family: Option<String>, | ||
| model_ids_json: String, | ||
| max_concurrency: Option<i64>, |
There was a problem hiding this comment.
The 'max_concurrency' column is defined as INTEGER in the migration, which maps to i32 in Postgres. Using Option in DbDaemonNode will cause a runtime decoding error when running on Postgres. Changing the type to Option ensures cross-database compatibility.
| max_concurrency: Option<i64>, | |
| max_concurrency: Option<i32>, |
| public_key: row.public_key, | ||
| status: row.status, | ||
| revoked: is_enabled(row.revoked), | ||
| disabled: is_enabled(row.disabled), |
| let cutoff = now_utc_epoch_seconds().saturating_sub(stale_after_seconds.max(1)); | ||
| let rows = query_as::<_, (String, String, String, i64, String)>(&prepare_sql( | ||
| database_backend, | ||
| "SELECT id, provider_family, model_ids_json, max_concurrency, last_seen_at | ||
| FROM daemon_nodes | ||
| WHERE status = ? | ||
| AND revoked = 0 | ||
| AND disabled = 0 | ||
| AND health_status = ? | ||
| AND provider_family IS NOT NULL | ||
| AND max_concurrency IS NOT NULL | ||
| AND last_seen_at IS NOT NULL | ||
| ORDER BY last_seen_at DESC, created_at ASC", | ||
| )) | ||
| .bind(STATUS_ACTIVE) | ||
| .bind(HEALTH_STATUS_HEALTHY) | ||
| .fetch_all(database) | ||
| .await | ||
| .map_err(|error| AppError::infrastructure(error.to_string()))?; |
There was a problem hiding this comment.
The query currently fetches all active and healthy daemon nodes from the database and filters out stale ones in memory. Since 'last_seen_at' is indexed ('idx_daemon_nodes_last_seen_at'), we can significantly optimize this by filtering stale nodes directly in the SQL query. This avoids loading and parsing JSON for potentially hundreds or thousands of stale daemon nodes.
Additionally, 'max_concurrency' is defined as INTEGER in the migration, which maps to i32 in Postgres. Decoding it as i64 in query_as will cause a runtime type mismatch error on Postgres. Changing the tuple type to i32 resolves this compatibility issue.
let cutoff = now_utc_epoch_seconds().saturating_sub(stale_after_seconds.max(1));
let cutoff_str = cutoff.to_string();
let rows = query_as::<_, (String, String, String, i32, String)>(&prepare_sql(
database_backend,
"SELECT id, provider_family, model_ids_json, max_concurrency, last_seen_at
FROM daemon_nodes
WHERE status = ?
AND revoked = 0
AND disabled = 0
AND health_status = ?
AND provider_family IS NOT NULL
AND max_concurrency IS NOT NULL
AND last_seen_at >= ?
ORDER BY last_seen_at DESC, created_at ASC",
))
.bind(STATUS_ACTIVE)
.bind(HEALTH_STATUS_HEALTHY)
.bind(cutoff_str)
.fetch_all(database)
.await
.map_err(|error| AppError::infrastructure(error.to_string()))?;| let last_seen = parse_timestamp(&last_seen_at)?; | ||
| if last_seen < cutoff { | ||
| continue; | ||
| } |
Co-authored-by: multica-agent <github@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b254d5ff4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pricing_metadata: self.pricing_metadata.clone(), | ||
| region: self.region.clone(), | ||
| labels: self.labels.clone(), | ||
| health_status: Some("healthy".to_owned()), |
There was a problem hiding this comment.
Report degraded health when the provider is down
In the daemon run loop, every heartbeat stores health_status = healthy unconditionally, and select_eligible_daemon_node later treats any fresh row with that value as dispatchable. When the daemon process is alive but its configured local provider or health endpoint is unavailable, the control plane will keep selecting this node instead of excluding it as unhealthy, causing requests to be routed to broken capacity until the heartbeat stops becoming fresh.
Useful? React with 👍 / 👎.
Summary
Verification