Skip to content

Add daemon heartbeat capability registry#77

Merged
wauputr4 merged 4 commits into
mainfrom
agent/dimas/58006f76
Jun 7, 2026
Merged

Add daemon heartbeat capability registry#77
wauputr4 merged 4 commits into
mainfrom
agent/dimas/58006f76

Conversation

@wauputr4

@wauputr4 wauputr4 commented Jun 7, 2026

Copy link
Copy Markdown
Member

Summary

  • add daemon capability columns and admin listing fields for provider family, model ids, capacity, region/labels, pricing metadata, health, and disabled status
  • add /daemon/heartbeat plus optional capability payload on registration, with validation and atomic storage updates
  • add daemon run-loop heartbeat publishing and config fields for advertised capabilities
  • add eligibility helper/tests covering online, stale, disabled, and unhealthy daemon selection

Verification

  • git diff --check
  • not run: cargo fmt/test (cargo is not installed in this environment)

Co-authored-by: multica-agent <github@multica.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces daemon capabilities and a heartbeat mechanism to the Mizan API and daemon. It adds new database columns and indexes for tracking daemon node capabilities (such as provider family, model IDs, max concurrency, region, and health status), exposes a new /daemon/heartbeat endpoint, and updates the daemon to periodically report its status. Feedback on the changes highlights critical Postgres compatibility issues where INTEGER database columns are mapped to i64 in Rust, which will cause runtime decoding errors. Additionally, it is recommended to optimize the daemon node selection query by filtering out stale nodes directly in SQL using the last_seen_at index instead of filtering them in memory.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread crates/mizan-api/src/daemon_nodes.rs Outdated
public_key: Option<String>,
status: String,
revoked: i64,
disabled: i64,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The 'disabled' column is defined as INTEGER in the migration, which maps to i32 in Postgres. Using i64 in DbDaemonNode will cause a runtime decoding error when running on Postgres. Changing the type to i32 ensures cross-database compatibility.

Suggested change
disabled: i64,
disabled: i32,

Comment thread crates/mizan-api/src/daemon_nodes.rs Outdated
last_seen_at: Option<String>,
provider_family: Option<String>,
model_ids_json: String,
max_concurrency: Option<i64>,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The 'max_concurrency' column is defined as INTEGER in the migration, which maps to i32 in Postgres. Using Option in DbDaemonNode will cause a runtime decoding error when running on Postgres. Changing the type to Option ensures cross-database compatibility.

Suggested change
max_concurrency: Option<i64>,
max_concurrency: Option<i32>,

public_key: row.public_key,
status: row.status,
revoked: is_enabled(row.revoked),
disabled: is_enabled(row.disabled),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since 'disabled' has been changed to i32 for Postgres compatibility, we should cast it to i64 when passing it to is_enabled to match its expected signature.

Suggested change
disabled: is_enabled(row.disabled),
disabled: is_enabled(row.disabled as i64),

Comment on lines +650 to +668
let cutoff = now_utc_epoch_seconds().saturating_sub(stale_after_seconds.max(1));
let rows = query_as::<_, (String, String, String, i64, String)>(&prepare_sql(
database_backend,
"SELECT id, provider_family, model_ids_json, max_concurrency, last_seen_at
FROM daemon_nodes
WHERE status = ?
AND revoked = 0
AND disabled = 0
AND health_status = ?
AND provider_family IS NOT NULL
AND max_concurrency IS NOT NULL
AND last_seen_at IS NOT NULL
ORDER BY last_seen_at DESC, created_at ASC",
))
.bind(STATUS_ACTIVE)
.bind(HEALTH_STATUS_HEALTHY)
.fetch_all(database)
.await
.map_err(|error| AppError::infrastructure(error.to_string()))?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The query currently fetches all active and healthy daemon nodes from the database and filters out stale ones in memory. Since 'last_seen_at' is indexed ('idx_daemon_nodes_last_seen_at'), we can significantly optimize this by filtering stale nodes directly in the SQL query. This avoids loading and parsing JSON for potentially hundreds or thousands of stale daemon nodes.

Additionally, 'max_concurrency' is defined as INTEGER in the migration, which maps to i32 in Postgres. Decoding it as i64 in query_as will cause a runtime type mismatch error on Postgres. Changing the tuple type to i32 resolves this compatibility issue.

    let cutoff = now_utc_epoch_seconds().saturating_sub(stale_after_seconds.max(1));
    let cutoff_str = cutoff.to_string();
    let rows = query_as::<_, (String, String, String, i32, String)>(&prepare_sql(
        database_backend,
        "SELECT id, provider_family, model_ids_json, max_concurrency, last_seen_at
         FROM daemon_nodes
         WHERE status = ?
           AND revoked = 0
           AND disabled = 0
           AND health_status = ?
           AND provider_family IS NOT NULL
           AND max_concurrency IS NOT NULL
           AND last_seen_at >= ?
         ORDER BY last_seen_at DESC, created_at ASC",
    ))
    .bind(STATUS_ACTIVE)
    .bind(HEALTH_STATUS_HEALTHY)
    .bind(cutoff_str)
    .fetch_all(database)
    .await
    .map_err(|error| AppError::infrastructure(error.to_string()))?;

Comment thread crates/mizan-api/src/daemon_nodes.rs Outdated
Comment on lines +671 to +674
let last_seen = parse_timestamp(&last_seen_at)?;
if last_seen < cutoff {
continue;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since stale nodes are now filtered directly in the SQL query, we can completely remove the in-memory 'last_seen < cutoff' check and the string-to-timestamp parsing, improving performance and simplifying the code.

Dimas and others added 3 commits June 7, 2026 12:07
Co-authored-by: multica-agent <github@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
Co-authored-by: multica-agent <github@multica.ai>
@wauputr4 wauputr4 merged commit 498ac2c into main Jun 7, 2026
1 check passed
@wauputr4 wauputr4 deleted the agent/dimas/58006f76 branch June 7, 2026 12:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b254d5ff4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

pricing_metadata: self.pricing_metadata.clone(),
region: self.region.clone(),
labels: self.labels.clone(),
health_status: Some("healthy".to_owned()),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report degraded health when the provider is down

In the daemon run loop, every heartbeat stores health_status = healthy unconditionally, and select_eligible_daemon_node later treats any fresh row with that value as dispatchable. When the daemon process is alive but its configured local provider or health endpoint is unavailable, the control plane will keep selecting this node instead of excluding it as unhealthy, causing requests to be routed to broken capacity until the heartbeat stops becoming fresh.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant