Skip to content

feat: support storing secrets/credentials in Postgres#2665

Draft
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:carbide-postgres-secrets
Draft

feat: support storing secrets/credentials in Postgres#2665
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:carbide-postgres-secrets

Conversation

@chet

@chet chet commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

This supports #353.

This introduces a new secrets/credentials storage backend of sorts, such that NICo can now keep its credentials in Postgres instead of Vault, encrypted (with envelope encryption, leaning on #939) so that a copy of the database on its own gives up nothing. This entirely feature/flagged and configurable, currently not enabled at all, but will let us:

  • Use it alongside Vault.
  • Use it to migrate away from Vault.

The encryption is layered. Every credential gets its own data encryption key (DEK), and that DEK is wrapped by a key encryption key (KEK) that lives outside the database -- in NICo itself or whatever KmsProvider we choose. The credential's path is mixed into the encryption as well. The KEK is the only thing an operator has to protect; everything else lives in Postgres (some reference here and here.

PostgresCredentialManager implements the same CredentialManager traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which the rotation work going on in #367 can use for history, rollback, rotation, etc.

Where the wrapping keys come from is pluggable through KmsBackend: local key material (integrated), or Vault/OpenBao Transit secrets engine API, and more than one provider at once. [secrets.routing] maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails at startup.

Moving an existing site off Vault could be:

  • A one-time import at startup -- deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with [secrets] set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on SecretsConfig.
  • A slow migration -- we would introduce it as the primary CredentialWriter, such that all new credentials get written to Postgres, with Vault still in the ChainedCredentialReader path, allowing us to slowly migrate over.

BUT, migration is NOT a part of this PR. This PR is for getting some initial code in place, and then we will iterate on it, and then do subsequent work for migration planning.

Rotating a KEK is a config change plus one command: point the route at the new key and run carbide-admin-cli secrets re-wrap. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove.

Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a4792a5a-f038-4e24-ba24-72483ae3550d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

This PR introduces a Postgres-backed secrets backend as an alternative to Vault KV. It adds AES-256-GCM envelope encryption with per-row DEKs wrapped by configurable KEKs, a prefix-based KEK routing layer, a one-time Vault-to-Postgres import flow, a re_wrap_stale operation to rotate DEKs to the currently active KEK, supporting KMS provider updates (AAD, kek_ids, Transit token renewal refactor), a gRPC ReWrapSecrets RPC, and an admin-cli secrets re-wrap subcommand.

Changes

Postgres Secrets Backend with KMS, Re-wrap, and Admin CLI

Layer / File(s) Summary
SecretId UUID, SecretRow model, DB schema, and DB access layer
crates/uuid/src/secret.rs, crates/uuid/src/lib.rs, crates/api-model/src/secrets.rs, crates/api-model/src/lib.rs, crates/api-db/migrations/20260416000305_create_secrets_table.sql, crates/api-db/src/secrets.rs, crates/api-db/src/lib.rs
SecretId typed UUID and SecretRow SQLx struct are defined. The secrets migration creates an append-only journal table with envelope-encryption columns and two indexes. The DB access layer adds all read, write, locking, pagination, and DEK-rotation functions.
KMS provider: AAD crypto, kek_ids trait, KeySource, transit cancellation
crates/kms-provider/src/crypto.rs, crates/kms-provider/src/lib.rs, crates/kms-provider/src/providers/integrated.rs, crates/kms-provider/src/providers/multi.rs, crates/kms-provider/src/providers/transit.rs, crates/kms-provider/Cargo.toml
encrypt/decrypt gain an aad: &[u8] parameter for AES-256-GCM authenticated data. KeySource::Value is removed; only Env and File remain. KmsBackend gains a required kek_ids() method implemented by all providers. TransitKmsProvider::start_token_renewal is replaced by a cancellable run_token_renewal future; sensitive buffers are wrapped in Zeroizing.
Vault client EnumerationMode and strict import helpers
crates/secrets/src/forge_vault.rs, crates/secrets/src/lib.rs
EnumerationMode (BestEffort/Strict) is added to Vault listing and reading. get_secrets_strict enumerates the full KV mount with strict error semantics. create_raw_vault_client_settings constructs a standalone VaultClientSettings for the Transit provider.
SecretRouting, PostgresCredentialManager, envelope encryption, metrics
crates/api-core/src/secrets/routing.rs, crates/api-core/src/secrets/mod.rs, crates/api-core/src/secrets/metrics.rs, crates/api-core/src/secrets/import.rs, crates/api-core/src/secrets/re_wrap.rs, crates/api-core/src/lib.rs
SecretRouting maps path prefixes to KEK IDs using longest-prefix-first matching with normalization and validation. PostgresCredentialManager implements CredentialReader/CredentialWriter with envelope encryption (path-bound AAD) and Vault tombstone semantics. SecretsMetrics/OperationTimer instrument operations via OpenTelemetry. import_secrets supports MissingOnly and All strategies. re_wrap_stale rotates DEKs in batches under a Postgres advisory lock.
SecretRouting unit tests
crates/api-core/src/secrets/routing.rs
Tests cover longest-prefix selection, catch-all matching, whole-segment prefix safety, and all from_config validation error cases.
SecretsConfig types, run.rs startup wiring, setup.rs refactor, RBAC
crates/api-core/src/cfg/file.rs, crates/api-core/src/run.rs, crates/api-core/src/setup.rs, crates/api-core/src/auth/internal_rbac_rules.rs, crates/api-core/src/test_support/builder.rs, crates/api-core/src/test_support/default_config.rs, crates/api-core/Cargo.toml
CarbideConfig gains an optional [secrets] section with SecretsConfig, KmsConfig, and ProviderConfig types. run.rs conditionally constructs a Postgres credential manager via build_kms_backend and import_vault_secrets_once, or falls back to the existing Vault/in-memory path. start_api receives a pre-built PgPool and SecretsContext. ReWrapSecrets is added to RBAC rules.
Proto RPC, Api struct, gRPC handler
crates/rpc/proto/forge.proto, crates/api-core/src/api.rs, crates/api-core/src/handlers/secrets.rs, crates/api-core/src/handlers/mod.rs
ReWrapSecrets RPC is added to the Forge service with request/response messages. Api gains a secrets_context field. The re_wrap_secrets handler clamps batch_size, checks the secrets context precondition, invokes re_wrap_stale, and returns counts.
Admin CLI secrets command
crates/admin-cli/src/cfg/cli_options.rs, crates/admin-cli/src/main.rs, crates/admin-cli/src/secrets/mod.rs, crates/admin-cli/src/secrets/re_wrap/...
A secrets top-level subcommand is added to CliCommand. The re-wrap subcommand accepts --batch-size, calls the ReWrapSecrets RPC, and prints re-wrapped vs already-current counts and stale-row status.
Unit and integration tests
crates/api-core/src/secrets/tests.rs, crates/api-integration-tests/tests/secrets_import.rs, crates/api-integration-tests/Cargo.toml
DB-backed tests cover journal semantics, create-only, delete, tie-breaking, rotation rollback, tombstone, AAD binding, and re_wrap_stale correctness. The integration test exercises the full Vault-to-Postgres import flow with round-trip verification, MissingOnly idempotency, and import completion marker.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as admin-cli
  participant API as api-core Forge gRPC
  participant Handler as handlers::secrets
  participant ReWrap as secrets::re_wrap_stale
  participant KMS as MultiKmsProvider
  participant DB as Postgres secrets table

  CLI->>API: ReWrapSecretsRequest { batch_size }
  API->>Handler: re_wrap_secrets(request)
  Handler->>Handler: clamp batch_size to [1, MAX]
  Handler->>ReWrap: re_wrap_stale(pool, kms, routing, batch_size)
  ReWrap->>DB: try_lock_path (advisory lock)
  loop each batch
    ReWrap->>DB: find_batch_after(after_seq, limit)
    DB-->>ReWrap: Vec~SecretRow~
    loop each row
      ReWrap->>KMS: decrypt_dek(row.kek_id, row.encrypted_dek)
      KMS-->>ReWrap: dek
      ReWrap->>KMS: encrypt_dek(new_kek_id, dek)
      KMS-->>ReWrap: new_encrypted_dek
    end
    ReWrap->>DB: update_dek_wrap (batch transaction)
  end
  ReWrap->>DB: count_wrapped_outside(routing_kek_ids)
  DB-->>ReWrap: stale_remaining
  ReWrap-->>Handler: ReWrapStaleResult
  Handler-->>API: ReWrapSecretsResponse
  API-->>CLI: { re_wrapped, already_current, stale_remaining }
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the primary change: introducing Postgres as a credential storage backend, which is the central feature of this substantial changeset.
Description check ✅ Passed The description is comprehensive and directly relevant to the changeset, detailing the feature's purpose, architecture, encryption approach, configuration, and testing coverage.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@chet

chet commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thanks!

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@chet: Absolutely, I will conduct a thorough review of the PR now. Thank you for the ping.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chet chet requested a review from wminckler June 16, 2026 20:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (9)
crates/api-db/src/secrets.rs (1)

124-130: ⚡ Quick win

Use query_scalar for single-boolean reads in try_lock_path and exists.

Both queries return exactly one scalar bool; query_scalar is clearer and removes tuple unpacking.

Proposed refactor
 pub async fn try_lock_path(txn: &mut PgConnection, path: &str) -> DatabaseResult<bool> {
     let sql = "SELECT pg_try_advisory_xact_lock(hashtextextended('secrets:' || $1, 0))";
-    let row: (bool,) = sqlx::query_as(sql)
+    let row: bool = sqlx::query_scalar(sql)
         .bind(path)
         .fetch_one(txn)
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(row.0)
+    Ok(row)
 }
 
 /// Whether any entries exist for the path.
 pub async fn exists(txn: impl DbReader<'_>, path: &str) -> DatabaseResult<bool> {
     let sql = "SELECT EXISTS(SELECT 1 FROM secrets WHERE path = $1)";
-    let row: (bool,) = sqlx::query_as(sql)
+    let row: bool = sqlx::query_scalar(sql)
         .bind(path)
         .fetch_one(txn)
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(row.0)
+    Ok(row)
 }

Based on learnings: in crates/api-db, prefer sqlx::query_scalar(query).bind(...).fetch_one(...) when fetching exactly one scalar value.

Also applies to: 135-141

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-db/src/secrets.rs` around lines 124 - 130, Replace the query
pattern in both the `try_lock_path` and `exists` functions that use
`sqlx::query_as` to fetch a single boolean value wrapped in a tuple. Instead of
using `query_as` with tuple unpacking, use `sqlx::query_scalar` which directly
fetches the scalar boolean value without requiring tuple unpacking. This makes
the code clearer and more idiomatic for single scalar value queries.

Source: Learnings

crates/admin-cli/src/secrets/re_wrap/args.rs (1)

20-27: ⚡ Quick win

Add worked EXAMPLES: help to the re-wrap args struct.

Line 21 defines a new operator-facing command args struct but omits the required #[command(after_long_help = "...")] examples block.

Suggested patch
 #[derive(Parser, Debug, Clone)]
+#[command(
+    after_long_help = "EXAMPLES:\n\
+\n\
+Run re-wrap with server defaults:\n\
+    $ nico-admin-cli secrets re-wrap\n\
+\n\
+Run re-wrap with an explicit batch size:\n\
+    $ nico-admin-cli secrets re-wrap --batch-size 1000\n\
+\n"
+)]
 pub struct Args {
     #[clap(
         long,
         help = "Rows scanned per batch during the walk. The server applies its own default and limits."
     )]
     pub batch_size: Option<u32>,
 }

As per coding guidelines, crates/admin-cli/**/*.rs should attach #[command(after_long_help = "...")] with a concrete EXAMPLES: section for realistic usage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/secrets/re_wrap/args.rs` around lines 20 - 27, The Args
struct in crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.

Source: Coding guidelines

crates/admin-cli/src/secrets/re_wrap/cmd.rs (1)

22-26: ⚡ Quick win

Wrap the RPC failure with operator-facing context.

Line 25 should add a context segment so failure chains read as a coherent “while attempting to …” story.

Suggested patch
 use crate::errors::CarbideCliResult;
+use eyre::WrapErr as _;
 
 use crate::rpc::ApiClient;
 
 pub async fn re_wrap(api_client: &ApiClient, batch_size: Option<u32>) -> CarbideCliResult<()> {
     let request = ::rpc::forge::ReWrapSecretsRequest { batch_size };
 
-    let resp = api_client.0.re_wrap_secrets(request).await?;
+    let resp = api_client
+        .0
+        .re_wrap_secrets(request)
+        .await
+        .wrap_err("while attempting to re-wrap stale secrets via Forge API")?;

As per coding guidelines, admin-cli error chains should include .context()/wrap_err() phrased “while attempting to …” for actionable operator troubleshooting.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs` around lines 22 - 26, The RPC
call to re_wrap_secrets in the re_wrap function does not wrap its error with
operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.

Source: Coding guidelines

crates/secrets/src/forge_vault.rs (3)

770-780: ⚡ Quick win

Derive Eq with PartialEq to keep the new enum clippy-clean.

EnumerationMode has only fieldless variants, so Eq is correct and avoids clippy noise in a warnings-as-errors workspace. As per coding guidelines, “we enable all clippy lints by default, and treat all warnings as errors.”

♻️ Proposed fix
-#[derive(Clone, Copy, PartialEq)]
+#[derive(Clone, Copy, PartialEq, Eq)]
 enum EnumerationMode {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 770 - 780, The
EnumerationMode enum derives Clone, Copy, and PartialEq, but since it only
contains fieldless variants, it should also derive Eq to satisfy clippy
requirements in the warnings-as-errors workspace. Add Eq to the derive attribute
list for the EnumerationMode enum to keep it clippy-clean and avoid warnings.

Source: Coding guidelines


1105-1108: ⚡ Quick win

Make the static-token requirement explicit in the returned error.

For Transit KMS on service-account-auth deployments, vault_config.token()? currently reports only VAULT_TOKEN. Add context so startup points directly at the unsupported auth mode instead of looking like a missing generic Vault setting.

♻️ Proposed fix
+    let token = vault_config.token().wrap_err(
+        "transit KMS requires vault.token or VAULT_TOKEN; Kubernetes service-account auth \
+         is not supported by create_raw_vault_client_settings yet",
+    )?;
     let mut builder = VaultClientSettingsBuilder::default();
     builder
-        .token(vault_config.token()?)
+        .token(token)
         .address(vault_config.address()?)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 1105 - 1108, The error
returned by vault_config.token()? when building the VaultClientSettingsBuilder
is too generic and doesn't clarify the root cause. Enhance the error handling
around the vault_config.token()? call to provide a more explicit error message
that indicates static-token authentication is required but unsupported for
Transit KMS on service-account-auth deployments, so users understand the
authentication mode limitation rather than treating it as a missing generic
Vault setting.

833-835: ⚡ Quick win

Log Vault enumeration errors as structured fields.

These new warnings interpolate e into the message, which makes error filtering harder. Put the error in an attribute so logfmt consumers can search by error, prefix, and path. As per coding guidelines, “prefer placing common fields as attributes passed to tracing function, instead of using string interpolation.”

♻️ Proposed fix
                     tracing::warn!(
                         prefix = %dir,
-                        "failed to list vault path: {e}"
+                        error = %e,
+                        "failed to list vault path"
                     );
                     tracing::warn!(
                         path = %path,
-                        "failed to read: {e}"
+                        error = %e,
+                        "failed to read vault secret"
                     );

Also applies to: 937-939

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 833 - 835, The warning logs
in the vault path enumeration are interpolating the error `e` directly into the
message string using `{e}`, which makes error filtering difficult for logfmt
consumers. Refactor the tracing::warn! calls to pass the error as a structured
attribute instead of including it in the message string. Remove the `{e}` from
the message and add `error = %e` (or similar attribute syntax) as a parameter to
the tracing::warn! macro, keeping the existing attributes like `prefix`. This
change needs to be applied at the location shown (lines 833-835 in the failed to
list vault path warning) and also at the similar warning around lines 937-939 as
mentioned in the comment.

Source: Coding guidelines

crates/secrets/src/lib.rs (1)

23-27: ⚡ Quick win

Update the re-export comment for the new Transit helper.

The block now exposes create_raw_vault_client_settings, which is not for CertificateProvider usage. Please broaden the comment so the public API intent remains accurate.

♻️ Proposed fix
-/// Exposed for `CertificateProvider` usage only. Credential operations should go
-/// through `create_credential_manager` instead of using the vault client directly.
+/// Exposed for direct Vault integrations such as `CertificateProvider` and
+/// Transit KMS setup. Credential operations should go through
+/// `create_credential_manager` instead of using the vault client directly.
 pub use crate::forge_vault::{
     ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, create_vault_client,
 };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/lib.rs` around lines 23 - 27, The comment above the
re-export block in crate::forge_vault states that the exports are only for
CertificateProvider usage, but this is no longer accurate now that
create_raw_vault_client_settings is included in the public API. Broaden the
comment to accurately describe the public API intent for all re-exported items
(ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, and
create_vault_client), making it clear that these are exposed for multiple
purposes beyond just CertificateProvider usage.
crates/api-core/Cargo.toml (1)

57-61: ⚡ Quick win

Keep the new internal dependency in alphabetized order.

carbide-kms-provider should sit before carbide-libmlx, not after carbide-rpc-utils, to match the manifest’s ordering contract.

♻️ Proposed fix
 carbide-ipmi = { path = "../ipmi" }
 carbide-ipxe-renderer = { path = "../ipxe-renderer" }
+carbide-kms-provider = { path = "../kms-provider" }
 carbide-libmlx = { path = "../libmlx" }
@@
 carbide-rpc = { path = "../rpc", features = ["sqlx", "model"] }
 carbide-rpc-utils = { path = "../rpc-utils" }
-carbide-kms-provider = { path = "../kms-provider" }
 carbide-secrets = { path = "../secrets" }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/Cargo.toml` around lines 57 - 61, The internal dependencies
in the Cargo.toml file are not in alphabetical order. Move the
carbide-kms-provider dependency to its correct alphabetical position, which
should be before carbide-redfish, not after carbide-rpc-utils. Reorder all the
dependencies shown (carbide-redfish, carbide-rpc, carbide-rpc-utils,
carbide-kms-provider, carbide-secrets) so they appear in strict alphabetical
order by their crate names.
crates/api-core/src/test_support/builder.rs (1)

51-65: ⚡ Quick win

Allow tests to build a secrets-enabled Api.

Hard-coding secrets_context: None means handler tests for the new secrets/re-wrap path cannot use TestApiBuilder. Add an optional builder field and with_secrets_context method so tests can exercise both enabled and disabled cases.

♻️ Proposed fix
 pub struct TestApiBuilder {
@@
     ib_fabric_manager: Option<Arc<dyn IBFabricManager>>,
     component_manager: Option<Arc<component_manager::component_manager::ComponentManager>>,
+    secrets_context: Option<crate::secrets::SecretsContext>,
 }
@@
             ib_fabric_manager: None,
             component_manager: None,
+            secrets_context: None,
         }
     }
@@
     pub fn with_component_manager(
         self,
         component_manager: Arc<component_manager::component_manager::ComponentManager>,
     ) -> Self {
@@
         }
     }
+
+    pub fn with_secrets_context(self, secrets_context: crate::secrets::SecretsContext) -> Self {
+        Self {
+            secrets_context: Some(secrets_context),
+            ..self
+        }
+    }
@@
-            secrets_context: None,
+            secrets_context: self.secrets_context,
         }
     }
 }

Also applies to: 73-87, 153-161, 254-254

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/test_support/builder.rs` around lines 51 - 65, The
TestApiBuilder struct currently hard-codes secrets_context to None, preventing
tests from building an Api with secrets enabled. Add a new optional field (of
appropriate secrets context type) to the TestApiBuilder struct definition,
implement a with_secrets_context builder method that allows setting this field,
and ensure the field is properly passed through and used when the Api is
constructed so that tests can exercise both enabled and disabled cases for the
secrets/re-wrap path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/admin-cli/src/cfg/cli_options.rs`:
- Around line 205-206: The new Secrets variant added to the CLI options enum in
cli_options.rs at lines 205–206 lacks a corresponding domain mapping entry in
cli_domains.yaml, which will cause the every_command_has_a_domain test to fail
during CI. Add a secrets entry under the appropriate domain list (admin or
tenant, based on the operational context of the secrets command) in
cli_domains.yaml to ensure the command is properly registered in the domain
configuration.

In `@crates/api-core/src/cfg/file.rs`:
- Around line 863-891: Add the `#[serde(deny_unknown_fields)]` attribute to the
SecretsConfig struct to reject misspelled configuration keys like `import_fom`
instead of silently ignoring them. Additionally, apply the same
`#[serde(deny_unknown_fields)]` attribute to the KmsConfig struct (located
around lines 903-910) for consistency. Finally, add a regression test that
verifies the configuration parser correctly rejects a misspelled optional field
(for example, `import_fom` instead of `import_from`) in the `[secrets]` section,
ensuring startup validation prevents such typos.

In `@crates/api-core/src/run.rs`:
- Around line 515-560: The code holds the transaction lock_txn (which contains a
pooled connection and advisory lock) while performing long-running operations
like vault_client.get_secrets_strict().await and
crate::secrets::import_secrets().await. This creates a deadlock risk during
concurrent replica startup where other waiters can block on the same lock while
occupying remaining pool connections. Restructure the code to use short
transactions: acquire the lock and check completion status in one short
transaction, then release it before calling
vault_client.get_secrets_strict().await and the import_secrets function, and
finally use another short transaction (reusing the same connection if possible,
or using a distinct lock key) to mark the import as complete. This applies at
the location shown (lines 515-560, specifically around the vault enumeration and
import_secrets calls) and also at lines 569-576 where similar long-running
operations may be awaited while holding a transaction.

In `@crates/api-core/src/secrets/re_wrap.rs`:
- Around line 72-88: The re-wrap operation holds a lock transaction from the
pool while the loop in `find_batch_after` and subsequent batch operations try to
acquire additional connections, causing potential deadlock under pool
saturation. Replace the transaction-based locking approach with a single
dedicated connection obtained from the pool at the start. Use a session advisory
lock (via db::secrets::try_lock_path) on that dedicated connection instead of a
transaction, then execute all scan and update operations on that same connection
throughout the loop to avoid holding the transaction open across awaited work.
Apply the same fix at the second location mentioned (lines 126-129) where
similar transaction holding occurs.

In `@crates/api-core/src/secrets/routing.rs`:
- Around line 65-91: In the from_config method's validation loop (where prefix
and kek_id are validated), add a check to reject any prefix that starts with "/"
unless it matches the CATCH_ALL constant. This prevents prefixes like
"/machines/bmc" from being silently misrouted to the catch-all, which could
cause writes to be encrypted under the wrong KEK. Add this validation check
after the empty prefix check and before the normalized prefix processing.

In `@crates/api-db/migrations/20260416000305_create_secrets_table.sql`:
- Around line 21-22: The `seq` column in the migration file uses `BIGINT
GENERATED ALWAYS AS IDENTITY` but is missing a `UNIQUE` constraint. Since
PostgreSQL's GENERATED ALWAYS AS IDENTITY does not enforce uniqueness by itself,
duplicates can occur from sequence resets or manual insertions, breaking the
pagination contract used by functions like `find_batch_after` that rely on `seq
> $1` ordering. Add the `UNIQUE` constraint to the `seq` column definition in
the CREATE TABLE statement to ensure deterministic batch ordering and pagination
behavior.

In `@crates/api-integration-tests/tests/secrets_import.rs`:
- Around line 31-35: The allocate_port() function at lines 31-35 releases the
TcpListener immediately after binding, creating a race condition where another
process can claim the port before it's used at lines 237-238. Fix this by either
keeping the listener alive and passing it to the Vault startup code instead of
just the port number, or by implementing a retry mechanism that attempts to bind
to the port immediately before starting Vault (lines 237-238), retrying if the
port becomes unavailable due to race conditions. This ensures the port
reservation and Vault startup happen atomically without releasing the binding in
between.
- Around line 216-219: The call to db::migrations::MIGRATOR.run() uses .expect()
which causes a panic that bypasses the cleanup logic if migrations fail, leaking
temporary databases. Replace the .expect() call with a fallible pattern using
the ? operator, and restructure the code to ensure the drop/cleanup logic
(currently at lines 225-228) always executes after setup and exercise,
regardless of whether migrations succeed or fail. This typically means using a
guard pattern or ensuring cleanup runs in a finally-like block that is
guaranteed to execute even when earlier operations return errors.
- Around line 212-214: The string concatenation used to build test_db_url at
line 212 breaks when the original db_url contains a database path or query
parameters. Instead of using format!("{db_url}/{test_db_name}"), parse the
db_url into its connection components using the appropriate PostgreSQL URL
parsing utilities, modify the database name field in the parsed options to use
test_db_name, and then reconstruct the connection URL from the parsed options.
This ensures existing query parameters and path components in the original URL
are preserved when building the test database connection string.

In `@crates/rpc/proto/forge.proto`:
- Around line 8550-8554: The `batch_size` field in the `ReWrapSecretsRequest`
message lacks documentation about its validation contract. Update the comment
for the `batch_size` field to explicitly document the semantics for when the
field is unset (its default behavior), what happens if zero is provided, and any
upper-bound constraints the server enforces. This ensures that generated clients
understand the expected validation rules and can properly handle edge cases.
- Around line 946-947: In the crates/rpc/build.rs file, add explicit serde
Serialize annotations for the ReWrapSecretsRequest and ReWrapSecretsResponse
message types using the .type_attribute() method calls, similar to how
comparable gRPC Request/Response types like ListOsImageRequest and
GetBmcCredentialsRequest are configured. This will ensure that these types have
the necessary Serialize trait implementation required by the log_request_data()
function called in the handler at crates/api-core/src/handlers/secrets.rs,
enabling proper request logging and serialization.

---

Nitpick comments:
In `@crates/admin-cli/src/secrets/re_wrap/args.rs`:
- Around line 20-27: The Args struct in
crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.

In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs`:
- Around line 22-26: The RPC call to re_wrap_secrets in the re_wrap function
does not wrap its error with operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.

In `@crates/api-core/Cargo.toml`:
- Around line 57-61: The internal dependencies in the Cargo.toml file are not in
alphabetical order. Move the carbide-kms-provider dependency to its correct
alphabetical position, which should be before carbide-redfish, not after
carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish,
carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they
appear in strict alphabetical order by their crate names.

In `@crates/api-core/src/test_support/builder.rs`:
- Around line 51-65: The TestApiBuilder struct currently hard-codes
secrets_context to None, preventing tests from building an Api with secrets
enabled. Add a new optional field (of appropriate secrets context type) to the
TestApiBuilder struct definition, implement a with_secrets_context builder
method that allows setting this field, and ensure the field is properly passed
through and used when the Api is constructed so that tests can exercise both
enabled and disabled cases for the secrets/re-wrap path.

In `@crates/api-db/src/secrets.rs`:
- Around line 124-130: Replace the query pattern in both the `try_lock_path` and
`exists` functions that use `sqlx::query_as` to fetch a single boolean value
wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use
`sqlx::query_scalar` which directly fetches the scalar boolean value without
requiring tuple unpacking. This makes the code clearer and more idiomatic for
single scalar value queries.

In `@crates/secrets/src/forge_vault.rs`:
- Around line 770-780: The EnumerationMode enum derives Clone, Copy, and
PartialEq, but since it only contains fieldless variants, it should also derive
Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to
the derive attribute list for the EnumerationMode enum to keep it clippy-clean
and avoid warnings.
- Around line 1105-1108: The error returned by vault_config.token()? when
building the VaultClientSettingsBuilder is too generic and doesn't clarify the
root cause. Enhance the error handling around the vault_config.token()? call to
provide a more explicit error message that indicates static-token authentication
is required but unsupported for Transit KMS on service-account-auth deployments,
so users understand the authentication mode limitation rather than treating it
as a missing generic Vault setting.
- Around line 833-835: The warning logs in the vault path enumeration are
interpolating the error `e` directly into the message string using `{e}`, which
makes error filtering difficult for logfmt consumers. Refactor the
tracing::warn! calls to pass the error as a structured attribute instead of
including it in the message string. Remove the `{e}` from the message and add
`error = %e` (or similar attribute syntax) as a parameter to the tracing::warn!
macro, keeping the existing attributes like `prefix`. This change needs to be
applied at the location shown (lines 833-835 in the failed to list vault path
warning) and also at the similar warning around lines 937-939 as mentioned in
the comment.

In `@crates/secrets/src/lib.rs`:
- Around line 23-27: The comment above the re-export block in crate::forge_vault
states that the exports are only for CertificateProvider usage, but this is no
longer accurate now that create_raw_vault_client_settings is included in the
public API. Broaden the comment to accurately describe the public API intent for
all re-exported items (ForgeVaultClient, VaultConfig,
create_raw_vault_client_settings, and create_vault_client), making it clear that
these are exposed for multiple purposes beyond just CertificateProvider usage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9bd7cccb-fd3d-47e1-9d17-4ccbe5fee75a

📥 Commits

Reviewing files that changed from the base of the PR and between 7eb1969 and b171e20.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (41)
  • crates/admin-cli/src/cfg/cli_options.rs
  • crates/admin-cli/src/main.rs
  • crates/admin-cli/src/secrets/mod.rs
  • crates/admin-cli/src/secrets/re_wrap/args.rs
  • crates/admin-cli/src/secrets/re_wrap/cmd.rs
  • crates/admin-cli/src/secrets/re_wrap/mod.rs
  • crates/api-core/Cargo.toml
  • crates/api-core/src/api.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/handlers/mod.rs
  • crates/api-core/src/handlers/secrets.rs
  • crates/api-core/src/lib.rs
  • crates/api-core/src/run.rs
  • crates/api-core/src/secrets/import.rs
  • crates/api-core/src/secrets/metrics.rs
  • crates/api-core/src/secrets/mod.rs
  • crates/api-core/src/secrets/re_wrap.rs
  • crates/api-core/src/secrets/routing.rs
  • crates/api-core/src/secrets/tests.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-db/migrations/20260416000305_create_secrets_table.sql
  • crates/api-db/src/lib.rs
  • crates/api-db/src/secrets.rs
  • crates/api-integration-tests/Cargo.toml
  • crates/api-integration-tests/tests/secrets_import.rs
  • crates/api-model/src/lib.rs
  • crates/api-model/src/secrets.rs
  • crates/kms-provider/Cargo.toml
  • crates/kms-provider/src/crypto.rs
  • crates/kms-provider/src/lib.rs
  • crates/kms-provider/src/providers/integrated.rs
  • crates/kms-provider/src/providers/multi.rs
  • crates/kms-provider/src/providers/transit.rs
  • crates/rpc/proto/forge.proto
  • crates/secrets/src/forge_vault.rs
  • crates/secrets/src/lib.rs
  • crates/uuid/src/lib.rs
  • crates/uuid/src/secret.rs

Comment thread crates/admin-cli/src/cfg/cli_options.rs
Comment thread crates/api-core/src/cfg/file.rs
Comment thread crates/api-core/src/run.rs Outdated
Comment thread crates/api-core/src/secrets/re_wrap.rs Outdated
Comment thread crates/api-core/src/secrets/routing.rs
Comment on lines +31 to +35
/// allocate_port picks a free port by binding to port 0.
fn allocate_port() -> SocketAddr {
let listener = TcpListener::bind("127.0.0.1:0").expect("bind to free port");
listener.local_addr().expect("local addr")
}

@coderabbitai coderabbitai Bot Jun 16, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Port reservation is racy and can cause intermittent test failures.

Line 31-35 releases the listener before Line 238 starts Vault, so another process can claim the port in between. This is a classic TOCTOU race and a common CI flake source for integration tests. Prefer a helper that binds and starts atomically (or retries on bind conflicts).

As per coding guidelines, integration tests should be deterministic and avoid flaky behavior.

Also applies to: 237-238

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-integration-tests/tests/secrets_import.rs` around lines 31 - 35,
The allocate_port() function at lines 31-35 releases the TcpListener immediately
after binding, creating a race condition where another process can claim the
port before it's used at lines 237-238. Fix this by either keeping the listener
alive and passing it to the Vault startup code instead of just the port number,
or by implementing a retry mechanism that attempts to bind to the port
immediately before starting Vault (lines 237-238), retrying if the port becomes
unavailable due to race conditions. This ensures the port reservation and Vault
startup happen atomically without releasing the binding in between.

Source: Coding guidelines

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai Fair -- it is a real TOCTOU, but the atomic bind-and-launch would have to live in how api_test_helper spawns the external Vault process, not in this test, so it's out of scope for this PR. Leaving as-is. Thanks for checking!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chet, understood — the atomic bind-and-launch fix is appropriately scoped to the api_test_helper process spawner, not this test file. Happy to leave it as-is here.

Would you like me to open a follow-up GitHub issue to track the api_test_helper improvement so it is not lost?

Comment thread crates/api-integration-tests/tests/secrets_import.rs Outdated
Comment thread crates/api-integration-tests/tests/secrets_import.rs Outdated
Comment thread crates/rpc/proto/forge.proto
Comment thread crates/rpc/proto/forge.proto
Carbide can now keep its credentials in Postgres instead of Vault,
encrypted so that a copy of the database on its own gives up nothing.
This swaps out Vault KV as the credential store; Vault still issues the
PKI certificates.

The encryption is layered. Every credential gets its own data key, and
that data key is wrapped by a key encryption key (KEK) that lives
outside the database -- in carbide's own config or an external KMS. The
credential's path is mixed into the encryption as well, so a row lifted
out and dropped onto another path will not decrypt. The KEK is the only
thing an operator has to protect; everything else lives in Postgres.

`PostgresCredentialManager` implements the same `CredentialManager`
traits the rest of the system already uses, so nothing downstream had to
change. It keeps the two reader behaviors callers learned from Vault:
the newest write for a path wins, and an empty-password entry reads as
"no credential" (several delete paths still record that tombstone). The
store is an append-only journal -- one row per write, newest-per-path on
read -- which is what the upcoming rotation work uses for history and
rollback.

Where the wrapping keys come from is pluggable through `KmsBackend`:
local key material (`integrated`), or Vault/OpenBao Transit, and more
than one provider at once. `[secrets.routing]` maps path prefixes to the
KEK that encrypts new writes under them. Startup cross-checks the routing
against the providers, so a misspelled, duplicated, or colliding key
fails the boot instead of some write hours later.

Moving an existing site off Vault is a one-time import at startup, and it
is deliberately all-or-nothing: any list or read failure, or an empty
listing, aborts the boot rather than recording a half-finished import as
done, and only one replica runs it at a time behind a Postgres advisory
lock. Once it completes, Vault is out of the credential chain entirely --
with `[secrets]` set, the chain is env -> file -> postgres and nothing
falls back to Vault. Prerequisites that live outside this process
(services that still read Vault directly, mixed-fleet rolling upgrades)
are spelled out on `SecretsConfig`.

Rotating a KEK is a config change plus one command: point the route at
the new key and run `carbide-admin-cli secrets re-wrap`. Only the
per-row data-key wrapping is redone; the encrypted values are never
touched. The re-wrap makes its KMS calls outside the write transaction,
runs one at a time, and reports how many rows still sit on a retired key
so you know when the old KEK is safe to remove.

Tests cover the manager round-trip, journal write-order under equal
timestamps, rotation rollback by deleting the newest entry, tombstone
reads, create conflicts, re-wrap idempotence and counting, and a
path-binding regression that confirms a transplanted row will not
decrypt -- plus an end-to-end Vault import against a real Vault dev
server.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet force-pushed the carbide-postgres-secrets branch from b171e20 to db91806 Compare June 16, 2026 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant