feat: support storing secrets/credentials in Postgres#2665
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR introduces a Postgres-backed secrets backend as an alternative to Vault KV. It adds AES-256-GCM envelope encryption with per-row DEKs wrapped by configurable KEKs, a prefix-based KEK routing layer, a one-time Vault-to-Postgres import flow, a ChangesPostgres Secrets Backend with KMS, Re-wrap, and Admin CLI
Sequence Diagram(s)sequenceDiagram
participant CLI as admin-cli
participant API as api-core Forge gRPC
participant Handler as handlers::secrets
participant ReWrap as secrets::re_wrap_stale
participant KMS as MultiKmsProvider
participant DB as Postgres secrets table
CLI->>API: ReWrapSecretsRequest { batch_size }
API->>Handler: re_wrap_secrets(request)
Handler->>Handler: clamp batch_size to [1, MAX]
Handler->>ReWrap: re_wrap_stale(pool, kms, routing, batch_size)
ReWrap->>DB: try_lock_path (advisory lock)
loop each batch
ReWrap->>DB: find_batch_after(after_seq, limit)
DB-->>ReWrap: Vec~SecretRow~
loop each row
ReWrap->>KMS: decrypt_dek(row.kek_id, row.encrypted_dek)
KMS-->>ReWrap: dek
ReWrap->>KMS: encrypt_dek(new_kek_id, dek)
KMS-->>ReWrap: new_encrypted_dek
end
ReWrap->>DB: update_dek_wrap (batch transaction)
end
ReWrap->>DB: count_wrapped_outside(routing_kek_ids)
DB-->>ReWrap: stale_remaining
ReWrap-->>Handler: ReWrapStaleResult
Handler-->>API: ReWrapSecretsResponse
API-->>CLI: { re_wrapped, already_current, stale_remaining }
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai PTAL, thanks! |
|
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (9)
crates/api-db/src/secrets.rs (1)
124-130: ⚡ Quick winUse
query_scalarfor single-boolean reads intry_lock_pathandexists.Both queries return exactly one scalar
bool;query_scalaris clearer and removes tuple unpacking.Proposed refactor
pub async fn try_lock_path(txn: &mut PgConnection, path: &str) -> DatabaseResult<bool> { let sql = "SELECT pg_try_advisory_xact_lock(hashtextextended('secrets:' || $1, 0))"; - let row: (bool,) = sqlx::query_as(sql) + let row: bool = sqlx::query_scalar(sql) .bind(path) .fetch_one(txn) .await .map_err(|e| DatabaseError::query(sql, e))?; - Ok(row.0) + Ok(row) } /// Whether any entries exist for the path. pub async fn exists(txn: impl DbReader<'_>, path: &str) -> DatabaseResult<bool> { let sql = "SELECT EXISTS(SELECT 1 FROM secrets WHERE path = $1)"; - let row: (bool,) = sqlx::query_as(sql) + let row: bool = sqlx::query_scalar(sql) .bind(path) .fetch_one(txn) .await .map_err(|e| DatabaseError::query(sql, e))?; - Ok(row.0) + Ok(row) }Based on learnings: in
crates/api-db, prefersqlx::query_scalar(query).bind(...).fetch_one(...)when fetching exactly one scalar value.Also applies to: 135-141
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-db/src/secrets.rs` around lines 124 - 130, Replace the query pattern in both the `try_lock_path` and `exists` functions that use `sqlx::query_as` to fetch a single boolean value wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use `sqlx::query_scalar` which directly fetches the scalar boolean value without requiring tuple unpacking. This makes the code clearer and more idiomatic for single scalar value queries.Source: Learnings
crates/admin-cli/src/secrets/re_wrap/args.rs (1)
20-27: ⚡ Quick winAdd worked
EXAMPLES:help to the re-wrap args struct.Line 21 defines a new operator-facing command args struct but omits the required
#[command(after_long_help = "...")]examples block.Suggested patch
#[derive(Parser, Debug, Clone)] +#[command( + after_long_help = "EXAMPLES:\n\ +\n\ +Run re-wrap with server defaults:\n\ + $ nico-admin-cli secrets re-wrap\n\ +\n\ +Run re-wrap with an explicit batch size:\n\ + $ nico-admin-cli secrets re-wrap --batch-size 1000\n\ +\n" +)] pub struct Args { #[clap( long, help = "Rows scanned per batch during the walk. The server applies its own default and limits." )] pub batch_size: Option<u32>, }As per coding guidelines,
crates/admin-cli/**/*.rsshould attach#[command(after_long_help = "...")]with a concreteEXAMPLES:section for realistic usage.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/secrets/re_wrap/args.rs` around lines 20 - 27, The Args struct in crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required #[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add this macro attribute to the Args struct definition (right after the #[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that demonstrates realistic usage of the re-wrap command with the batch_size parameter, following the coding guidelines established for admin-cli command argument structs.Source: Coding guidelines
crates/admin-cli/src/secrets/re_wrap/cmd.rs (1)
22-26: ⚡ Quick winWrap the RPC failure with operator-facing context.
Line 25 should add a context segment so failure chains read as a coherent “while attempting to …” story.
Suggested patch
use crate::errors::CarbideCliResult; +use eyre::WrapErr as _; use crate::rpc::ApiClient; pub async fn re_wrap(api_client: &ApiClient, batch_size: Option<u32>) -> CarbideCliResult<()> { let request = ::rpc::forge::ReWrapSecretsRequest { batch_size }; - let resp = api_client.0.re_wrap_secrets(request).await?; + let resp = api_client + .0 + .re_wrap_secrets(request) + .await + .wrap_err("while attempting to re-wrap stale secrets via Forge API")?;As per coding guidelines, admin-cli error chains should include
.context()/wrap_err()phrased “while attempting to …” for actionable operator troubleshooting.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs` around lines 22 - 26, The RPC call to re_wrap_secrets in the re_wrap function does not wrap its error with operator-facing context. Modify the await result of api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err() call with a message phrased as "while attempting to ..." (such as "while attempting to re-wrap secrets") before the ? operator, so that error chains provide coherent, actionable troubleshooting information for operators as per admin-cli coding guidelines.Source: Coding guidelines
crates/secrets/src/forge_vault.rs (3)
770-780: ⚡ Quick winDerive
EqwithPartialEqto keep the new enum clippy-clean.
EnumerationModehas only fieldless variants, soEqis correct and avoids clippy noise in a warnings-as-errors workspace. As per coding guidelines, “we enable all clippy lints by default, and treat all warnings as errors.”♻️ Proposed fix
-#[derive(Clone, Copy, PartialEq)] +#[derive(Clone, Copy, PartialEq, Eq)] enum EnumerationMode {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/forge_vault.rs` around lines 770 - 780, The EnumerationMode enum derives Clone, Copy, and PartialEq, but since it only contains fieldless variants, it should also derive Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to the derive attribute list for the EnumerationMode enum to keep it clippy-clean and avoid warnings.Source: Coding guidelines
1105-1108: ⚡ Quick winMake the static-token requirement explicit in the returned error.
For Transit KMS on service-account-auth deployments,
vault_config.token()?currently reports onlyVAULT_TOKEN. Add context so startup points directly at the unsupported auth mode instead of looking like a missing generic Vault setting.♻️ Proposed fix
+ let token = vault_config.token().wrap_err( + "transit KMS requires vault.token or VAULT_TOKEN; Kubernetes service-account auth \ + is not supported by create_raw_vault_client_settings yet", + )?; let mut builder = VaultClientSettingsBuilder::default(); builder - .token(vault_config.token()?) + .token(token) .address(vault_config.address()?)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/forge_vault.rs` around lines 1105 - 1108, The error returned by vault_config.token()? when building the VaultClientSettingsBuilder is too generic and doesn't clarify the root cause. Enhance the error handling around the vault_config.token()? call to provide a more explicit error message that indicates static-token authentication is required but unsupported for Transit KMS on service-account-auth deployments, so users understand the authentication mode limitation rather than treating it as a missing generic Vault setting.
833-835: ⚡ Quick winLog Vault enumeration errors as structured fields.
These new warnings interpolate
einto the message, which makes error filtering harder. Put the error in an attribute so logfmt consumers can search byerror,prefix, andpath. As per coding guidelines, “prefer placing common fields as attributes passed to tracing function, instead of using string interpolation.”♻️ Proposed fix
tracing::warn!( prefix = %dir, - "failed to list vault path: {e}" + error = %e, + "failed to list vault path" );tracing::warn!( path = %path, - "failed to read: {e}" + error = %e, + "failed to read vault secret" );Also applies to: 937-939
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/forge_vault.rs` around lines 833 - 835, The warning logs in the vault path enumeration are interpolating the error `e` directly into the message string using `{e}`, which makes error filtering difficult for logfmt consumers. Refactor the tracing::warn! calls to pass the error as a structured attribute instead of including it in the message string. Remove the `{e}` from the message and add `error = %e` (or similar attribute syntax) as a parameter to the tracing::warn! macro, keeping the existing attributes like `prefix`. This change needs to be applied at the location shown (lines 833-835 in the failed to list vault path warning) and also at the similar warning around lines 937-939 as mentioned in the comment.Source: Coding guidelines
crates/secrets/src/lib.rs (1)
23-27: ⚡ Quick winUpdate the re-export comment for the new Transit helper.
The block now exposes
create_raw_vault_client_settings, which is not forCertificateProviderusage. Please broaden the comment so the public API intent remains accurate.♻️ Proposed fix
-/// Exposed for `CertificateProvider` usage only. Credential operations should go -/// through `create_credential_manager` instead of using the vault client directly. +/// Exposed for direct Vault integrations such as `CertificateProvider` and +/// Transit KMS setup. Credential operations should go through +/// `create_credential_manager` instead of using the vault client directly. pub use crate::forge_vault::{ ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, create_vault_client, };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/lib.rs` around lines 23 - 27, The comment above the re-export block in crate::forge_vault states that the exports are only for CertificateProvider usage, but this is no longer accurate now that create_raw_vault_client_settings is included in the public API. Broaden the comment to accurately describe the public API intent for all re-exported items (ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, and create_vault_client), making it clear that these are exposed for multiple purposes beyond just CertificateProvider usage.crates/api-core/Cargo.toml (1)
57-61: ⚡ Quick winKeep the new internal dependency in alphabetized order.
carbide-kms-providershould sit beforecarbide-libmlx, not aftercarbide-rpc-utils, to match the manifest’s ordering contract.♻️ Proposed fix
carbide-ipmi = { path = "../ipmi" } carbide-ipxe-renderer = { path = "../ipxe-renderer" } +carbide-kms-provider = { path = "../kms-provider" } carbide-libmlx = { path = "../libmlx" } @@ carbide-rpc = { path = "../rpc", features = ["sqlx", "model"] } carbide-rpc-utils = { path = "../rpc-utils" } -carbide-kms-provider = { path = "../kms-provider" } carbide-secrets = { path = "../secrets" }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/Cargo.toml` around lines 57 - 61, The internal dependencies in the Cargo.toml file are not in alphabetical order. Move the carbide-kms-provider dependency to its correct alphabetical position, which should be before carbide-redfish, not after carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish, carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they appear in strict alphabetical order by their crate names.crates/api-core/src/test_support/builder.rs (1)
51-65: ⚡ Quick winAllow tests to build a secrets-enabled
Api.Hard-coding
secrets_context: Nonemeans handler tests for the new secrets/re-wrap path cannot useTestApiBuilder. Add an optional builder field andwith_secrets_contextmethod so tests can exercise both enabled and disabled cases.♻️ Proposed fix
pub struct TestApiBuilder { @@ ib_fabric_manager: Option<Arc<dyn IBFabricManager>>, component_manager: Option<Arc<component_manager::component_manager::ComponentManager>>, + secrets_context: Option<crate::secrets::SecretsContext>, } @@ ib_fabric_manager: None, component_manager: None, + secrets_context: None, } } @@ pub fn with_component_manager( self, component_manager: Arc<component_manager::component_manager::ComponentManager>, ) -> Self { @@ } } + + pub fn with_secrets_context(self, secrets_context: crate::secrets::SecretsContext) -> Self { + Self { + secrets_context: Some(secrets_context), + ..self + } + } @@ - secrets_context: None, + secrets_context: self.secrets_context, } } }Also applies to: 73-87, 153-161, 254-254
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/test_support/builder.rs` around lines 51 - 65, The TestApiBuilder struct currently hard-codes secrets_context to None, preventing tests from building an Api with secrets enabled. Add a new optional field (of appropriate secrets context type) to the TestApiBuilder struct definition, implement a with_secrets_context builder method that allows setting this field, and ensure the field is properly passed through and used when the Api is constructed so that tests can exercise both enabled and disabled cases for the secrets/re-wrap path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/admin-cli/src/cfg/cli_options.rs`:
- Around line 205-206: The new Secrets variant added to the CLI options enum in
cli_options.rs at lines 205–206 lacks a corresponding domain mapping entry in
cli_domains.yaml, which will cause the every_command_has_a_domain test to fail
during CI. Add a secrets entry under the appropriate domain list (admin or
tenant, based on the operational context of the secrets command) in
cli_domains.yaml to ensure the command is properly registered in the domain
configuration.
In `@crates/api-core/src/cfg/file.rs`:
- Around line 863-891: Add the `#[serde(deny_unknown_fields)]` attribute to the
SecretsConfig struct to reject misspelled configuration keys like `import_fom`
instead of silently ignoring them. Additionally, apply the same
`#[serde(deny_unknown_fields)]` attribute to the KmsConfig struct (located
around lines 903-910) for consistency. Finally, add a regression test that
verifies the configuration parser correctly rejects a misspelled optional field
(for example, `import_fom` instead of `import_from`) in the `[secrets]` section,
ensuring startup validation prevents such typos.
In `@crates/api-core/src/run.rs`:
- Around line 515-560: The code holds the transaction lock_txn (which contains a
pooled connection and advisory lock) while performing long-running operations
like vault_client.get_secrets_strict().await and
crate::secrets::import_secrets().await. This creates a deadlock risk during
concurrent replica startup where other waiters can block on the same lock while
occupying remaining pool connections. Restructure the code to use short
transactions: acquire the lock and check completion status in one short
transaction, then release it before calling
vault_client.get_secrets_strict().await and the import_secrets function, and
finally use another short transaction (reusing the same connection if possible,
or using a distinct lock key) to mark the import as complete. This applies at
the location shown (lines 515-560, specifically around the vault enumeration and
import_secrets calls) and also at lines 569-576 where similar long-running
operations may be awaited while holding a transaction.
In `@crates/api-core/src/secrets/re_wrap.rs`:
- Around line 72-88: The re-wrap operation holds a lock transaction from the
pool while the loop in `find_batch_after` and subsequent batch operations try to
acquire additional connections, causing potential deadlock under pool
saturation. Replace the transaction-based locking approach with a single
dedicated connection obtained from the pool at the start. Use a session advisory
lock (via db::secrets::try_lock_path) on that dedicated connection instead of a
transaction, then execute all scan and update operations on that same connection
throughout the loop to avoid holding the transaction open across awaited work.
Apply the same fix at the second location mentioned (lines 126-129) where
similar transaction holding occurs.
In `@crates/api-core/src/secrets/routing.rs`:
- Around line 65-91: In the from_config method's validation loop (where prefix
and kek_id are validated), add a check to reject any prefix that starts with "/"
unless it matches the CATCH_ALL constant. This prevents prefixes like
"/machines/bmc" from being silently misrouted to the catch-all, which could
cause writes to be encrypted under the wrong KEK. Add this validation check
after the empty prefix check and before the normalized prefix processing.
In `@crates/api-db/migrations/20260416000305_create_secrets_table.sql`:
- Around line 21-22: The `seq` column in the migration file uses `BIGINT
GENERATED ALWAYS AS IDENTITY` but is missing a `UNIQUE` constraint. Since
PostgreSQL's GENERATED ALWAYS AS IDENTITY does not enforce uniqueness by itself,
duplicates can occur from sequence resets or manual insertions, breaking the
pagination contract used by functions like `find_batch_after` that rely on `seq
> $1` ordering. Add the `UNIQUE` constraint to the `seq` column definition in
the CREATE TABLE statement to ensure deterministic batch ordering and pagination
behavior.
In `@crates/api-integration-tests/tests/secrets_import.rs`:
- Around line 31-35: The allocate_port() function at lines 31-35 releases the
TcpListener immediately after binding, creating a race condition where another
process can claim the port before it's used at lines 237-238. Fix this by either
keeping the listener alive and passing it to the Vault startup code instead of
just the port number, or by implementing a retry mechanism that attempts to bind
to the port immediately before starting Vault (lines 237-238), retrying if the
port becomes unavailable due to race conditions. This ensures the port
reservation and Vault startup happen atomically without releasing the binding in
between.
- Around line 216-219: The call to db::migrations::MIGRATOR.run() uses .expect()
which causes a panic that bypasses the cleanup logic if migrations fail, leaking
temporary databases. Replace the .expect() call with a fallible pattern using
the ? operator, and restructure the code to ensure the drop/cleanup logic
(currently at lines 225-228) always executes after setup and exercise,
regardless of whether migrations succeed or fail. This typically means using a
guard pattern or ensuring cleanup runs in a finally-like block that is
guaranteed to execute even when earlier operations return errors.
- Around line 212-214: The string concatenation used to build test_db_url at
line 212 breaks when the original db_url contains a database path or query
parameters. Instead of using format!("{db_url}/{test_db_name}"), parse the
db_url into its connection components using the appropriate PostgreSQL URL
parsing utilities, modify the database name field in the parsed options to use
test_db_name, and then reconstruct the connection URL from the parsed options.
This ensures existing query parameters and path components in the original URL
are preserved when building the test database connection string.
In `@crates/rpc/proto/forge.proto`:
- Around line 8550-8554: The `batch_size` field in the `ReWrapSecretsRequest`
message lacks documentation about its validation contract. Update the comment
for the `batch_size` field to explicitly document the semantics for when the
field is unset (its default behavior), what happens if zero is provided, and any
upper-bound constraints the server enforces. This ensures that generated clients
understand the expected validation rules and can properly handle edge cases.
- Around line 946-947: In the crates/rpc/build.rs file, add explicit serde
Serialize annotations for the ReWrapSecretsRequest and ReWrapSecretsResponse
message types using the .type_attribute() method calls, similar to how
comparable gRPC Request/Response types like ListOsImageRequest and
GetBmcCredentialsRequest are configured. This will ensure that these types have
the necessary Serialize trait implementation required by the log_request_data()
function called in the handler at crates/api-core/src/handlers/secrets.rs,
enabling proper request logging and serialization.
---
Nitpick comments:
In `@crates/admin-cli/src/secrets/re_wrap/args.rs`:
- Around line 20-27: The Args struct in
crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.
In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs`:
- Around line 22-26: The RPC call to re_wrap_secrets in the re_wrap function
does not wrap its error with operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.
In `@crates/api-core/Cargo.toml`:
- Around line 57-61: The internal dependencies in the Cargo.toml file are not in
alphabetical order. Move the carbide-kms-provider dependency to its correct
alphabetical position, which should be before carbide-redfish, not after
carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish,
carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they
appear in strict alphabetical order by their crate names.
In `@crates/api-core/src/test_support/builder.rs`:
- Around line 51-65: The TestApiBuilder struct currently hard-codes
secrets_context to None, preventing tests from building an Api with secrets
enabled. Add a new optional field (of appropriate secrets context type) to the
TestApiBuilder struct definition, implement a with_secrets_context builder
method that allows setting this field, and ensure the field is properly passed
through and used when the Api is constructed so that tests can exercise both
enabled and disabled cases for the secrets/re-wrap path.
In `@crates/api-db/src/secrets.rs`:
- Around line 124-130: Replace the query pattern in both the `try_lock_path` and
`exists` functions that use `sqlx::query_as` to fetch a single boolean value
wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use
`sqlx::query_scalar` which directly fetches the scalar boolean value without
requiring tuple unpacking. This makes the code clearer and more idiomatic for
single scalar value queries.
In `@crates/secrets/src/forge_vault.rs`:
- Around line 770-780: The EnumerationMode enum derives Clone, Copy, and
PartialEq, but since it only contains fieldless variants, it should also derive
Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to
the derive attribute list for the EnumerationMode enum to keep it clippy-clean
and avoid warnings.
- Around line 1105-1108: The error returned by vault_config.token()? when
building the VaultClientSettingsBuilder is too generic and doesn't clarify the
root cause. Enhance the error handling around the vault_config.token()? call to
provide a more explicit error message that indicates static-token authentication
is required but unsupported for Transit KMS on service-account-auth deployments,
so users understand the authentication mode limitation rather than treating it
as a missing generic Vault setting.
- Around line 833-835: The warning logs in the vault path enumeration are
interpolating the error `e` directly into the message string using `{e}`, which
makes error filtering difficult for logfmt consumers. Refactor the
tracing::warn! calls to pass the error as a structured attribute instead of
including it in the message string. Remove the `{e}` from the message and add
`error = %e` (or similar attribute syntax) as a parameter to the tracing::warn!
macro, keeping the existing attributes like `prefix`. This change needs to be
applied at the location shown (lines 833-835 in the failed to list vault path
warning) and also at the similar warning around lines 937-939 as mentioned in
the comment.
In `@crates/secrets/src/lib.rs`:
- Around line 23-27: The comment above the re-export block in crate::forge_vault
states that the exports are only for CertificateProvider usage, but this is no
longer accurate now that create_raw_vault_client_settings is included in the
public API. Broaden the comment to accurately describe the public API intent for
all re-exported items (ForgeVaultClient, VaultConfig,
create_raw_vault_client_settings, and create_vault_client), making it clear that
these are exposed for multiple purposes beyond just CertificateProvider usage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9bd7cccb-fd3d-47e1-9d17-4ccbe5fee75a
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (41)
crates/admin-cli/src/cfg/cli_options.rscrates/admin-cli/src/main.rscrates/admin-cli/src/secrets/mod.rscrates/admin-cli/src/secrets/re_wrap/args.rscrates/admin-cli/src/secrets/re_wrap/cmd.rscrates/admin-cli/src/secrets/re_wrap/mod.rscrates/api-core/Cargo.tomlcrates/api-core/src/api.rscrates/api-core/src/auth/internal_rbac_rules.rscrates/api-core/src/cfg/file.rscrates/api-core/src/handlers/mod.rscrates/api-core/src/handlers/secrets.rscrates/api-core/src/lib.rscrates/api-core/src/run.rscrates/api-core/src/secrets/import.rscrates/api-core/src/secrets/metrics.rscrates/api-core/src/secrets/mod.rscrates/api-core/src/secrets/re_wrap.rscrates/api-core/src/secrets/routing.rscrates/api-core/src/secrets/tests.rscrates/api-core/src/setup.rscrates/api-core/src/test_support/builder.rscrates/api-core/src/test_support/default_config.rscrates/api-db/migrations/20260416000305_create_secrets_table.sqlcrates/api-db/src/lib.rscrates/api-db/src/secrets.rscrates/api-integration-tests/Cargo.tomlcrates/api-integration-tests/tests/secrets_import.rscrates/api-model/src/lib.rscrates/api-model/src/secrets.rscrates/kms-provider/Cargo.tomlcrates/kms-provider/src/crypto.rscrates/kms-provider/src/lib.rscrates/kms-provider/src/providers/integrated.rscrates/kms-provider/src/providers/multi.rscrates/kms-provider/src/providers/transit.rscrates/rpc/proto/forge.protocrates/secrets/src/forge_vault.rscrates/secrets/src/lib.rscrates/uuid/src/lib.rscrates/uuid/src/secret.rs
| /// allocate_port picks a free port by binding to port 0. | ||
| fn allocate_port() -> SocketAddr { | ||
| let listener = TcpListener::bind("127.0.0.1:0").expect("bind to free port"); | ||
| listener.local_addr().expect("local addr") | ||
| } |
There was a problem hiding this comment.
Port reservation is racy and can cause intermittent test failures.
Line 31-35 releases the listener before Line 238 starts Vault, so another process can claim the port in between. This is a classic TOCTOU race and a common CI flake source for integration tests. Prefer a helper that binds and starts atomically (or retries on bind conflicts).
As per coding guidelines, integration tests should be deterministic and avoid flaky behavior.
Also applies to: 237-238
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/api-integration-tests/tests/secrets_import.rs` around lines 31 - 35,
The allocate_port() function at lines 31-35 releases the TcpListener immediately
after binding, creating a race condition where another process can claim the
port before it's used at lines 237-238. Fix this by either keeping the listener
alive and passing it to the Vault startup code instead of just the port number,
or by implementing a retry mechanism that attempts to bind to the port
immediately before starting Vault (lines 237-238), retrying if the port becomes
unavailable due to race conditions. This ensures the port reservation and Vault
startup happen atomically without releasing the binding in between.
Source: Coding guidelines
There was a problem hiding this comment.
@coderabbitai Fair -- it is a real TOCTOU, but the atomic bind-and-launch would have to live in how api_test_helper spawns the external Vault process, not in this test, so it's out of scope for this PR. Leaving as-is. Thanks for checking!
There was a problem hiding this comment.
@chet, understood — the atomic bind-and-launch fix is appropriately scoped to the api_test_helper process spawner, not this test file. Happy to leave it as-is here.
Would you like me to open a follow-up GitHub issue to track the api_test_helper improvement so it is not lost?
Carbide can now keep its credentials in Postgres instead of Vault, encrypted so that a copy of the database on its own gives up nothing. This swaps out Vault KV as the credential store; Vault still issues the PKI certificates. The encryption is layered. Every credential gets its own data key, and that data key is wrapped by a key encryption key (KEK) that lives outside the database -- in carbide's own config or an external KMS. The credential's path is mixed into the encryption as well, so a row lifted out and dropped onto another path will not decrypt. The KEK is the only thing an operator has to protect; everything else lives in Postgres. `PostgresCredentialManager` implements the same `CredentialManager` traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which is what the upcoming rotation work uses for history and rollback. Where the wrapping keys come from is pluggable through `KmsBackend`: local key material (`integrated`), or Vault/OpenBao Transit, and more than one provider at once. `[secrets.routing]` maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails the boot instead of some write hours later. Moving an existing site off Vault is a one-time import at startup, and it is deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with `[secrets]` set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on `SecretsConfig`. Rotating a KEK is a config change plus one command: point the route at the new key and run `carbide-admin-cli secrets re-wrap`. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove. Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
b171e20 to
db91806
Compare
Description
This supports #353.
This introduces a new secrets/credentials storage backend of sorts, such that NICo can now keep its credentials in Postgres instead of Vault, encrypted (with envelope encryption, leaning on #939) so that a copy of the database on its own gives up nothing. This entirely feature/flagged and configurable, currently not enabled at all, but will let us:
The encryption is layered. Every credential gets its own data encryption key (DEK), and that DEK is wrapped by a key encryption key (KEK) that lives outside the database -- in NICo itself or whatever
KmsProviderwe choose. The credential's path is mixed into the encryption as well. The KEK is the only thing an operator has to protect; everything else lives in Postgres (some reference here and here.PostgresCredentialManagerimplements the sameCredentialManagertraits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which the rotation work going on in #367 can use for history, rollback, rotation, etc.Where the wrapping keys come from is pluggable through
KmsBackend: local key material (integrated), or Vault/OpenBao Transit secrets engine API, and more than one provider at once.[secrets.routing]maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails at startup.Moving an existing site off Vault could be:
[secrets]set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out onSecretsConfig.CredentialWriter, such that all new credentials get written to Postgres, with Vault still in theChainedCredentialReaderpath, allowing us to slowly migrate over.BUT, migration is NOT a part of this PR. This PR is for getting some initial code in place, and then we will iterate on it, and then do subsequent work for migration planning.
Rotating a KEK is a config change plus one command: point the route at the new key and run
carbide-admin-cli secrets re-wrap. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove.Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server.
Signed-off-by: Chet Nichols III chetn@nvidia.com
Related issues
Type of Change
Breaking Changes
Testing
Additional Notes