Skip to content

fix: Add dev mode for RLA service to enable certain options only for dev#360

Merged
jw-nvidia merged 1 commit intoNVIDIA:mainfrom
jw-nvidia:fix/security
Apr 14, 2026
Merged

fix: Add dev mode for RLA service to enable certain options only for dev#360
jw-nvidia merged 1 commit intoNVIDIA:mainfrom
jw-nvidia:fix/security

Conversation

@jw-nvidia
Copy link
Copy Markdown
Contributor

@jw-nvidia jw-nvidia commented Apr 8, 2026

Description

  • enable gRPC reflection only in dev mode
  • set log level to debug in dev mode, otherwise info level.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@jw-nvidia jw-nvidia requested a review from a team as a code owner April 8, 2026 16:05
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)

Walkthrough

Adds a --dev-mode flag to rla serve, propagates it into service Config, enables debug logging and gRPC reflection only in dev mode, introduces RLA_ENV resolution and startup validation (including TLS availability checks), updates README, and adds unit tests for validation.

Changes

Cohort / File(s) Summary
CLI & Flags
rla/cmd/serve.go, rla/cmd/root.go
Added --dev-mode boolean flag; doServe() sets zerolog to Debug when enabled, resolves deployment env and passes DevMode into svc.New; root command default port changed to defaultServicePort.
Config & Validation
rla/internal/service/config.go, rla/internal/service/config_test.go
Added EnvVarName = "RLA_ENV", GetDeploymentEnv(), DevMode bool on Config, and Config.Validate() enforcing allowed envs, forbidding DevMode outside development, and requiring TLS in staging/production. Added comprehensive table-driven tests for validation and precedence.
Service Startup & Scheduler
rla/internal/service/service.go
New invokes Config.Validate() and returns validation errors; Start registers gRPC reflection only when conf.DevMode is true; scheduler now only schedules inventory-sync and leak-detection when job instances are non-nil.
TLS helpers
rla/internal/certs/certs.go
Added IsTLSAvailable(pkgcerts.Config) bool to check existence of CA/tls cert/key either from explicit paths or CERTDIR without loading TLS configs.
Documentation
rla/README.md
Clarified local run instructions: require --dev-mode to enable gRPC reflection (and debug logging) and noted that grpcui/grpcurl examples assume server started with --dev-mode.
Tests
rla/internal/certs/certs_test.go, rla/internal/service/config_test.go
Added tests validating IsTLSAvailable behavior for various cert presence scenarios and table-driven tests covering Config.Validate() permutations, including dev-mode precedence and env validation.

Sequence Diagram(s)

sequenceDiagram
  participant User as User / CLI
  participant CLI as "rla serve"
  participant Env as Env Resolver
  participant Validator as Config Validator
  participant Service as Service.New / Start
  participant Logger as Zerolog
  participant Reflection as gRPC Reflection
  participant Scheduler as Scheduler
  participant Tool as "grpcui / grpcurl"

  User->>CLI: run `./rla serve [--dev-mode]`
  CLI->>Env: GetDeploymentEnv()
  Env-->>CLI: return env or error
  CLI->>Validator: Build Config (DevMode) -> Validate()
  Validator-->>CLI: success or error
  CLI->>Service: New(config) -> Start()
  Service->>Logger: set level (Debug if DevMode else Info)
  Service->>Reflection: register (only if DevMode)
  Service->>Scheduler: schedule jobs (only if job instances non-nil)
  Tool->>Service: request (requires reflection for grpcui/grpcurl)
  Reflection-->>Tool: provide descriptors (only when registered)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main objective: adding dev mode to enable gRPC reflection and debug logging conditionally, which is the central theme across all modified files.
Description check ✅ Passed The description is directly related to the changeset, clearly articulating the two primary goals: conditional gRPC reflection and debug-level logging in dev mode.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rla/internal/service/config.go`:
- Around line 50-53: Add a fail-fast validation that enforces the comment on
DevMode: implement a Config.Validate() method (or add validation in the service
initialization path that constructs service.Config) which checks the DevMode
bool and returns/throws an error when DevMode is true while the runtime
indicates a production environment; update the service constructor (e.g.,
NewService or InitService) to call Config.Validate() and abort startup on
validation failure so reflection/debug options cannot be enabled in production.
Ensure the validation uses existing environment/flags used by the project to
determine "production" and include a clear error message referencing
service.Config.DevMode in the returned error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a04cbcf4-0629-4703-81ff-87ee72a97965

📥 Commits

Reviewing files that changed from the base of the PR and between da1d558 and 09aa245.

📒 Files selected for processing (4)
  • rla/README.md
  • rla/cmd/serve.go
  • rla/internal/service/config.go
  • rla/internal/service/service.go
✅ Files skipped from review due to trivial changes (2)
  • rla/internal/service/service.go
  • rla/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • rla/cmd/serve.go

Comment thread rla/internal/service/config.go
@jw-nvidia jw-nvidia changed the title Fix: Add dev mode for RLA service to enable certain options only for dev fix: Add dev mode for RLA service to enable certain options only for dev Apr 8, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
rla/internal/service/config_test.go (1)

29-33: Make TLS-present fixtures use real temp files instead of non-empty path strings.

tlsConfig() currently treats “present” as string-populated paths only. Creating actual files in t.TempDir() will keep these tests valid if TLS availability checks are strengthened to verify file existence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/service/config_test.go` around lines 29 - 33, Replace the
current tlsConfig that returns non-existent path strings with one that creates
real temp files and returns their paths: use t.TempDir() inside the test to
create a temp directory, create empty files named "ca.crt", "tls.crt", and
"tls.key" there, and return pkgcerts.Config with CACert, TLSCert and TLSKey set
to those real file paths (update the tlsConfig helper or inline this logic where
used so tests create and clean up real files before asserting IsTLSAvailable).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rla/internal/certs/certs.go`:
- Around line 52-63: The IsTLSAvailable logic returns false positives because
c.IsSet() only checks for non-empty config values and the function only verifies
existence of defaultCACert; update IsTLSAvailable (and/or c.IsSet()) to require
that all three certificate files exist: filepath.Join(certDir, defaultCACert),
filepath.Join(certDir, defaultTLSCert), and filepath.Join(certDir,
defaultTLSKey) (or if c points to explicit paths, verify those files exist
instead of just non-empty strings). Modify the function to compute certDir
(using defaultCertDir when env empty), then stat each required file and return
true only if all three os.Stat calls succeed (err == nil), otherwise return
false.

---

Nitpick comments:
In `@rla/internal/service/config_test.go`:
- Around line 29-33: Replace the current tlsConfig that returns non-existent
path strings with one that creates real temp files and returns their paths: use
t.TempDir() inside the test to create a temp directory, create empty files named
"ca.crt", "tls.crt", and "tls.key" there, and return pkgcerts.Config with
CACert, TLSCert and TLSKey set to those real file paths (update the tlsConfig
helper or inline this logic where used so tests create and clean up real files
before asserting IsTLSAvailable).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 51c9b2c9-e9b7-47b6-b45e-e29e238a5160

📥 Commits

Reviewing files that changed from the base of the PR and between 09aa245 and 26c33c8.

📒 Files selected for processing (7)
  • rla/README.md
  • rla/cmd/root.go
  • rla/cmd/serve.go
  • rla/internal/certs/certs.go
  • rla/internal/service/config.go
  • rla/internal/service/config_test.go
  • rla/internal/service/service.go
✅ Files skipped from review due to trivial changes (1)
  • rla/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • rla/cmd/serve.go
  • rla/internal/service/service.go

Comment thread rla/internal/certs/certs.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rla/internal/service/config.go`:
- Around line 83-85: The DevMode field's comment is inconsistent with validation
that also forbids staging; update the DevMode docstring to reflect the actual
rule (e.g., "Development only — not allowed in staging or production" or
similar) and make the same wording change for any other DevMode comments in this
file so the documentation matches the runtime validation that rejects staging
and production.
- Around line 112-115: Add a fail-fast check that rejects a partially populated
explicit CertConfig before calling certs.IsTLSAvailable: examine c.CertConfig's
explicit-path fields (the explicit certificate path fields on the CertConfig
struct) and if any are set but not all required ones are set, return an error
stating the explicit-path config is incomplete; only if the explicit-path fields
are either all empty or all present should you proceed to call
certs.IsTLSAvailable(c.CertConfig) for fallback probing (this check should live
immediately before the existing envStaging/envProduction TLS guard that uses
certs.IsTLSAvailable).
- Around line 40-43: The code currently defaults missing RLA_ENV to
"development" which makes TLS checks in Validate() opt-out; change the behavior
so that when EnvVarName (RLA_ENV) is not set the service treats that as an error
(fail closed) instead of defaulting to "development" — update the code that
reads EnvVarName and any helper that maps empty -> "development" to return an
explicit error or require the caller to set the env, and adjust any callers of
Validate() to surface that error; keep a separate local/dev entrypoint (not in
rla/internal/service/config.go) to set a safe default of "development" for local
runs if needed. Ensure references: EnvVarName, the function that reads/parses
the environment var (the current mapping block around lines ~58-61), and
Validate() are updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 05ef899e-010e-48d4-8d00-8db916b62490

📥 Commits

Reviewing files that changed from the base of the PR and between 26c33c8 and d16f285.

📒 Files selected for processing (7)
  • rla/README.md
  • rla/cmd/root.go
  • rla/cmd/serve.go
  • rla/internal/certs/certs.go
  • rla/internal/service/config.go
  • rla/internal/service/config_test.go
  • rla/internal/service/service.go
✅ Files skipped from review due to trivial changes (2)
  • rla/README.md
  • rla/cmd/root.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • rla/internal/certs/certs.go
  • rla/cmd/serve.go
  • rla/internal/service/service.go

Comment thread rla/internal/service/config.go
Comment thread rla/internal/service/config.go
Comment thread rla/internal/service/config.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
rla/internal/certs/certs_test.go (1)

155-222: Add a precedence regression for explicit paths over CERTDIR.

These cases prove the explicit-path branch and the CERTDIR branch separately, but they never assert that a broken fully populated pkgcerts.Config does not fall back to a valid CERTDIR. That matters because rla/internal/service/config.go Lines 122-124 use IsTLSAvailable as the staging/production startup gate.

💡 Suggested subtest
 	t.Run("explicit paths: tls.key missing", func(t *testing.T) {
 		dir := stubCerts(t, defaultCACert, defaultCertFile)
 		c := pkgcerts.Config{
 			CACert:  filepath.Join(dir, defaultCACert),
 			TLSCert: filepath.Join(dir, defaultCertFile),
 			TLSKey:  filepath.Join(dir, defaultKeyFile),
 		}
 		assert.False(t, IsTLSAvailable(c))
 	})
+
+	t.Run("explicit paths take precedence over CERTDIR", func(t *testing.T) {
+		badDir := stubCerts(t, defaultCACert, defaultCertFile) // tls.key absent
+		goodDir := stubCerts(t, defaultCACert, defaultCertFile, defaultKeyFile)
+		t.Setenv("CERTDIR", goodDir)
+
+		c := pkgcerts.Config{
+			CACert:  filepath.Join(badDir, defaultCACert),
+			TLSCert: filepath.Join(badDir, defaultCertFile),
+			TLSKey:  filepath.Join(badDir, defaultKeyFile),
+		}
+		assert.False(t, IsTLSAvailable(c))
+	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/certs/certs_test.go` around lines 155 - 222, Add a regression
test that verifies explicit paths in pkgcerts.Config take precedence over
CERTDIR: in certs_test.go add a subtest (e.g., "explicit paths take precedence
over CERTDIR") that stubs a valid cert directory and sets t.Setenv("CERTDIR",
dir) but constructs a pkgcerts.Config whose explicit CACert/TLSCert/TLSKey point
to missing/invalid files, then assert IsTLSAvailable(config) is false; reference
IsTLSAvailable and pkgcerts.Config so the test ensures a broken explicit config
does not fall back to a valid CERTDIR.
rla/internal/service/config_test.go (1)

54-57: The “unset env” rows are only testing the empty-string path.

Line 231 always calls t.Setenv, so the cases named “unset env” never exercise a truly missing RLA_ENV. Either rename those rows to “empty env” or add one real unset-path case so the table matches the contract in GetDeploymentEnv().

Also applies to: 226-231

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/service/config_test.go` around lines 54 - 57, The table-driven
tests name some cases "unset env" but the loop always calls t.Setenv, so those
never exercise a truly missing RLA_ENV; update the test table/loop for
GetDeploymentEnv(): either rename the rows labeled "unset env" to "empty env",
or add an explicit unset path by introducing a field (e.g., unsetEnv bool) in
the test struct and, in the test loop, call t.Setenv("RLA_ENV", value) only when
unsetEnv is false (otherwise call os.Unsetenv("RLA_ENV") or
t.Unsetenv("RLA_ENV") so the env is truly absent); make the same change for the
other occurrences referenced around the test loop (the code that currently
always calls t.Setenv).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@rla/internal/certs/certs_test.go`:
- Around line 155-222: Add a regression test that verifies explicit paths in
pkgcerts.Config take precedence over CERTDIR: in certs_test.go add a subtest
(e.g., "explicit paths take precedence over CERTDIR") that stubs a valid cert
directory and sets t.Setenv("CERTDIR", dir) but constructs a pkgcerts.Config
whose explicit CACert/TLSCert/TLSKey point to missing/invalid files, then assert
IsTLSAvailable(config) is false; reference IsTLSAvailable and pkgcerts.Config so
the test ensures a broken explicit config does not fall back to a valid CERTDIR.

In `@rla/internal/service/config_test.go`:
- Around line 54-57: The table-driven tests name some cases "unset env" but the
loop always calls t.Setenv, so those never exercise a truly missing RLA_ENV;
update the test table/loop for GetDeploymentEnv(): either rename the rows
labeled "unset env" to "empty env", or add an explicit unset path by introducing
a field (e.g., unsetEnv bool) in the test struct and, in the test loop, call
t.Setenv("RLA_ENV", value) only when unsetEnv is false (otherwise call
os.Unsetenv("RLA_ENV") or t.Unsetenv("RLA_ENV") so the env is truly absent);
make the same change for the other occurrences referenced around the test loop
(the code that currently always calls t.Setenv).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bcde312f-54d8-4113-9a03-240e03ed1a9a

📥 Commits

Reviewing files that changed from the base of the PR and between d16f285 and ce72400.

📒 Files selected for processing (8)
  • rla/README.md
  • rla/cmd/root.go
  • rla/cmd/serve.go
  • rla/internal/certs/certs.go
  • rla/internal/certs/certs_test.go
  • rla/internal/service/config.go
  • rla/internal/service/config_test.go
  • rla/internal/service/service.go
✅ Files skipped from review due to trivial changes (3)
  • rla/cmd/root.go
  • rla/README.md
  • rla/internal/certs/certs.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • rla/internal/service/service.go
  • rla/cmd/serve.go

@jw-nvidia
Copy link
Copy Markdown
Contributor Author

/ok to test 52c3862

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 9, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-09 15:45:15 UTC | Commit: 52c3862

Comment thread rla/internal/certs/certs.go
- if dev mode is enabled, enable gRPC reflection and set log level to be debug
- add validations:
  - dev mode can be enabled only on development deployment env
  - TLS must be available on production and staging depolymnet env
- skip registering jobs with scheduler if the required services are not available

Signed-off-by: Jin Wang <jinwan@nvidia.com>
@jw-nvidia
Copy link
Copy Markdown
Contributor Author

/ok to test 295f26f

@jw-nvidia jw-nvidia merged commit fcf7893 into NVIDIA:main Apr 14, 2026
51 checks passed
@jw-nvidia jw-nvidia deleted the fix/security branch April 14, 2026 16:51
Comment thread rla/cmd/serve.go

if os.Getenv(svc.EnvVarName) == "" {
log.Warn().Msgf("%s not set, defaulting to %q for local development", svc.EnvVarName, "development")
os.Setenv(svc.EnvVarName, "development") //nolint:errcheck
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended?
This bypass the design of GetDeploymentEnv() fail-closed design when no RLA_ENV set.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least set this env when devMode == true && RLA_ENV == ""?

}

// Rule 3: staging and production require TLS.
if (env == envStaging || env == envProduction) && !certs.IsTLSAvailable(c.CertConfig) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate() calls certs.IsTLSAvailable() to gate staging/production startup. But certOption() has its own ALLOW_INSECURE_GRPC=true escape hatch that bypasses TLS entirely and logs a Fatal otherwise. In development with ALLOW_INSECURE_GRPC=true, the Validate TLS check passes (development doesn't require TLS), so this works. But the two-layer TLS check could be confusing for future maintainers — Validate and certOption express the same policy in different places with different mechanisms.

}
if err := sched.Schedule(invJob, invTrigger, schedtypes.Skip); err != nil {
return fmt.Errorf("failed to schedule inventory sync job: %w", err)
if invJob != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change wraps scheduler job registration in if invJob != nil / if leakJob != nil. If the jobs returning nil without error from their constructors means a misconfigured provider silently disables the job. A log.Warn() when a job is nil would help operators diagnose why inventory sync or leak detection isn't running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants