Implement fleet-server and elastic-agent support for presenting client certificates to elasticsearch#9234
Conversation
✅ Vale Linting ResultsNo issues found on modified lines! The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale. |
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
b063c40 to
bae25c6
Compare
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.1,s=8.19.2,s=9.2.6 |
a1e1dda to
9281356
Compare
bae25c6 to
7f3be6d
Compare
9281356 to
9cea7f4
Compare
7f3be6d to
2d54323
Compare
9cea7f4 to
2ebd22b
Compare
2d54323 to
39f0764
Compare
2ebd22b to
2f6b198
Compare
39f0764 to
0bf556c
Compare
2f6b198 to
0191bcc
Compare
0bf556c to
2a707bb
Compare
faf43c5 to
308d97a
Compare
2a707bb to
5d29636
Compare
308d97a to
d1e6a75
Compare
5d29636 to
65b9ae2
Compare
🔍 Preview links for changed docs |
65b9ae2 to
098f7f6
Compare
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.2,s=8.19.13,s=9.2.7 |
098f7f6 to
f253875
Compare
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.2,s=8.19.13,s=9.2.7 |
5e66fd4 to
267d115
Compare
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.2.8 |
| return nil, nil | ||
| } | ||
|
|
||
| ref := esAssociation.AssociationRef() |
There was a problem hiding this comment.
please ignore if this does not make sense, will this ref be a namespaced Name of Elasticsearch object or can it also return a secret (external reference, unmanaged Elasticsearch connection secret) ? asking because below we assume this returns an Elasticsearch object
There was a problem hiding this comment.
nice catch!! yes that was a gap and I fixed it in 4a3c2bc thx
pebrc
left a comment
There was a problem hiding this comment.
The PR adds a well-structured mTLS client certificate system for fleet-server and elastic-agent connections to Elasticsearch. The core certificate lifecycle (generation, rotation, trust bundle assembly, cleanup) builds cleanly on top of the existing certificates.Reconciler infrastructure. The transitive association chain (Agent → Fleet Server → ES) is handled through a clear callback mechanism (ReconcileTransitiveESSecrets) that avoids coupling the generic association reconciler to agent-specific logic.
Integration testing on a live cluster confirmed all three main paths work correctly: standalone agent with managed certs, fleet-managed agent with transitive cert propagation + Kibana ssl injection, and clean teardown when client auth is disabled.
Key suggestions:
-
External ES ref safety —
fleetManagedAgentTransitiveESRefdoes ac.Geton the ESNamespacedNamewithout checkingref.IsExternal(), which would fail for external ES references. (medium) -
Kibana config injection style — The
injectFleetOutput*functions use raw ucfg manipulation rather than the typed struct Unpack/MergeWith pattern used everywhere else inconfig_settings.go. This works but is inconsistent and harder to maintain. (low) -
URL matching —
outputHostsContainuses exact string equality. A trailing slash or any other equivalent-but-not-identical URL form would silently skip injection with no log output. (low)
| func injectFleetOutputClientCerts(cfg *settings.CanonicalConfig, esURL string) error { | ||
| ucfgCfg := (*ucfg.Config)(cfg) | ||
| opts := settings.Options | ||
|
|
||
| for i := 0; ; i++ { | ||
| output, childErr := ucfgCfg.Child("xpack.fleet.outputs", i, opts...) | ||
| if childErr != nil { | ||
| // childErr signals end of indexed children, not a real error | ||
| break | ||
| } | ||
|
|
||
| outputType, err := output.String("type", -1, opts...) | ||
| if err != nil || outputType != "elasticsearch" { | ||
| continue | ||
| } | ||
|
|
||
| if !outputHostsContain(output, esURL) { | ||
| continue | ||
| } | ||
|
|
||
| if has, _ := output.Has("ssl.certificate", -1, opts...); !has { | ||
| if err := output.SetString("ssl.certificate", -1, path.Join(agent.FleetManagedAgentClientCertDir, | ||
| certificates.CertFileName), opts...); err != nil { | ||
| return err | ||
| } | ||
| } | ||
| if has, _ := output.Has("ssl.key", -1, opts...); !has { | ||
| if err := output.SetString("ssl.key", -1, path.Join(agent.FleetManagedAgentClientCertDir, | ||
| certificates.KeyFileName), opts...); err != nil { | ||
| return err | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return nil //nolint:nilerr | ||
| } |
There was a problem hiding this comment.
suggestion (style): Both injectFleetOutputClientCerts and injectFleetOutputCertificateAuthorities use raw (*ucfg.Config) pointer casts with manual Child()/Has()/SetString() calls. This is the only place in config_settings.go that manipulates ucfg at this level — every other config section uses the idiomatic pattern of typed Go structs + Unpack/MustCanonicalConfig/MergeWith.
Consider aligning with the existing pattern, e.g.:
type fleetOutputSSL struct {
Certificate string `config:"certificate,omitempty"`
Key string `config:"key,omitempty"`
CertificateAuthorities []string `config:"certificate_authorities,omitempty"`
}
type fleetOutput struct {
Type string `config:"type"`
Hosts []string `config:"hosts"`
SSL *fleetOutputSSL `config:"ssl,omitempty"`
}
type fleetOutputsWrapper struct {
Outputs []fleetOutput `config:"xpack.fleet.outputs"`
}Then unpack, mutate, merge back. This would also make the URL matching more obvious and testable on plain Go structs, and would be consistent with the approach PR #9127 already takes for its fleet output structs.
There was a problem hiding this comment.
I don't think that the typed approach is gonna work out of the box here. Specifically when round-tripping through unpack → mutate → MergeWith, nil pointer fields leak as null and nil slices as [] in the rendered YAML. This is the same class of issue we hit in #8917 where ucfg replaces empty maps with nil during Unpack, causing downstream config rejection.
It's worth noting that #9127 takes the typed struct approach but for constructing new outputs from scratch, whereas this PR needs to selectively inject fields into existing user-provided outputs. I think the typed struct approach is what we want in general, but given the challenges of utilizing ucfg without introducing errors when modifying existing configs, I propose deferring this to a follow-up and sticking with raw ucfg here.
unit-test that proves the nil leaking
func Test_setting(t *testing.T) {
type SSL struct {
Certificate *string `config:"certificate"`
Key *string `config:"key"`
CertificateAuthorities []string `config:"certificate_authorities"`
}
type Output struct {
Type string `config:"type"`
Hosts []string `config:"hosts"`
SSL *SSL `config:"ssl"`
Rest map[string]any `config:",inline"`
}
type Wrapper struct {
Outputs []Output `config:"xpack.fleet.outputs"`
}
// Start with a config that has extra fields (id, name, is_default)
cfg := settings.MustCanonicalConfig(map[string]any{
"xpack.fleet.outputs": []any{
map[string]any{
"type": "elasticsearch",
"hosts": []any{"https://es.default.svc:9200"},
"id": "eck-fleet-agent-output-elasticsearch",
"name": "eck-elasticsearch",
"is_default": true,
"is_default_monitoring": true,
},
map[string]any{
"type": "elasticsearch",
"hosts": []any{"https://es.default.svc:9200"},
"id": "eck-fleet-agent-output-elasticsearch2",
"name": "eck-elasticsearch",
"is_default": true,
"is_default_monitoring": true,
},
},
})
// Unpack into typed struct
var wrapper Wrapper
require.NoError(t, cfg.Unpack(&wrapper))
// Verify unknown fields are captured in Rest
require.Equal(t, "elasticsearch", wrapper.Outputs[0].Type)
require.Contains(t, wrapper.Outputs[0].Rest, "id")
require.Contains(t, wrapper.Outputs[0].Rest, "name")
require.Contains(t, wrapper.Outputs[0].Rest, "is_default")
require.Contains(t, wrapper.Outputs[0].Rest, "is_default_monitoring")
require.Nil(t, wrapper.Outputs[0].SSL)
// Set only cert+key, leave CertificateAuthorities nil
certPath := "/path/to/cert"
_ = certPath
keyPath := "/path/to/key"
wrapper.Outputs[0].SSL = &SSL{Key: &keyPath}
// Remove old outputs and merge replacement
ucfgCfg := (*ucfg.Config)(cfg)
_, err := ucfgCfg.Remove("xpack.fleet.outputs", -1, settings.Options...)
require.NoError(t, err)
require.NoError(t, cfg.MergeWith(settings.MustCanonicalConfig(wrapper)))
render, err := cfg.Render()
require.NoError(t, err)
fmt.Println(string(render))
}PS: if I change Certificate *string config:"certificate"to Certificate string config:"certificate"this will now render as an empty string""` 🙂
There was a problem hiding this comment.
I have improved the url matching in c6da4e1
…nsitiveESClientCertSecrets and have it clean only es transitive client secrets
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.4.0-SNAPSHOT |
| esMutated := esBuilder.DeepCopy().WithMutatedFrom(&esBuilder) | ||
| esMutated.Elasticsearch.Spec.HTTP.TLS.Client.Authentication = false | ||
|
|
||
| esMutatedWrapped := test.WrappedBuilder{ |
There was a problem hiding this comment.
| esMutatedWrapped := test.WrappedBuilder{ | |
| esWithoutLicense := test.WrappedBuilder{ |
There was a problem hiding this comment.
Unless I am missing something, the mutation only disables spec.http.tls.client.authentication on ES - the enterprise license installed by LicenseTestBuilder remains active throughout the mutation phase. IMO renaming to esWithoutLicense would be misleading - happy to rename if you had something else in mind though.
| }, | ||
| }, | ||
| { | ||
| name: "conf is nil returns nil and cleans up", |
There was a problem hiding this comment.
Are we missing one case for the external ref?
|
|
||
| outputCanonical := (*settings.CanonicalConfig)(output) | ||
|
|
||
| if err := outputCanonical.AppendString("ssl.certificate_authorities", caPath); err != nil { |
There was a problem hiding this comment.
nit: Should we make this idempotent? To avoid repeated appending? The client cert version has guards. I know that appending the same CA multiple times is benign here.
There was a problem hiding this comment.
We've been here in the past; specifically, your comment on #9229 - the necessary code to guard against duplicate CAs increases the complexity without any actual gains and as you already mentioned appending the same CA multiple times is benign here. So I'm going to keep it the same just to be consistent.
PS: the client cert version doesn't have uniqueness guards per se, it checks if it's already set by the user otherwise it sets it.
There was a problem hiding this comment.
Not to nit-pick my own nit-pick but it looks like AppendString has in-built idempotency. In any case we are good.
| FleetServerPolicyID = "FLEET_SERVER_POLICY_ID" | ||
| FleetServerServiceToken = "FLEET_SERVER_SERVICE_TOKEN" //nolint:gosec | ||
|
|
||
| // FleetManagedAgentClientCertDir is the stable mount path for client certificates inside |
There was a problem hiding this comment.
Should we add a comment that this works because Fleet server is effectively constrained to a single ES output?
# Conflicts: # docs/reference/api-reference/main.md
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.4.0-SNAPSHOT |
|
buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.4.0-SNAPSHOT |
Update: Fleet Agent mTLS e2e tests - expected long runtime during rolling restartsHeads up on something I investigated regarding the new fleet agent client auth transition tests (
The What's happeningWhen an ES node receives SIGTERM, A Fleet deployment creates ~22 data streams from agent self-monitoring ( I confirmed this with DEBUG engine logging — the shutdown is entirely spent in flush operations across 4 threads spanning ~80 seconds per node. Even with The standalone agent tests are faster because they don't involve Fleet Server + Kibana + fleet-managed agents, so fewer data streams are created and the overall stack is simpler. What I tried locally
As shown here the e2e tests pass but I am unsure about the impact on the CI run time, @pebrc thoughts? 🙂 |
Lets accept the runtime increase in this PR and create dedicated issue that looks at our e2e tests holistically to see how we can speed up the runs. |
ok I created this issue |
Summary
This PR implements client certificate support for Fleet Server and Elastic Agent (both fleet-managed and standalone) when connecting to an Elasticsearch that has client authentication enabled. This PR handles direct ES associations and transitive associations (fleet-managed agents connecting to ES through Fleet Server).
Relates to #9081
Changes
ElasticsearchSelector: ChangesOutput.ObjectSelectortoOutput.ElasticsearchSelector, addingclientCertificateSecretNamesupport to Agent'selasticsearchRefsTransitiveESRefinAssociationConf: New struct to propagate transitive Elasticsearch association state (e.g., the client cert secret name for the ES that Fleet Server connects to) through to fleet-managed agentsELASTICSEARCH_CERT/ELASTICSEARCH_CERT_KEYenvironment variables for standalone agents with direct ES associationsFLEET_SERVER_ELASTICSEARCH_CERT/FLEET_SERVER_ELASTICSEARCH_CERT_KEYfor Fleet Server pods, and client cert volumes for fleet-managed agentspopulateFleetServerESConfig) to reduce nesting complexityssl.certificateandssl.keyinto Kibana'sxpack.fleet.outputsfor fleet-managed agent policies when mTLS is enabled. Similarly it does the same forssl.certificate_authoritieswhen a CA is provided.API