Implement fleet-server and elastic-agent support for presenting client certificates to elasticsearch #9234

Merged
pkoutsovasilis merged 15 commits into main from feat/mtls_fleet_elasticagent
Apr 22, 2026

Conversation

@pkoutsovasilis
Contributor

@pkoutsovasilis pkoutsovasilis commented Mar 16, 2026

Summary

This PR implements client certificate support for Fleet Server and Elastic Agent (both fleet-managed and standalone) when connecting to an Elasticsearch cluster that has client authentication enabled. It handles both direct ES associations and transitive associations (fleet-managed agents connecting to ES through Fleet Server).

Relates to #9081

Changes

  • Agent ElasticsearchSelector: Changes Output.ObjectSelector to Output.ElasticsearchSelector, adding clientCertificateSecretName support to Agent's elasticsearchRefs
  • TransitiveESRef in AssociationConf: New struct to propagate transitive Elasticsearch association state (e.g., the client cert secret name for the ES that Fleet Server connects to) through to fleet-managed agents
  • Fleet Server association controller: Extended to reconcile client certificates for the transitive ES association - when a fleet-managed agent associates with a Fleet Server, the controller looks up the Fleet Server's ES association conf and, if client authentication is required, creates a client certificate secret in the agent's namespace
  • Orphaned secret cleanup: Deletes stale transitive client certificate secrets when the Fleet Server's ES association changes or client authentication is disabled
  • Standalone Agent pod spec: Mounts client certificate volumes and configures ELASTICSEARCH_CERT / ELASTICSEARCH_CERT_KEY environment variables for standalone agents with direct ES associations
  • Fleet-managed Agent pod spec: Mounts the transitive client certificate secret and configures FLEET_SERVER_ELASTICSEARCH_CERT / FLEET_SERVER_ELASTICSEARCH_CERT_KEY for Fleet Server pods, and mounts client cert volumes for fleet-managed agents
  • Fleet Server env vars: Extracted ES config population into a helper function (populateFleetServerESConfig) to reduce nesting complexity
  • Kibana Fleet output injection: Injects ssl.certificate and ssl.key into Kibana's xpack.fleet.outputs for fleet-managed agent policies when mTLS is enabled, and likewise injects ssl.certificate_authorities when a CA is provided.
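
For illustration, the standalone-agent bullet above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PR's code: `envVar` is a simplified stand-in for a Kubernetes env var, `clientCertEnv` is a hypothetical helper, the mount directory is made up, and the `tls.crt` / `tls.key` file names are assumed rather than taken from this description.

```go
package main

import (
	"fmt"
	"path"
)

// Assumed file names inside the mounted secret; the real values live in the
// operator's certificates package and are not shown in this description.
const (
	certFileName = "tls.crt"
	keyFileName  = "tls.key"
)

// envVar is a simplified, hypothetical stand-in for a Kubernetes EnvVar.
type envVar struct {
	Name  string
	Value string
}

// clientCertEnv returns the environment variables that point a standalone
// agent at the client certificate and key found under mountDir.
func clientCertEnv(mountDir string) []envVar {
	return []envVar{
		{Name: "ELASTICSEARCH_CERT", Value: path.Join(mountDir, certFileName)},
		{Name: "ELASTICSEARCH_CERT_KEY", Value: path.Join(mountDir, keyFileName)},
	}
}

func main() {
	// The mount directory below is hypothetical.
	for _, e := range clientCertEnv("/mnt/es-client-cert") {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

The Fleet Server variant would presumably differ only in the variable names (FLEET_SERVER_ELASTICSEARCH_CERT / FLEET_SERVER_ELASTICSEARCH_CERT_KEY).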

API

  apiVersion: agent.k8s.elastic.co/v1alpha1
  kind: Agent
  spec:
    elasticsearchRefs:
      - name: elasticsearch
        outputName: default
        # Optional: use a custom client certificate
        clientCertificateSecretName: my-custom-client-cert

@pkoutsovasilis pkoutsovasilis self-assigned this Mar 16, 2026
@pkoutsovasilis pkoutsovasilis added the >feature Adds or discusses adding a feature to the product label Mar 16, 2026
@github-actions

github-actions Bot commented Mar 16, 2026

✅ Vale Linting Results

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

@prodsecmachine
Collaborator

prodsecmachine commented Mar 16, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| | Licenses | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch 2 times, most recently from b063c40 to bae25c6 Compare March 16, 2026 16:23
@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.1,s=8.19.2,s=9.2.6

@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch from a1e1dda to 9281356 Compare March 17, 2026 08:18
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from bae25c6 to 7f3be6d Compare March 17, 2026 08:18
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch from 9281356 to 9cea7f4 Compare March 17, 2026 12:25
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 7f3be6d to 2d54323 Compare March 17, 2026 12:25
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch from 9cea7f4 to 2ebd22b Compare March 18, 2026 09:55
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 2d54323 to 39f0764 Compare March 18, 2026 09:56
@pkoutsovasilis pkoutsovasilis changed the title implement support for presenting client certificates to fleet-server and elastic-agent implement fleet-server and elastic-agent support for presenting client certificates to elasticsearch Mar 18, 2026
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch from 2ebd22b to 2f6b198 Compare March 18, 2026 10:12
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 39f0764 to 0bf556c Compare March 18, 2026 10:19
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch from 2f6b198 to 0191bcc Compare March 19, 2026 15:31
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 0bf556c to 2a707bb Compare March 19, 2026 15:31
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch 2 times, most recently from faf43c5 to 308d97a Compare March 22, 2026 12:55
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 2a707bb to 5d29636 Compare March 22, 2026 16:44
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_association_kibana branch from 308d97a to d1e6a75 Compare April 1, 2026 09:58
Base automatically changed from feat/mtls_association_kibana to main April 2, 2026 07:30
@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 5d29636 to 65b9ae2 Compare April 2, 2026 10:27
@pkoutsovasilis pkoutsovasilis changed the title implement fleet-server and elastic-agent support for presenting client certificates to elasticsearch Implement fleet-server and elastic-agent support for presenting client certificates to elasticsearch Apr 2, 2026
@github-actions

github-actions Bot commented Apr 2, 2026

🔍 Preview links for changed docs

@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 65b9ae2 to 098f7f6 Compare April 2, 2026 10:44
@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.2,s=8.19.13,s=9.2.7

@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 098f7f6 to f253875 Compare April 2, 2026 16:59
@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.2,s=8.19.13,s=9.2.7

@pkoutsovasilis pkoutsovasilis force-pushed the feat/mtls_fleet_elasticagent branch from 5e66fd4 to 267d115 Compare April 9, 2026 04:35
@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.2.8

return nil, nil
}

ref := esAssociation.AssociationRef()
Contributor

@kvalliyurnatt kvalliyurnatt Apr 9, 2026

Please ignore if this does not make sense: will this ref be the NamespacedName of an Elasticsearch object, or can it also return a secret (external reference, unmanaged Elasticsearch connection secret)? Asking because below we assume this returns an Elasticsearch object.

Contributor Author

nice catch!! yes that was a gap and I fixed it in 4a3c2bc thx
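
A minimal sketch of the kind of guard discussed in this thread, assuming a simplified reference type. `objectRef`, its fields, and `lookupKey` are hypothetical names for illustration; this is not the code from the fix commit.

```go
package main

import "fmt"

// objectRef is a hypothetical, simplified association reference.
// A non-empty SecretName marks a reference to an external (unmanaged)
// Elasticsearch, for which no Elasticsearch object exists in the cluster.
type objectRef struct {
	Namespace  string
	Name       string
	SecretName string
}

// IsExternal reports whether the reference points at a connection secret
// rather than at a managed Elasticsearch object.
func (r objectRef) IsExternal() bool { return r.SecretName != "" }

// lookupKey returns the namespace/name key used to fetch the Elasticsearch
// object, and false when the reference is external and must not be fetched.
func lookupKey(r objectRef) (string, bool) {
	if r.IsExternal() {
		return "", false
	}
	return r.Namespace + "/" + r.Name, true
}

func main() {
	managed := objectRef{Namespace: "default", Name: "elasticsearch"}
	external := objectRef{Namespace: "default", SecretName: "external-es"}

	fmt.Println(lookupKey(managed))
	fmt.Println(lookupKey(external))
}
```

Returning early for external references keeps the caller from issuing a Get for an object that cannot exist.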

pebrc previously approved these changes Apr 10, 2026
Collaborator

@pebrc pebrc left a comment

The PR adds a well-structured mTLS client certificate system for fleet-server and elastic-agent connections to Elasticsearch. The core certificate lifecycle (generation, rotation, trust bundle assembly, cleanup) builds cleanly on top of the existing certificates.Reconciler infrastructure. The transitive association chain (Agent → Fleet Server → ES) is handled through a clear callback mechanism (ReconcileTransitiveESSecrets) that avoids coupling the generic association reconciler to agent-specific logic.

Integration testing on a live cluster confirmed all three main paths work correctly: standalone agent with managed certs, fleet-managed agent with transitive cert propagation + Kibana ssl injection, and clean teardown when client auth is disabled.

Key suggestions:

  1. External ES ref safety: fleetManagedAgentTransitiveESRef does a c.Get on the ES NamespacedName without checking ref.IsExternal(), which would fail for external ES references. (medium)

  2. Kibana config injection style: The injectFleetOutput* functions use raw ucfg manipulation rather than the typed struct Unpack/MergeWith pattern used everywhere else in config_settings.go. This works but is inconsistent and harder to maintain. (low)

  3. URL matching: outputHostsContain uses exact string equality. A trailing slash or any other equivalent-but-not-identical URL form would silently skip injection with no log output. (low)
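
To make the URL matching point concrete, here is a rough sketch of a more tolerant comparison using only the Go standard library. `sameEndpoint` and `hostsContain` are hypothetical helpers written for this illustration; they are not the PR's outputHostsContain, and the normalization rules (case-insensitive scheme and host, trailing slashes ignored) are one possible choice among several.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sameEndpoint reports whether two URLs refer to the same endpoint after a
// light normalization: scheme and host are compared case-insensitively and
// trailing slashes on the path are ignored.
func sameEndpoint(a, b string) bool {
	ua, errA := url.Parse(a)
	ub, errB := url.Parse(b)
	if errA != nil || errB != nil {
		// Fall back to exact comparison when either side does not parse.
		return a == b
	}
	return strings.EqualFold(ua.Scheme, ub.Scheme) &&
		strings.EqualFold(ua.Host, ub.Host) &&
		strings.TrimRight(ua.Path, "/") == strings.TrimRight(ub.Path, "/")
}

// hostsContain reports whether any configured host matches esURL.
func hostsContain(hosts []string, esURL string) bool {
	for _, h := range hosts {
		if sameEndpoint(h, esURL) {
			return true
		}
	}
	return false
}

func main() {
	hosts := []string{"https://es.default.svc:9200/"}
	fmt.Println(hostsContain(hosts, "https://es.default.svc:9200"))
}
```

This sketch does not treat an omitted default port as equal to an explicit one; whether that matters depends on how the output hosts are written.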

Comment thread pkg/controller/association/controller/agent_fleetserver.go
Comment thread pkg/controller/association/controller/agent_fleetserver.go Outdated
Comment thread pkg/controller/agent/pod.go Outdated
Comment thread pkg/controller/association/controller/agent_fleetserver.go
Comment on lines +421 to +456
func injectFleetOutputClientCerts(cfg *settings.CanonicalConfig, esURL string) error {
	ucfgCfg := (*ucfg.Config)(cfg)
	opts := settings.Options

	for i := 0; ; i++ {
		output, childErr := ucfgCfg.Child("xpack.fleet.outputs", i, opts...)
		if childErr != nil {
			// childErr signals end of indexed children, not a real error
			break
		}

		outputType, err := output.String("type", -1, opts...)
		if err != nil || outputType != "elasticsearch" {
			continue
		}

		if !outputHostsContain(output, esURL) {
			continue
		}

		if has, _ := output.Has("ssl.certificate", -1, opts...); !has {
			if err := output.SetString("ssl.certificate", -1, path.Join(agent.FleetManagedAgentClientCertDir,
				certificates.CertFileName), opts...); err != nil {
				return err
			}
		}
		if has, _ := output.Has("ssl.key", -1, opts...); !has {
			if err := output.SetString("ssl.key", -1, path.Join(agent.FleetManagedAgentClientCertDir,
				certificates.KeyFileName), opts...); err != nil {
				return err
			}
		}
	}

	return nil //nolint:nilerr
}
Collaborator

suggestion (style): Both injectFleetOutputClientCerts and injectFleetOutputCertificateAuthorities use raw (*ucfg.Config) pointer casts with manual Child()/Has()/SetString() calls. This is the only place in config_settings.go that manipulates ucfg at this level — every other config section uses the idiomatic pattern of typed Go structs + Unpack/MustCanonicalConfig/MergeWith.

Consider aligning with the existing pattern, e.g.:

type fleetOutputSSL struct {
    Certificate            string   `config:"certificate,omitempty"`
    Key                    string   `config:"key,omitempty"`
    CertificateAuthorities []string `config:"certificate_authorities,omitempty"`
}

type fleetOutput struct {
    Type  string          `config:"type"`
    Hosts []string        `config:"hosts"`
    SSL   *fleetOutputSSL `config:"ssl,omitempty"`
}

type fleetOutputsWrapper struct {
    Outputs []fleetOutput `config:"xpack.fleet.outputs"`
}

Then unpack, mutate, merge back. This would also make the URL matching more obvious and testable on plain Go structs, and would be consistent with the approach PR #9127 already takes for its fleet output structs.

Contributor Author

@pkoutsovasilis pkoutsovasilis Apr 15, 2026

I don't think that the typed approach is gonna work out of the box here. Specifically when round-tripping through unpack → mutate → MergeWith, nil pointer fields leak as null and nil slices as [] in the rendered YAML. This is the same class of issue we hit in #8917 where ucfg replaces empty maps with nil during Unpack, causing downstream config rejection.

It's worth noting that #9127 takes the typed struct approach but for constructing new outputs from scratch, whereas this PR needs to selectively inject fields into existing user-provided outputs. I think the typed struct approach is what we want in general, but given the challenges of utilizing ucfg without introducing errors when modifying existing configs, I propose deferring this to a follow-up and sticking with raw ucfg here.

unit-test that proves the nil leaking

func Test_setting(t *testing.T) {
	type SSL struct {
		Certificate            *string  `config:"certificate"`
		Key                    *string  `config:"key"`
		CertificateAuthorities []string `config:"certificate_authorities"`
	}

	type Output struct {
		Type  string         `config:"type"`
		Hosts []string       `config:"hosts"`
		SSL   *SSL           `config:"ssl"`
		Rest  map[string]any `config:",inline"`
	}

	type Wrapper struct {
		Outputs []Output `config:"xpack.fleet.outputs"`
	}

	// Start with a config that has extra fields (id, name, is_default)
	cfg := settings.MustCanonicalConfig(map[string]any{
		"xpack.fleet.outputs": []any{
			map[string]any{
				"type":                  "elasticsearch",
				"hosts":                 []any{"https://es.default.svc:9200"},
				"id":                    "eck-fleet-agent-output-elasticsearch",
				"name":                  "eck-elasticsearch",
				"is_default":            true,
				"is_default_monitoring": true,
			},
			map[string]any{
				"type":                  "elasticsearch",
				"hosts":                 []any{"https://es.default.svc:9200"},
				"id":                    "eck-fleet-agent-output-elasticsearch2",
				"name":                  "eck-elasticsearch",
				"is_default":            true,
				"is_default_monitoring": true,
			},
		},
	})

	// Unpack into typed struct
	var wrapper Wrapper
	require.NoError(t, cfg.Unpack(&wrapper))

	// Verify unknown fields are captured in Rest
	require.Equal(t, "elasticsearch", wrapper.Outputs[0].Type)
	require.Contains(t, wrapper.Outputs[0].Rest, "id")
	require.Contains(t, wrapper.Outputs[0].Rest, "name")
	require.Contains(t, wrapper.Outputs[0].Rest, "is_default")
	require.Contains(t, wrapper.Outputs[0].Rest, "is_default_monitoring")
	require.Nil(t, wrapper.Outputs[0].SSL)

	// Set only cert+key, leave CertificateAuthorities nil
	certPath := "/path/to/cert"
	_ = certPath
	keyPath := "/path/to/key"
	wrapper.Outputs[0].SSL = &SSL{Key: &keyPath}

	// Remove old outputs and merge replacement
	ucfgCfg := (*ucfg.Config)(cfg)
	_, err := ucfgCfg.Remove("xpack.fleet.outputs", -1, settings.Options...)
	require.NoError(t, err)
	require.NoError(t, cfg.MergeWith(settings.MustCanonicalConfig(wrapper)))

	render, err := cfg.Render()
	require.NoError(t, err)
	fmt.Println(string(render))
}

PS: if I change Certificate *string `config:"certificate"` to Certificate string `config:"certificate"`, this will now render as an empty string "" 🙂
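
For contrast with the round-trip above, the in-place idea (touch only the keys that need setting, leave everything else as the user wrote it) can be sketched on plain Go maps. This is an illustration only, assuming a decoded output is a `map[string]any`; `setIfAbsent` is a hypothetical helper and does not use ucfg.

```go
package main

import "fmt"

// setIfAbsent writes value under key in the nested "ssl" map of output only
// when that key is not already present. It reports whether it wrote.
// If "ssl" is missing, or is not a map, a fresh map is stored in its place.
func setIfAbsent(output map[string]any, key, value string) bool {
	ssl, ok := output["ssl"].(map[string]any)
	if !ok {
		ssl = map[string]any{}
		output["ssl"] = ssl
	}
	if _, exists := ssl[key]; exists {
		return false
	}
	ssl[key] = value
	return true
}

func main() {
	// A user-provided output with extra fields and a user-set key.
	output := map[string]any{
		"type": "elasticsearch",
		"id":   "eck-fleet-agent-output-elasticsearch",
		"ssl":  map[string]any{"key": "/user/provided/key"},
	}

	setIfAbsent(output, "certificate", "/path/to/cert") // added
	setIfAbsent(output, "key", "/path/to/key")          // kept as the user set it

	fmt.Println(output)
}
```

Because no typed struct is involved, fields that were never set simply stay absent, and unknown fields such as id are untouched.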

Contributor Author

I have improved the url matching in c6da4e1

@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.4.0-SNAPSHOT

@pkoutsovasilis pkoutsovasilis requested a review from pebrc April 15, 2026 09:16
Comment thread pkg/controller/association/reconciler.go
@pkoutsovasilis pkoutsovasilis dismissed pebrc’s stale review April 15, 2026 14:49

implementation has changed

Collaborator

@pebrc pebrc left a comment

LGTM 🤞

esMutated := esBuilder.DeepCopy().WithMutatedFrom(&esBuilder)
esMutated.Elasticsearch.Spec.HTTP.TLS.Client.Authentication = false

esMutatedWrapped := test.WrappedBuilder{
Collaborator

Suggested change
- esMutatedWrapped := test.WrappedBuilder{
+ esWithoutLicense := test.WrappedBuilder{

Contributor Author

Unless I am missing something, the mutation only disables spec.http.tls.client.authentication on ES - the enterprise license installed by LicenseTestBuilder remains active throughout the mutation phase. IMO renaming to esWithoutLicense would be misleading - happy to rename if you had something else in mind though.

},
},
{
name: "conf is nil returns nil and cleans up",
Collaborator

Are we missing one case for the external ref?

Contributor Author

nice catch, added in 5f4e32d


outputCanonical := (*settings.CanonicalConfig)(output)

if err := outputCanonical.AppendString("ssl.certificate_authorities", caPath); err != nil {
Collaborator

nit: Should we make this idempotent? To avoid repeated appending? The client cert version has guards. I know that appending the same CA multiple times is benign here.

Contributor Author

@pkoutsovasilis pkoutsovasilis Apr 17, 2026

We've been here in the past; specifically, your comment on #9229 - the necessary code to guard against duplicate CAs increases the complexity without any actual gains and as you already mentioned appending the same CA multiple times is benign here. So I'm going to keep it the same just to be consistent.

PS: the client cert version doesn't have uniqueness guards per se, it checks if it's already set by the user otherwise it sets it.

Collaborator

Not to nit-pick my own nit-pick but it looks like AppendString has in-built idempotency. In any case we are good.
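
As an aside, idempotent append in the sense discussed in this thread can be sketched in a few lines. `appendUnique` is a hypothetical helper on a plain string slice; it only illustrates the property and says nothing about how AppendString is actually implemented.

```go
package main

import "fmt"

// appendUnique appends v to values unless an equal entry is already present,
// so calling it repeatedly with the same v leaves the slice unchanged.
func appendUnique(values []string, v string) []string {
	for _, existing := range values {
		if existing == v {
			return values
		}
	}
	return append(values, v)
}

func main() {
	// The CA path below is made up for the example.
	cas := []string{}
	cas = appendUnique(cas, "/mnt/ca/ca.crt")
	cas = appendUnique(cas, "/mnt/ca/ca.crt") // second call is a no-op
	fmt.Println(len(cas), cas)
}
```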

FleetServerPolicyID = "FLEET_SERVER_POLICY_ID"
FleetServerServiceToken = "FLEET_SERVER_SERVICE_TOKEN" //nolint:gosec

// FleetManagedAgentClientCertDir is the stable mount path for client certificates inside
Collaborator

Should we add a comment that this works because Fleet server is effectively constrained to a single ES output?

Contributor Author

added comment in 68315d5

@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.4.0-SNAPSHOT

@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestClientAuthRequired.*,E2E_TAGS=agent -m s=9.3.3,s=8.19.14,s=9.4.0-SNAPSHOT

@pkoutsovasilis
Contributor Author

Update: Fleet Agent mTLS e2e tests - expected long runtime during rolling restarts

Heads up on something I investigated regarding the new fleet agent client auth transition tests (TestClientAuthRequiredTransition_FleetAgent).

CI results

| Test | Time |
| --- | --- |
| TestClientAuthRequiredCustomCertificate_FleetAgent | 6m57s |
| TestClientAuthRequiredCustomCertificate_StandaloneAgent | 2m19s |
| TestClientAuthRequiredTransition_FleetAgent | -> 12m18s <- |
| TestClientAuthRequiredTransition_StandaloneAgent | 4m19s |

The TestClientAuthRequiredTransition_FleetAgent test is significantly slower than the others at ~12 minutes. AFAICT this is not due to anything in this PR; it's due to how ES handles JVM shutdown with many indices during the rolling restart triggered by the mTLS → no-mTLS mutation.

What's happening

When an ES node receives SIGTERM, IndicesService.doStop() flushes and closes all local shards using a fixed 5-thread pool. Each shard calls engine.flushAndClose() which commits translog data to Lucene segments and fsyncs to disk sequentially within each index.

A Fleet deployment creates ~22 data streams from agent self-monitoring (logs-elastic_agent.*, metrics-elastic_agent.*, metrics-system.*, etc.), plus Kibana system indices and alert indices - totaling ~71 indices. Each ES node hosts ~47 shards that must all be flushed on shutdown.

I confirmed this with DEBUG engine logging — the shutdown is entirely spent in flush operations across 4 threads spanning ~80 seconds per node. Even with PRE_STOP_ADDITIONAL_WAIT_SECONDS=0 (as e2e tests set), the ES JVM shutdown alone takes ~130s per node — and on subsequent pods it reaches the 180s terminationGracePeriodSeconds limit, meaning pods are being SIGKILLed rather than shutting down gracefully. With 3 pods restarting sequentially (each waiting for the previous to rejoin and the cluster to go green), the rolling restart portion alone accounts for several minutes.
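
As a back-of-the-envelope check of the "several minutes" figure, the shutdown part of the rolling restart can be bounded from the numbers above (3 pods, ~130s JVM shutdown per node, 180s terminationGracePeriodSeconds). The sketch below only multiplies those figures; `shutdownBounds` is a hypothetical helper, and it ignores node startup and the wait for the cluster to go green between pods.

```go
package main

import (
	"fmt"
	"time"
)

// shutdownBounds returns a lower and an upper estimate for the time spent
// shutting nodes down during a sequential rolling restart: every pod takes
// at least perNodeSeconds and at most graceSeconds before it is killed.
func shutdownBounds(pods, perNodeSeconds, graceSeconds int) (low, high time.Duration) {
	low = time.Duration(pods*perNodeSeconds) * time.Second
	high = time.Duration(pods*graceSeconds) * time.Second
	return low, high
}

func main() {
	low, high := shutdownBounds(3, 130, 180)
	fmt.Println(low, high) // prints "6m30s 9m0s"
}
```

So shutdown alone would plausibly account for somewhere between six and a half and nine minutes of the ~12 minute run, under these assumptions.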

The standalone agent tests are faster because they don't involve Fleet Server + Kibana + fleet-managed agents, so fewer data streams are created and the overall stack is simpler.

What I tried locally

  • Enabling mmapfs (removing allow_mmap: false): No difference - the bottleneck is not storage I/O type, it's the sequential shard flush.
  • Reducing index count by disabling monitoring_enabled on Fleet agent policies: Tried this to see if fewer indices would reduce shutdown time. The idea is sound (fewer shards to flush = faster shutdown), but Fleet defaults monitoring to enabled when the field is omitted - it needs to be explicitly set to monitoring_enabled: []. Also, removing monitoring means WithFleetAgentDataStreamsValidation() can't be used, so validation needs to target other data streams (system.cpu from the system package).

As shown here the e2e tests pass but I am unsure about the impact on the CI run time, @pebrc thoughts? 🙂

@pebrc
Collaborator

pebrc commented Apr 22, 2026

> As shown here the e2e tests pass but I am unsure about the impact on the CI run time, @pebrc thoughts? 🙂

Let's accept the runtime increase in this PR and create a dedicated issue that looks at our e2e tests holistically to see how we can speed up the runs.

@pkoutsovasilis
Contributor Author

> Let's accept the runtime increase in this PR and create a dedicated issue that looks at our e2e tests holistically to see how we can speed up the runs.

ok I created this issue

@pkoutsovasilis pkoutsovasilis merged commit 8379ed2 into main Apr 22, 2026
9 checks passed
@pkoutsovasilis pkoutsovasilis deleted the feat/mtls_fleet_elasticagent branch April 22, 2026 10:03
Labels

>feature Adds or discusses adding a feature to the product v3.5.0 (next)

4 participants