
automatic troubleshooting endpoint context docs#728

Open
joeypoon wants to merge 4 commits into main from feature/at-sdh-context-docs-v2

Conversation

@joeypoon (Member)

Change Summary

Adds Automatic Troubleshooting context docs useful for troubleshooting common endpoint issues.

@joeypoon joeypoon requested a review from a team as a code owner March 13, 2026 01:15
@joeypoon joeypoon requested review from pzl and tomsonpl March 13, 2026 01:15
@ferullo ferullo (Contributor) left a comment

There's a lot to review here so I'm going to submit a review for each file. The first one is done. It's reasonable/expected to sit on my recommendations until an Endpoint developer reviews too (so they can see my comments and contradict me, etc).


## Summary

Elastic Defend uses a kernel-mode driver (`elastic_endpoint_driver.sys`) for file system filtering, network monitoring, and process/object callbacks. Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures. Most BSOD issues traced to the endpoint driver fall into a few categories: regressions introduced in specific driver versions, conflicts with other kernel-mode drivers (third-party security products), or running on unsupported OS versions.

Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures.

Does an LLM need to be told that?


Memory dump analysis with WinDbg (`!analyze -v`) is essential for root-cause determination. The bugcheck code alone is not sufficient — the faulting call stack identifies which code path triggered the crash.

I don't expect customers to do this. Perhaps text that describes how to collect a memory dump and a note that says to share it with us?


### ODX-enabled volume crash (8.19.8, 9.1.8, 9.2.2)

A regression introduced in versions 8.19.8, 9.1.8, and 9.2.2 causes BSODs on systems with ODX (Offloaded Data Transfer) enabled volumes, particularly affecting Hyper-V clusters and Windows Server 2016 Datacenter. The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`.

The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`.

I doubt this is useful info for a Kibana user.


A regression in the network driver introduced in Elastic Defend versions 8.17.8, 8.18.3, and 9.0.3 can cause kernel pool corruption on systems with a large number of long-lived network connections that remain inactive for 30+ minutes. The corruption manifests as BSODs with various bugcheck codes including `IRQL_NOT_LESS_OR_EQUAL`, `SYSTEM_SERVICE_EXCEPTION`, `KERNEL_MODE_HEAP_CORRUPTION`, or `PAGE_FAULT_IN_NONPAGED_AREA`.

This is the most frequently reported BSOD pattern and affects Windows Server environments with persistent connections (e.g. database servers, backup servers running Veeam with PostgreSQL). The kernel pool may already be corrupted when the driver attempts a routine memory allocation, causing the crash to appear in unrelated code paths.

I doubt this paragraph is useful info for a Kibana user.


**Fixed version**: 9.2.4.

**Mitigation**: Downgrade to the prior agent version (e.g. 9.2.1) until the upgrade to 9.2.4+ can be performed.

Agent doesn't support downgrade (or am I mistaken?). This should say "Upgrade to a version with the fix", I think.


Other security products running kernel-mode drivers can interfere with Elastic Defend's driver initialization or runtime operation. The most commonly reported conflicts include:

- **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).

Suggested change
- **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).
- **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. This interaction was introduced by an Elastic Defend refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).

Question for Defend developers: is "add a Trellix exclusion for the BFE service (svchost.exe)." an acceptable recommendation?


- **CrowdStrike, Kaspersky, Windows Defender coexistence**: Running multiple endpoint security products increases the probability of kernel-level interactions. Each additional kernel-mode filter driver introduces another point of contention for file system, registry, and network callbacks. When BSODs occur on systems with multiple security products, simplify by removing redundant products.

- **High third-party driver count**: Systems with an unusually high number of third-party kernel drivers (e.g. 168+ drivers) amplify the risk of pool corruption being attributed to or triggered by any one driver. Enable Driver Verifier on suspect third-party drivers to isolate the true source.

IDK, is this actionable by users?


### Unsupported OS version

Upgrading Elastic Defend to a version that dropped support for the host's Windows version causes immediate BSODs or boot loops. The most common case is upgrading to 8.13+ on Windows Server 2012 R2, which lost support in that release. The system crashes during driver load because the driver uses kernel APIs unavailable on the older OS.

Something is missing here; we added support for Windows Server 2012 R2 back in 8.16.0.


In some cases the endpoint driver causes a system deadlock rather than a classic BSOD. The system becomes completely unresponsive — applications freeze, Task Manager hangs, and the Elastic service cannot be stopped. This typically requires a hard reboot. A kernel memory dump captured during the lockup (via keyboard-initiated crash: right Ctrl + Scroll Lock twice) is required for diagnosis.

This pattern has been observed when the driver's file system filter processing enters a long-running or blocking state while monitoring specific applications. If the lockup is reproducible with a specific application, adding that application as a Trusted Application may resolve the conflict.

This paragraph doesn't seem useful to a user.


## Investigation priorities

1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined.

Suggested change
1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined.
1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Share the dump with Elastic.


## Symptom

A custom notification message has been configured in the Elastic Defend Device Control policy to display when a USB device is blocked, but the Windows system tray popup does not appear. Instead, the user sees only a generic Windows Explorer error stating the device is not accessible. Alternatively, device-specific allow/block rules based on `device.serial_number` do not match the intended device because the serial number field contains `0` or a seemingly random value.

These two situations are completely different. Should a single MD doc cover them both or is it better to break this doc up? I'm holding off on reading this file until this is answered.

I have no personal preference, I'm just bringing this up in case it helps with context windows.


## Summary

Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.
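The cost model implied by this paragraph can be sketched in a few lines: each event is hashed (with a cache) and then checked against every behavioral rule, so a burst of short-lived children multiplies rule evaluations. This is a toy model with illustrative names, not the product's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class ProcessEvent:
    executable: str
    parent: str

def process_event(event, rules, hash_cache):
    # Enrichment/hashing step: reuse the cache when this binary was seen before.
    digest = hash_cache.get(event.executable)
    if digest is None:
        digest = "sha256-of:" + event.executable  # stand-in for real file hashing
        hash_cache[event.executable] = digest
    # Rule evaluation step: every behavioral rule predicate runs for every
    # event, which is why a burst of short-lived children is expensive.
    matches = [name for name, pred in rules.items() if pred(event)]
    return digest, matches

rules = {"curl-spawned-by-script": lambda e: e.executable.endswith("curl")}
cache = {}
burst = [ProcessEvent("/usr/bin/curl", "/opt/monitor.sh") for _ in range(100)]
results = [process_event(e, rules, cache) for e in burst]
```

Note that the hash cache collapses the hashing cost for repeated executables, but rule evaluation still scales with event volume.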

Suggested change
Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.
Elastic Defend on Linux uses eBPF or tracefs to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.


CPU returns to normal within approximately 40 seconds after connectivity is restored. The command `sudo /opt/Elastic/Endpoint/elastic-endpoint test output` can be used to verify output connectivity — on affected versions this command itself will spike CPU when the output is unreachable.

Upgrade to 8.13.4+ where the retry loop includes proper backoff. Check `logs-elastic_agent.endpoint_security-*` for the error patterns above.

8.13.4 is really old, do we need to call this out?

The endpoint will report status as CONFIGURING during this time:
- `Endpoint is setting status to CONFIGURING, reason: Policy Application Status`

On first run with an empty cache, the CONFIGURING phase can take 5–30 minutes depending on the number and size of running processes. This is expected behavior. Subsequent restarts are fast because the cache persists.

5–30 minutes? Is that supposed to be seconds?


I suppose not, because 30 seconds wouldn't be a problem.


## Symptom

An Endpoint Alert Exception has been created in Elastic Defend, but the endpoint continues to generate alerts or block processes that should be excluded. Alert documents matching the exception criteria still appear in `logs-endpoint.alerts-*`.

We should add another doc for when Alert documents do not appear in logs-endpoint.alerts-*

  • A shipping error from Endpoint
  • A configuration error in Kafka or Logstash
  • Actually Endpoint isn't actively blocking activity, it's caused by some sort of interaction with another AV or process -- targeted Trusted Apps is the most common solution


### Exception scoped to wrong integration policy

Endpoint Alert Exceptions can be scoped to specific integration policies ("per policy") or applied globally. If the exception is assigned to a policy that does not cover the affected endpoint, it has no effect on that endpoint.

This is not true. We plan to ship per-policy exceptions in 9.4 and that'll be opt-in.


### Exception on wrong cluster in cross-cluster search (CCS) setups

In architectures where Fleet-managed agents send data to a "data cluster" and analysts query via CCS from a separate "analyst cluster", Endpoint Alert Exceptions must be created on the data cluster (the one running Fleet Server). Exceptions created on the analyst cluster are not distributed to endpoints because CCS does not support Fleet Actions.

Suggested change
In architectures where Fleet-managed agents send data to a "data cluster" and analysts query via CCS from a separate "analyst cluster", Endpoint Alert Exceptions must be created on the data cluster (the one running Fleet Server). Exceptions created on the analyst cluster are not distributed to endpoints because CCS does not support Fleet Actions.
In some architectures, Fleet-managed agents are managed via an "analyst cluster" but write bulk data to a "data cluster". In these architectures, where analysts query data via CCS from the data cluster, Endpoint Alert Exceptions must be created on the analyst cluster (the one running Fleet Server). Exceptions created on the data cluster are not distributed to endpoints because CCS does not support Fleet Actions.


If exceptions exist only on the analyst cluster, alert documents will continue to be generated by endpoints and indexed on the data cluster. The analyst cluster will then alert on those documents via CCS.

Suggested change
If exceptions exist only on the analyst cluster, alert documents will continue to be generated by endpoints and indexed on the data cluster. The analyst cluster will then alert on those documents via CCS.
If exceptions exist only on the data cluster, alert documents will continue to be generated by endpoints and indexed on the data cluster. The analyst cluster will then alert on those documents via CCS.

Comment on lines +67 to +73
### Exception not suppressing Kibana detection rule alerts

Endpoint Alert Exceptions suppress alerts generated by the endpoint itself — they prevent the endpoint from creating alert documents in `logs-endpoint.alerts-*`. However, if a Kibana detection rule (e.g. the prebuilt "Malware Detection Alert" rule) is configured to fire on endpoint alert documents, that Kibana rule generates its own separate alert documents.

If the endpoint exception is working correctly (no new documents in `logs-endpoint.alerts-*`), but alerts still appear in the Kibana Alerts page, the source is the Kibana detection rule, not the endpoint. In this case, add a Detection Rule Exception on the Kibana rule itself, or disable the redundant detection rule.

To distinguish: check the `kibana.alert.rule.rule_type_id` field. Endpoint-generated alerts have `event.module: endpoint` and `event.dataset: endpoint.alerts`. Kibana detection rule alerts have fields like `kibana.alert.rule.category: "Custom Query Rule"` and `kibana.alert.rule.producer: "siem"`.

This whole section makes no sense. It's not possible for the SIEM rule to create an alert if there is no underlying Endpoint alert in logs-endpoint.alerts-*.

Maybe we mean to say alerts created by most SIEM rules need SIEM exceptions not Endpoint exceptions?


To suppress on Elastic Defend 9.2+: add the process as a **Trusted Application** (this disables behavioral detections for the process entirely) OR add an **Endpoint Alert Exception** targeting the specific rule ID and process if you want to suppress only that one rule.

To suppress on pre-9.2: Trusted Applications do not suppress behavioral detections on these older versions. Use an **Endpoint Alert Exception** targeting `rule.id` and the relevant process fields.

Suggested change
To suppress on pre-9.2: Trusted Applications do not suppress behavioral detections on these older versions. Use an **Endpoint Alert Exception** targeting `rule.id` and the relevant process fields.

Comment on lines +66 to +67
- A third-party driver or application causes memory corruption that makes non-executable memory pages appear executable, leading to unexpected scanning of heap regions.
- Faulty RAM causes bit flips in page table entries, changing page protection bits and exposing non-code memory to the scanner.

Suggested change
- A third-party driver or application causes memory corruption that makes non-executable memory pages appear executable, leading to unexpected scanning of heap regions.
- Faulty RAM causes bit flips in page table entries, changing page protection bits and exposing non-code memory to the scanner.

Comment on lines +73 to +79
### Alert field data corrupted by product bug

In rare cases, the endpoint may populate alert fields with corrupted data (e.g. truncated `dll.path` or `dll.name` values due to kernel-level memory corruption from a third-party driver). When the corrupted field values happen to match a behavioral detection rule's criteria — for example, a DLL path truncated to appear as if it has an unusual extension — the rule fires on data that does not reflect the actual system state.

These are product bugs, not fixable via exceptions in the general case. However, a targeted Endpoint Alert Exception can suppress the specific false-positive pattern as a workaround. For example, suppressing a rule when the DLL's code signature status indicates it was never actually loaded (`dll.code_signature.status: "errorCode_endpoint: Initital state, no attempt to load signature was made"`).

If you encounter alerts where field values appear truncated or nonsensical, collect agent diagnostics and the full alert document for Elastic Support. Check for third-party kernel drivers that may be causing memory corruption — the Windows Driver Verifier can help identify misbehaving drivers on non-production systems.

This was a very edge case, I think we should remove it. It won't generalize well.

Suggested change
### Alert field data corrupted by product bug
In rare cases, the endpoint may populate alert fields with corrupted data (e.g. truncated `dll.path` or `dll.name` values due to kernel-level memory corruption from a third-party driver). When the corrupted field values happen to match a behavioral detection rule's criteria — for example, a DLL path truncated to appear as if it has an unusual extension — the rule fires on data that does not reflect the actual system state.
These are product bugs, not fixable via exceptions in the general case. However, a targeted Endpoint Alert Exception can suppress the specific false-positive pattern as a workaround. For example, suppressing a rule when the DLL's code signature status indicates it was never actually loaded (`dll.code_signature.status: "errorCode_endpoint: Initital state, no attempt to load signature was made"`).
If you encounter alerts where field values appear truncated or nonsensical, collect agent diagnostics and the full alert document for Elastic Support. Check for third-party kernel drivers that may be causing memory corruption — the Windows Driver Verifier can help identify misbehaving drivers on non-production systems.


The Elastic Defend blocklist prevents known-malicious binaries from executing. However, if a process matches the **Global Exception List** before the blocklist is evaluated, the blocklist entry is skipped. The Global Exception List contains entries for major trusted software publishers (e.g. Microsoft). This means binaries signed by these vendors cannot be blocked via the user blocklist — the global exception takes priority.

This is a known product limitation. If the goal is to prevent specific trusted-signed binaries from running, the blocklist is not the correct mechanism. Consider using operating system-level application control policies (e.g. Windows AppLocker or WDAC) instead.
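The precedence described above can be sketched as a short decision function. This is a hypothetical illustration of the evaluation order (the Global Exception List is consulted before the user blocklist), with invented names and data shapes, not the product's actual logic:

```python
def execution_decision(signer, file_hash, global_exception_signers, user_blocklist):
    """Illustrative sketch: a Global Exception List match short-circuits
    blocking, so the user blocklist is never consulted for that binary."""
    if signer in global_exception_signers:
        return "allow"  # global exception wins; blocklist entry is skipped
    if file_hash in user_blocklist:
        return "block"
    return "allow"

GLOBAL_EXCEPTIONS = {"Microsoft Windows"}  # hypothetical trusted-publisher entry
BLOCKLIST = {"abc123"}                     # hypothetical blocklisted file hash
```

Under this model, a Microsoft-signed binary whose hash is on the blocklist is still allowed, which is the limitation the paragraph above describes.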

Suggested change
This is a known product limitation. If the goal is to prevent specific trusted-signed binaries from running, the blocklist is not the correct mechanism. Consider using operating system-level application control policies (e.g. Windows AppLocker or WDAC) instead.


## Investigation priorities

1) Determine the detection engine that generated the alert by checking `event.code`: `malicious_file` (malware engine), `behavior` (behavioral protection), `memory_signature` (memory scanning). This determines which artifact type to use for suppression.

Suggested change
1) Determine the detection engine that generated the alert by checking `event.code`: `malicious_file` (malware engine), `behavior` (behavioral protection), `memory_signature` (memory scanning). This determines which artifact type to use for suppression.
1) Determine the detection engine that generated the alert by checking `event.code`: `malicious_file` (malware protection), `behavior` (malicious behavior protection), `memory_signature` (memory threat protection), `ransomware` (ransomware protection). This determines which artifact type to use for suppression.
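The engine names in the suggested change above can be captured as a small lookup. This is a hypothetical helper (the mapping keys come from the `event.code` values named in this doc; the helper itself is not a product API):

```python
# Hypothetical lookup from event.code to the protection engine names used above.
ENGINE_BY_EVENT_CODE = {
    "malicious_file": "malware protection",
    "behavior": "malicious behavior protection",
    "memory_signature": "memory threat protection",
    "ransomware": "ransomware protection",
}

def detection_engine(alert: dict) -> str:
    """Return the protection engine for an alert document, or 'unknown'."""
    return ENGINE_BY_EVENT_CODE.get(alert.get("event", {}).get("code"), "unknown")
```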


Monitoring and automation scripts that run on a schedule (cron, systemd timers) and spawn many child processes are the most common cause of high CPU on Linux. A single monitoring script invoking `curl`, `mysql`, `ssh`, `grep`, `sed`, `awk`, and `bash` in rapid succession generates a burst of process creation events, each of which Elastic Defend must enrich and evaluate against behavioral rules.

A typical pattern is hourly CPU spikes lasting 5–10 minutes, aligning with cron schedules (e.g. xx:31–xx:41 every hour). In one case, a script at `/var/cache/system-monitoring/helper/compare-inventory.sh` that used `curl` to collect data triggered the behavioral rule "Suspicious Download and Redirect by Web Server" 86,340 times in a single diagnostics window, driving endpoint service CPU above 200% (out of 800% on 8 cores).

Suggested change
A typical pattern is hourly CPU spikes lasting 5–10 minutes, aligning with cron schedules (e.g. xx:31–xx:41 every hour). In one case, a script at `/var/cache/system-monitoring/helper/compare-inventory.sh` that used `curl` to collect data triggered the behavioral rule "Suspicious Download and Redirect by Web Server" 86,340 times in a single diagnostics window, driving endpoint service CPU above 200% (out of 800% on 8 cores).
A typical pattern is hourly CPU spikes lasting 5–10 minutes, aligning with cron schedules (e.g. xx:31–xx:41 every hour).
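One way to find the script behind such a burst is to tally alert documents by parent executable. A minimal sketch, assuming ECS-style alert documents with a `process.parent.executable` field (the helper name is invented):

```python
from collections import Counter

def noisiest_parents(alerts, top=3):
    """Count alerts per parent executable to surface the cron script
    responsible for a burst. Missing fields fall back to '(unknown)'."""
    counts = Counter(
        a.get("process", {}).get("parent", {}).get("executable", "(unknown)")
        for a in alerts
    )
    return counts.most_common(top)

# Illustrative sample: one noisy monitoring script, some background noise.
alerts = (
    [{"process": {"parent": {"executable": "/opt/monitor.sh"}}}] * 5
    + [{"process": {"parent": {"executable": "/usr/bin/bash"}}}] * 2
)
```

Running this over a diagnostics window of alert documents makes the dominant parent obvious, which then becomes the candidate for a Trusted Application or targeted exception.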


Adding the parent script as a Trusted Application stops monitoring of its process tree but does not prevent behavioral rules from firing if the rule matches on child process characteristics. On versions prior to 9.2, behavioral detections still fire for trusted processes. On 9.2+, behavioral detections are disabled for Trusted Applications.

@nicholasberlin Trusted Application Descendants or Event Filter Descendants will help here, right?

- Create an **Endpoint Alert Exception** targeting the specific rule ID and parent process:
- `rule.id IS <rule-id>` AND `process.parent.executable IS /path/to/script`
- Upgrade to 8.19.11+ or 9.2+ for improved handling of trusted process behavioral rules.
- Review `linux.advanced.events` settings to disable unnecessary event types (e.g. `linux.advanced.events.dns` if DNS monitoring is not needed).

Suggested change
- Review `linux.advanced.events` settings to disable unnecessary event types (e.g. `linux.advanced.events.dns` if DNS monitoring is not needed).
- Review `linux.advanced.events` settings to disable unnecessary event types (e.g. `linux.advanced.events.dns` if malicious behavior rules based on DNS monitoring are not needed).

- Use **Event Filters** to reduce event volume from known-noisy directories without creating a monitoring blind spot.

Suggested change
- Use **Event Filters** to reduce event volume from known-noisy directories without creating a monitoring blind spot.
- Use **Event Filters** to reduce event volume from known-noisy directories without creating an active protection blind spot.


### Events plugin hung hashing large binaries during policy application

During policy application, Elastic Defend hashes all running processes and their executables. When the file cache (`/opt/Elastic/Endpoint/state/cache.db`) is empty — first install, cache deleted, or after upgrade — the endpoint must hash every binary from scratch. Large binaries (e.g. Oracle at `/data/app/oracle/product/*/bin/oracle`) can cause the Events plugin to hang in `ConfigurationCallback` while performing SHA1 hashing, driving CPU to 100% for extended periods.

Suggested change
During policy application, Elastic Defend hashes all running processes and their executables. When the file cache (`/opt/Elastic/Endpoint/state/cache.db`) is empty — first install, cache deleted, or after upgrade — the endpoint must hash every binary from scratch. Large binaries (e.g. Oracle at `/data/app/oracle/product/*/bin/oracle`) can cause the Events plugin to hang in `ConfigurationCallback` while performing SHA1 hashing, driving CPU to 100% for extended periods.
During policy application, Elastic Defend hashes all running processes and their executables. When the file cache (`/opt/Elastic/Endpoint/state/cache.db`) is empty — first install, cache deleted, or after upgrade — the endpoint must hash every binary from scratch. Large binaries (e.g. Oracle at `/data/app/oracle/product/*/bin/oracle`) can drive CPU to 100% for extended periods.

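The cold-cache hashing pass described above is conceptually a chunked SHA-1 over each running process's executable, which is why a multi-hundred-megabyte binary dominates CPU until the cache is warm. A minimal sketch (not Endpoint's actual implementation; the function name and chunk size are illustrative):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks, the way a cold-cache hashing
    pass might walk each running process's executable."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

With a warm cache keyed on file identity, the stored digest is reused instead of recomputed, which is the role `/opt/Elastic/Endpoint/state/cache.db` plays.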
@tomsonpl reviewed:

> Looks alright, left a few questions just for clarity's sake :)

---
type: automatic_troubleshooting
sub_type: trusted_apps
link: https://www.elastic.co/docs/solutions/security/manage-elastic-defend/optimize-elastic-defend
> I see that only 3 files contain the `link` field, just checking if this is intentional?

---
type: automatic_troubleshooting
sub_type: incompatible_software
date: '2026-03-11'
> no OS field here?


## Summary

Elastic Defend performs real-time file hashing, digital signature verification, memory scanning, behavioral rule evaluation, and event enrichment. Each protection layer adds processing overhead, and the cumulative effect depends on the workload profile of the host. The most common drivers of high CPU on Windows are third-party security product conflicts creating mutual scanning loops, high-volume security events (logon/logoff) overwhelming the rules engine, file-intensive operations triggering repeated hashing of large binaries, and VDI/Citrix environments where the file metadata cache is empty on every session.
**Suggested change:**

```diff
- Elastic Defend performs real-time file hashing, digital signature verification, memory scanning, behavioral rule evaluation, and event enrichment. Each protection layer adds processing overhead, and the cumulative effect depends on the workload profile of the host. The most common drivers of high CPU on Windows are third-party security product conflicts creating mutual scanning loops, high-volume security events (logon/logoff) overwhelming the rules engine, file-intensive operations triggering repeated hashing of large binaries, and VDI/Citrix environments where the file metadata cache is empty on every session.
+ Elastic Defend performs real-time file hashing, digital signature verification, memory scanning, behavioral rule evaluation, and event enrichment. Each protection layer adds processing overhead, and the cumulative effect depends on the workload profile of the host. The most common drivers of high CPU on Windows are third-party security product conflicts creating mutual scanning loops, high-volume security events (logon/logoff) overwhelming the Malicious Behavior engine, file-intensive operations triggering repeated hashing of large binaries, and VDI/Citrix environments where the file metadata cache is empty on every session.
```


Elastic Defend hashes and verifies digital signatures of executables and DLLs when they are loaded. Large binaries like `msedge.dll` (195 MB) can take 10–15 seconds per hash operation. On hosts running browsers, Office applications, or developer tools that load many large DLLs, this creates sustained CPU spikes.

The endpoint maintains a file metadata cache to avoid re-hashing known files. On versions prior to 8.0.1, the cache size (`FILE_OBJECT_CACHE_SIZE`) was limited to 500 entries, leading to frequent cache evictions and redundant hashing. Upgrading to 8.0.1+ significantly improves caching behavior.
> pre 8.0.1 is so old it is no longer supported.

**Suggested change** (remove the paragraph):

```diff
- The endpoint maintains a file metadata cache to avoid re-hashing known files. On versions prior to 8.0.1, the cache size (`FILE_OBJECT_CACHE_SIZE`) was limited to 500 entries, leading to frequent cache evictions and redundant hashing. Upgrading to 8.0.1+ significantly improves caching behavior.
```

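The eviction pattern described above can be illustrated with a toy bounded cache: once the number of distinct files exceeds capacity, earlier entries are evicted and their hashes must be recomputed on the next pass. A conceptual sketch only (capacity matches the figure in the text; the class and API are illustrative, not Endpoint's code):

```python
from collections import OrderedDict

class BoundedHashCache:
    """Toy LRU cache: once more distinct keys than `capacity` are
    seen, the least recently used entry is evicted and its value
    must be recomputed on the next lookup."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        self.misses += 1
        value = compute(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict LRU entry
        return value

# Two sequential passes over 600 files through a 500-entry cache:
# every lookup in the second pass also misses, because each file is
# evicted again before it can be reused (classic LRU thrash).
cache = BoundedHashCache(capacity=500)
for _ in range(2):
    for i in range(600):
        cache.get_or_compute(f"file-{i}", lambda k: hash(k))
```

This is why a modest working set of binaries just above the cache limit produced sustained re-hashing rather than a one-time warm-up cost.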
For immediate relief on older versions, set `windows.advanced.kernel.asyncimageload` and `windows.advanced.kernel.syncimageload` to `false` in advanced policy settings. This reduces CPU by approximately 85% for DLL load processing at the cost of reduced visibility into library load events.
**Suggested change** (remove the paragraph):

```diff
- For immediate relief on older versions, set `windows.advanced.kernel.asyncimageload` and `windows.advanced.kernel.syncimageload` to `false` in advanced policy settings. This reduces CPU by approximately 85% for DLL load processing at the cost of reduced visibility into library load events.
```
