automatic troubleshooting endpoint context docs #728
Merged
7 commits
79ea268 add high_cpu, missed_checkins, bsod, trustedapps context docs (joeypoon)
d4c162e add endpoint_exceptions and incompatible_software context docs (joeypoon)
537edfe add output_config context doc for Kafka/Logstash output troubleshooting (joeypoon)
c0a6c43 add device_control context doc for notification and serial number issues (joeypoon)
1d7028e address comments (joeypoon)
7737064 Apply suggestions from code review (joeypoon)
63e09e7 Apply suggestion from @ferullo (joeypoon)
69 changes: 69 additions & 0 deletions
package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| --- | ||
| type: automatic_troubleshooting | ||
| sub_type: bsod | ||
| os: [Windows] | ||
| date: '2026-03-09' | ||
| --- | ||
|
|
||
| ## Symptom | ||
|
|
||
| A Windows system running Elastic Defend experiences a Blue Screen of Death (BSOD) or kernel crash. The memory dump analysis references `elastic_endpoint_driver.sys` or `elastic-endpoint-driver.sys`. The crash may occur shortly after an agent upgrade, after a third-party security product is installed or reconfigured, during heavy I/O workloads, or after the system has been running for some time with many network connections. In severe cases the system enters a boot loop. | ||
|
|
||
|
|
||
| ## Summary | ||
|
|
||
| Elastic Defend uses a kernel-mode driver (`elastic_endpoint_driver.sys`) for file system filtering, network monitoring, and process/object callbacks. Most BSOD issues traced to the endpoint driver fall into a few categories: regressions introduced in specific driver versions, conflicts with other kernel-mode drivers (third-party security products), or running on unsupported OS versions. | ||
|
|
||
| Collecting a full kernel memory dump and sharing it with Elastic Support is essential for root-cause determination. The bugcheck code alone is not sufficient: the faulting call stack identifies which code path triggered the crash. The presence of Elastic Defend in the call stack does not by itself mean it is responsible for the crash. | ||
|
|
||
|
|
||
| ## Common issues | ||
|
|
||
| ### Network driver pool corruption (8.17.8, 8.18.3, 9.0.3) | ||
|
|
||
| A regression in the network driver introduced in Elastic Defend versions 8.17.8, 8.18.3, and 9.0.3 can cause kernel pool corruption on systems with a large number of long-lived network connections that remain inactive for 30+ minutes. The corruption manifests as BSODs with various bugcheck codes including `IRQL_NOT_LESS_OR_EQUAL`, `SYSTEM_SERVICE_EXCEPTION`, `KERNEL_MODE_HEAP_CORRUPTION`, or `PAGE_FAULT_IN_NONPAGED_AREA`. | ||
|
|
||
| This is the most frequently reported BSOD pattern and affects Windows Server environments with persistent connections (e.g. database servers, backup servers running Veeam with PostgreSQL). | ||
|
|
||
| **Affected versions**: 8.17.8, 8.18.3, 9.0.3 only. | ||
|
|
||
| **Fixed versions**: 8.17.9, 8.18.4, 9.0.4. Hotfix builds are also available: 8.18.3+build202507101319 and 9.0.3+build202507110136. | ||
|
|
||
| **Mitigation**: Upgrade to a fixed version. If immediate upgrade is not possible, set `advanced.kernel.network: false` in the Elastic Defend advanced policy settings to disable the kernel network driver. | ||
|
|
||
| ### ODX-enabled volume crash (8.19.8, 9.1.8, 9.2.2) | ||
|
|
||
| A regression introduced in versions 8.19.8, 9.1.8, and 9.2.2 causes BSODs on systems with ODX (Offloaded Data Transfer) enabled volumes, particularly affecting Hyper-V clusters and Windows Server 2016 Datacenter. The crash can appear 2-3 hours after an agent upgrade, often triggered when the storage subsystem processes asynchronous offload write operations. | ||
|
|
||
| **Affected versions**: 8.19.8, 9.1.8, 9.2.2 only. | ||
|
|
||
| **Fixed version**: 9.2.4. | ||
|
|
||
| **Mitigation**: Upgrade to 9.2.4 or later, which contains the fix. | ||
|
|
||
| ### Third-party kernel driver conflicts | ||
|
|
||
| Other security products running kernel-mode drivers can interfere with Elastic Defend's driver initialization or runtime operation. The most commonly reported conflicts include: | ||
|
|
||
| - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. This interaction was introduced by an Elastic Defend refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Upgrade to a fixed version to resolve. | ||
|
|
||
| - **CrowdStrike, Kaspersky, Windows Defender coexistence**: Running multiple endpoint security products increases the probability of kernel-level interactions. Each additional kernel-mode filter driver introduces another point of contention for file system, registry, and network callbacks. When BSODs occur on systems with multiple security products, simplify by removing redundant products. | ||
|
|
||
| ### Unsupported OS version | ||
|
|
||
| Upgrading Elastic Defend to a version that does not support the host's Windows version causes immediate BSODs or boot loops. Support for Windows Server 2012 R2 was dropped in 8.13.0 and re-added in 8.16.0. The system crashes during driver load because the driver uses kernel APIs unavailable on the older OS. | ||
|
|
||
| **Recovery**: Boot into Safe Mode or the Windows Recovery Console and delete `C:\Windows\System32\drivers\elastic-endpoint-driver.sys`. This prevents the driver from loading on the next boot. Then move the agent to a policy without the Elastic Defend integration, or upgrade to a version that re-added support (8.16.0+ for Windows Server 2012 R2). | ||
|
|
||
| **Prevention**: Check the [Elastic Defend support matrix](https://www.elastic.co/support/matrix) before upgrading agents across a fleet. Use separate agent policies for older OS versions that require pinned agent versions. | ||
|
|
||
| ## Investigation priorities | ||
|
|
||
| 1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Share the dump with Elastic. | ||
| 2) Check the Elastic Defend version at the time of crash. Query `.fleet-agents*` for the agent version and `metrics-endpoint.metadata_current_*` for the endpoint version and OS details. Cross-reference against the known affected versions listed above (8.17.8, 8.18.3, 9.0.3 for network driver; 8.19.8, 9.1.8, 9.2.2 for ODX). | ||
| 3) Determine whether the BSOD started after a specific agent or OS upgrade. Check `.fleet-agents*` for recent version changes and correlate with the crash timeline. | ||
| 4) Identify other kernel-mode security products installed on the system. Look for drivers like `klflt.sys` (Kaspersky), `mfehidk.sys` (Trellix/McAfee), `csagent.sys` (CrowdStrike), or other filter drivers in the WinDbg module list. | ||
| 5) Check the Windows version against the Elastic Defend support matrix. Query `metrics-endpoint.metadata_current_*` for `host.os.version` and `host.os.name`. | ||
| 6) Look for gaps in endpoint metadata timestamps in `metrics-endpoint.metadata_current_*` — an offline gap followed by a version change often indicates a crash-recovery-rollback sequence. | ||
| 7) Check `metrics-endpoint.policy-*` for `connect_kernel` failures, which indicate the driver failed to load or initialize properly after a crash. | ||
| 8) If the system is in a boot loop, guide the user to boot into Safe Mode, delete the driver file at `C:\Windows\System32\drivers\elastic-endpoint-driver.sys`, then boot normally and downgrade or uninstall the agent. |
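
The version cross-reference in step 2 can be sketched in Python. This is a minimal illustration, not part of Elastic Defend or the PR: the version numbers, hotfix builds, and the Windows Server 2012 R2 support window are copied from the doc above, while `KNOWN_REGRESSIONS`, `HOTFIX_BUILDS`, `matching_regressions`, and `supports_server_2012_r2` are hypothetical names. It assumes the version string has already been retrieved (for example from `.fleet-agents*`) and uses the `major.minor.patch` form with an optional `+build...` suffix.

```python
# Affected versions copied from the "Common issues" section of the doc.
# All names here are illustrative and not part of Elastic Defend.
KNOWN_REGRESSIONS = {
    "network driver pool corruption": {"8.17.8", "8.18.3", "9.0.3"},
    "ODX-enabled volume crash": {"8.19.8", "9.1.8", "9.2.2"},
}

# Hotfix builds share a base version with an affected release but contain the fix.
HOTFIX_BUILDS = {"8.18.3+build202507101319", "9.0.3+build202507110136"}


def matching_regressions(version: str) -> list[str]:
    """Return the names of known regressions that apply to this exact version."""
    version = version.strip()
    if version in HOTFIX_BUILDS:
        return []
    base = version.split("+", 1)[0]  # drop any build suffix
    return [name for name, affected in KNOWN_REGRESSIONS.items() if base in affected]


def supports_server_2012_r2(version: str) -> bool:
    """Support was dropped in 8.13.0 and re-added in 8.16.0, per the doc."""
    base = version.strip().split("+", 1)[0]
    parts = tuple(int(p) for p in base.split("."))
    return parts < (8, 13, 0) or parts >= (8, 16, 0)


if __name__ == "__main__":
    for v in ("8.18.3", "8.18.3+build202507101319", "9.2.2", "9.2.4"):
        print(v, matching_regressions(v))
```

An empty result only means the version is not on the known-regression lists. It does not rule out a third-party driver conflict, so the memory dump is still needed.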
65 changes: 65 additions & 0 deletions
package/endpoint/docs/knowledge_base/device_control/device_control_notification.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| --- | ||
| type: automatic_troubleshooting | ||
| sub_type: device_control | ||
| link: https://elastic.co/docs/solutions/security/configure-elastic-defend/configure-an-integration-policy-for-elastic-defend | ||
| os: [Windows] | ||
| date: '2026-03-11' | ||
| --- | ||
|
|
||
| ## Symptom | ||
|
|
||
| A custom notification message has been configured in the Elastic Defend Device Control policy to display when a USB device is blocked, but the Windows system tray popup does not appear. Instead, the user sees only a generic Windows Explorer error stating the device is not accessible. Alternatively, device-specific allow/block rules based on `device.serial_number` do not match the intended device because the serial number field contains `0` or a seemingly random value. | ||
|
|
||
|
|
||
| ## Summary | ||
|
|
||
| Elastic Defend's Device Control feature (available since 9.2) can block USB storage devices and display a custom notification message to the user. On Windows, USB mount events are generated by the SYSTEM service process rather than the interactive user session. Prior to 9.4.0, the notification popup was sent to the desktop session associated with the USB mount event — which is the SYSTEM service desktop. Because most systems do not have an interactive desktop for the SYSTEM service, the popup silently failed to display. | ||
|
|
||
| A separate limitation affects device-specific rules. The `device.serial_number` field is populated by querying the kernel driver via `IOCTL_STORAGE_QUERY_PROPERTY`, but USB serial numbers are inherently unreliable on Windows — many devices report `0`, an empty string, or a random value. This makes `device.serial_number` unsuitable as the sole identifier for device-specific allow/block rules. The `device.id` field contains the Windows PNP Device ID which is more consistently populated, but it has different semantics than a serial number and cannot be used interchangeably with `device.serial_number` in Device Control rules. | ||
|
|
||
|
|
||
| ## Common issues | ||
|
|
||
| ### Custom notification popup not appearing (pre-9.4.0) | ||
|
|
||
| When a USB device is blocked by Device Control, the endpoint generates a notification event and attempts to display the configured custom message as a Windows system tray popup. On versions prior to 9.4.0, the popup is targeted at the desktop session that originated the USB mount event. Because USB mount events come from the SYSTEM service process (`services.exe`, running in session 0), the popup is sent to the SYSTEM service desktop, which does not exist as an interactive session on most systems. The popup is created but never rendered. | ||
|
|
||
| This affects all Device Control block actions regardless of user privilege level. Even administrators logged into an interactive session do not see the popup because the popup is dispatched to the wrong session. | ||
|
|
||
| **Fixed in**: 9.4.0. The fix changes the behavior so that when a USB device event triggers a block notification, the popup is shown on all interactive desktop sessions instead of only the session that originated the mount event. | ||
|
|
||
| **Workaround** (pre-9.4.0): There is no workaround for displaying the custom notification. The device is still blocked — the user will see the standard Windows Explorer "device not accessible" error, which confirms the block is in effect even though the custom message is not shown. | ||
|
|
||
| ### Windows Do Not Disturb suppressing notifications | ||
|
|
||
| Even on 9.4.0+ where the popup is correctly dispatched to interactive desktops, Windows Do Not Disturb (Focus Assist) can suppress the notification. When DND is enabled, Windows silently drops or queues system tray popups from all applications, including Elastic Defend. | ||
|
|
||
| To diagnose: check Settings > System > Focus assist (Windows 10) or Settings > System > Notifications > Do not disturb (Windows 11/Server 2022+). If DND is enabled or configured for automatic rules (e.g. during presentations, full-screen apps), the notification will not appear until DND is disabled. | ||
|
|
||
| Also verify that notifications are enabled for the Elastic Endpoint application in Settings > System > Notifications. If the Elastic Endpoint application is explicitly set to "Off", no notifications will appear regardless of DND state. | ||
|
|
||
| ### `device.serial_number` unreliable for device-specific rules | ||
|
|
||
| The `device.serial_number` field frequently contains `0` or a random single-digit value, making device-specific Device Control rules ineffective when they rely on this field. This is not a bug in Elastic Defend — USB serial numbers are inherently unreliable on Windows. Many USB devices do not have manufacturer-programmed serial numbers, and the Windows storage stack returns a generated instance ID instead of a true serial number. | ||
|
|
||
| Elastic Defend queries the serial number from the kernel driver using `DeviceIoControl` with `IOCTL_STORAGE_QUERY_PROPERTY`. When the device does not report a serial number, the query returns `0` or an empty value. Some devices report inconsistent values across different USB ports or after re-enumeration. | ||
|
|
||
| **Workaround**: Instead of relying solely on `device.serial_number`, use a combination of `device.vendor_id` and `device.product_id` to identify device classes. Query `logs-endpoint.events.device-*` for the target device to check which fields are reliably populated. The `device.id` field (which contains the Windows PNP Device ID) is more consistently available, but it is not currently usable as a condition field in Device Control rules. | ||
|
|
||
| **Improvement planned**: Groundwork has been done to re-gather the serial number in user space after the device connects, which may improve reliability for devices that expose the serial number through registry enumeration rather than the kernel storage query. | ||
|
|
||
| ### Device Control rules not matching expected devices | ||
|
|
||
| When `device.serial_number` is unreliable, rules that use serial number conditions will either fail to match intended devices or unintentionally match unrelated devices that happen to share the same `0` or generated value. A rule configured to allow a specific USB drive by serial number `0` would match every device that reports `0` — effectively allowing all devices without true serial numbers. | ||
|
|
||
| Review Device Control rules that use `device.serial_number` conditions. For each rule, query `logs-endpoint.events.device-*` to verify the actual serial number value reported for the target device. If the value is `0`, empty, or inconsistent, switch to identifying the device by `device.vendor_id` and `device.product_id` instead. | ||
|
|
||
|
|
||
| ## Investigation priorities | ||
|
|
||
| 1) Confirm the Elastic Defend version. If pre-9.4.0, the missing notification popup is a known issue — upgrade to 9.4.0+ to resolve it. | ||
| 2) If on 9.4.0+ and notifications still do not appear, check Windows Do Not Disturb and notification settings on the affected endpoint. Verify DND is disabled and Elastic Endpoint notifications are enabled. | ||
| 3) For serial number issues, query `logs-endpoint.events.device-*` for the target device and inspect the `device.serial_number`, `device.id`, `device.vendor_id`, and `device.product_id` fields. Determine which fields are reliably populated and adjust Device Control rules accordingly. | ||
| 4) Verify the Device Control policy configuration via `get_package_configurations` — confirm that custom notification text is configured and that device rules reference fields with reliable values. | ||
| 5) Check `logs-endpoint.events.device-*` for device mount/unmount events to confirm Elastic Defend is detecting the USB device at all. If no device events are present, the Device Control feature may not be enabled in the policy. | ||
| 6) For block actions that should be working but are not preventing device access, confirm the Device Control mode is set to "Block" rather than "Detect" in the integration policy. | ||
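
The serial number check in step 3 can be sketched in Python. This is a rough illustration under stated assumptions: events are assumed to have already been fetched from `logs-endpoint.events.device-*` and parsed into dictionaries with a nested `device` object holding the field names used in the doc (`serial_number`, `vendor_id`, `product_id`). The helpers `is_suspect_serial` and `audit_serials` are hypothetical and not part of Elastic Defend.

```python
from collections import defaultdict


def is_suspect_serial(serial) -> bool:
    """True for values the doc calls unreliable: missing, empty, '0', or a single digit."""
    if serial is None:
        return True
    text = str(serial).strip()
    return text == "" or (len(text) == 1 and text.isdigit())


def audit_serials(events):
    """Map each questionable serial number to the vendor/product pairs that reported it.

    A serial is reported when it is suspect on its own, or when more than one
    distinct (vendor_id, product_id) pair shares it.
    """
    devices_by_serial = defaultdict(set)
    for event in events:
        device = event.get("device", {})
        serial = device.get("serial_number")
        key = "" if serial is None else str(serial).strip()
        devices_by_serial[key].add((device.get("vendor_id"), device.get("product_id")))

    report = {}
    for serial, devices in devices_by_serial.items():
        if is_suspect_serial(serial) or len(devices) > 1:
            report[serial] = sorted(devices, key=str)
    return report
```

Any serial that appears in the report is a poor basis for an allow or block rule. For those devices, conditions on `device.vendor_id` and `device.product_id` are the fallback the doc recommends.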
These two situations are completely different. Should a single MD doc cover them both or is it better to break this doc up? I'm holding off on reading this file until this is answered.
I have no personal preference, I'm just bringing this up in case it helps with context windows.