On a system running Naemon Core 1.3.0 we ran into an issue where Naemon stops executing checks. No worker processes were left. I did not see anything suspicious in the system journal or dmesg: no SIGSEGV and no oom_killer activity.
Snippet from the Naemon log (host and service names anonymized):
[1677024519] Warning: Check of host 'myhost' did not exit properly!
[1677024519] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024520] wproc: Socket to worker Core Worker 4261 broken, removing
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] HOST ALERT: myhost;DOWN;SOFT;3;CRITICAL - 10.0.0.63: rta nan, lost 100%
[1677024521] Warning: Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] Warning: Check of host 'myhost' did not exit properly!
[1677024521] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024521] Warning: Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024521] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4258 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4258 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4260 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4260 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4259 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4259 broken, removing
[1677024526] Warning: Check of host 'myhost' did not exit properly!
[1677024526] HOST ALERT: myhost;DOWN;SOFT;3;(Host check did not exit properly)
[1677024526] wproc: nm_bufferqueue_read() from Core Worker 4257 returned -1: Connection reset by peer
[1677024526] wproc: Socket to worker Core Worker 4257 broken, removing
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
Independent of the root cause of the broken Core Worker processes, I think Naemon should respawn Core Worker processes whenever none are left or fewer than desired are running.
This also happens with a manual installation of the current master branch, Naemon Core 1.4.1.g2916d626.20230223.
Found this to reproduce the issue.
After looking into the source code, I expected to hit the following if condition, but that doesn't happen:
    if (workers.len <= 0) {
        /* there aren't global workers left, we can't run any more checks
         * we should try respawning a few of the standard ones
         */
        nm_log(NSLOG_RUNTIME_ERROR, "wproc: All our workers are dead, we can't do anything!");
    }
I will provide a fix for the missing respawn via a pull request.
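To illustrate the behaviour I have in mind, here is a minimal, self-contained sketch. It is not the actual fix and does not use Naemon's real data structures: `struct worker_pool`, `pool_remove`, `pool_respawn_missing` and `fake_spawn` are hypothetical names made up for this example, and the spawn step is a callback so the logic can be tried without forking real processes.

```c
#include <stddef.h>

#define MAX_WORKERS 16

/* Hypothetical, simplified stand-in for a worker list (not Naemon's struct). */
struct worker_pool {
	int pids[MAX_WORKERS]; /* pids of live workers, packed at the front */
	size_t len;            /* number of live workers */
	size_t desired;        /* number of workers that should be running */
};

/* Callback that starts one worker and returns its pid, or -1 on failure. */
typedef int (*spawn_fn)(void);

/* Fake spawner for demonstration: hands out increasing pids, never fails. */
int next_fake_pid = 5000;
int fake_spawn(void)
{
	return next_fake_pid++;
}

/* Drop a worker whose socket broke. Returns 0 if found, -1 otherwise. */
int pool_remove(struct worker_pool *p, int pid)
{
	for (size_t i = 0; i < p->len; i++) {
		if (p->pids[i] == pid) {
			/* Move the last entry into the freed slot to keep the list packed. */
			p->pids[i] = p->pids[p->len - 1];
			p->pids[p->len - 1] = 0;
			p->len--;
			return 0;
		}
	}
	return -1;
}

/* Start workers until len reaches desired. Returns how many were started. */
size_t pool_respawn_missing(struct worker_pool *p, spawn_fn spawn)
{
	size_t started = 0;

	while (p->len < p->desired && p->len < MAX_WORKERS) {
		int pid = spawn();
		if (pid <= 0)
			break; /* spawning failed, leave it to a later retry */
		p->pids[p->len++] = pid;
		started++;
	}
	return started;
}
```

The idea would be to call something like `pool_respawn_missing()` right after a broken worker has been removed, so the pool is topped up even when only some of the workers died, rather than only reacting once the list is completely empty.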
The if condition quoted above is from naemon-core/src/naemon/workers.c, lines 431 to 436 in 2916d62.