Naemon stops executing checks and doesnt respawn Core Worker processes

On a system running `Naemon Core 1.3.0` we ran into the issue, that naemon stops executing checks. There were no more worker processes. I have not seen anything suspicious in the system-journal or dmesg. No `SIGSEGV` or `oom_killer` in action.


Log snippet of the Naemon log (host and servicenames anonymized):
```
[1677024519] Warning:  Check of host 'myhost' did not exit properly!
[1677024519] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024520] wproc: Socket to worker Core Worker 4261 broken, removing
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] HOST ALERT: myhost;DOWN;SOFT;3;CRITICAL - 10.0.0.63: rta nan, lost 100%
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] Warning:  Check of host 'myhost' did not exit properly!
[1677024521] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024521] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4258 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4258 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4260 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4260 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4259 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4259 broken, removing
[1677024526] Warning:  Check of host 'myhost' did not exit properly!
[1677024526] HOST ALERT: myhost;DOWN;SOFT;3;(Host check did not exit properly)
[1677024526] wproc: nm_bufferqueue_read() from Core Worker 4257 returned -1: Connection reset by peer
[1677024526] wproc: Socket to worker Core Worker 4257 broken, removing
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
```

Independent of the root cause of the broken Core Worker processes, i think naemon should respawn the Core Worker processes, if there are no processes or less than desired.

This also happens with a manual installation with the actual version of the master branch `Naemon Core 1.4.1.g2916d626.20230223`.

Found [this](https://gist.github.com/nook24/9ce03301fbc88c192ea775eb07357458) to reproduce the issue.

After looking into the source code i expected to hit the following if condition which doesnt happen:
https://github.com/naemon/naemon-core/blob/2916d6264b5dd6bbd1c0cb911349cdfc723eada5/src/naemon/workers.c#L431-L436

I will provide a fix for the respawning thing via a pull request.



	if (workers.len <= 0) {
	/* there aren't global workers left, we can't run any more checks
	* we should try respawning a few of the standard ones
	*/
	nm_log(NSLOG_RUNTIME_ERROR, "wproc: All our workers are dead, we can't do anything!");
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Naemon stops executing checks and doesnt respawn Core Worker processes #418

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Naemon stops executing checks and doesnt respawn Core Worker processes #418

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions