Run periodic health checks on a service to automatically stop or start BIRD advertisements.
Inspired by birdwatcher by Skoef.
Here is a demo of a Grafana dashboard showing the hysteresis of a service. The green curve shows when the service is considered active. The yellow curve shows a more detailed view: the internal state of birdwatcher-rs.
[generated_file]
path = "birdwatcher_generated.conf"
[bird_reload]
command = ["birdc", "configure"]
timeout_s = 2
[[service_definitions]]
service_name = "first_service"
function_name = "my_service_is_ok"
command = ["my_service_check.sh"]
command_timeout_s = 1
interval_s = 5
fall = 5
rise = 8

- At first, the service is down (green curve at zero).
- Then, the function returns SUCCESS two times in a row, then FAIL, then SUCCESS 3 times in a row (yellow curve). This does not affect the service, as we need 8 SUCCESS to `rise` the service (see config above).
- The function returns SUCCESS more than 8 times in a row (yellow curve going up to 1), so the service becomes up (green curve at 1).
- The `generated_file` is regenerated with `return true` and the `bird_reload` command (typically `birdc configure`) is called.
- After some time, the function starts failing. Note that the yellow curve goes down faster than it goes up, because it only needs 5 FAIL to `fall`.
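The rise/fall behaviour described above can be sketched in a few lines of Rust. This is a minimal model, not the actual birdwatcher-rs code: the `Check` and `Hysteresis` names are hypothetical, it counts consecutive results rather than reproducing the exact shape of the yellow curve, and it ignores the special case of the very first check.

```rust
/// Outcome of one health-check run.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Check {
    Success,
    Fail,
}

/// Hypothetical rise/fall tracker (not the real birdwatcher-rs type).
#[derive(Debug)]
pub struct Hysteresis {
    rise: u32,
    fall: u32,
    up: bool,
    /// Number of consecutive results that disagree with the current state.
    streak: u32,
}

impl Hysteresis {
    /// The service starts down, as in the demo.
    pub fn new(rise: u32, fall: u32) -> Self {
        Self { rise, fall, up: false, streak: 0 }
    }

    /// Records one check result and returns whether the service is now up.
    pub fn record(&mut self, check: Check) -> bool {
        let disagrees = matches!(
            (self.up, check),
            (false, Check::Success) | (true, Check::Fail)
        );
        if disagrees {
            self.streak += 1;
            // A down service needs `rise` successes, an up service `fall` failures.
            let needed = if self.up { self.fall } else { self.rise };
            if self.streak >= needed {
                self.up = !self.up;
                self.streak = 0;
            }
        } else {
            // A result that confirms the current state resets the streak.
            self.streak = 0;
        }
        self.up
    }
}
```

With `Hysteresis::new(8, 5)`, the sequence SUCCESS, SUCCESS, FAIL, SUCCESS, SUCCESS, SUCCESS leaves the service down, and it only flips to up once 8 SUCCESS have been seen in a row.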
To reproduce, use the docker compose file in the test_tools folder:
cd test_tools
docker compose up --abort-on-container-exit
OTEL_METRIC_EXPORT_INTERVAL=1 BIRDWATCHER_TRACE_LEVEL=info BIRDWATCHER_LOG_LEVEL=birdwatcher_daemon=debug,info nix run .# -- --config example/birdwatcher_random.conf

Quick start:
nix run github:pixelshot91/birdwatcher-rs -- --help
To install it on NixOS, see integration_test/single_service.nix
- Install Rust
cargo run -- --config my_config.toml
The configuration file uses TOML.
[generated_file]
path = "birdwatcher_generated.conf"
[bird_reload]
command = ["birdc", "configure"]
timeout_s = 2
[[service_definitions]]
service_name = "webserver is up"
function_name = "webserver_is_active"
command = ["/bin/curl", "http://localhost:8000/"]
command_timeout_s = 2
interval_s = 1
fall = 1
rise = 3
[[service_definitions]]
service_name = "My important file exist"
function_name = "file_exist"
command = ["/bin/ls", "/root/my_file.txt"]
command_timeout_s = 1
interval_s = 5
fall = 1
rise = 3

By default, telemetry is sent to the endpoint http://localhost:4317 using the gRPC protocol.
You can change the endpoint by setting the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
https://docs.rs/opentelemetry-otlp/0.30.0/opentelemetry_otlp/#constants
You can define Resources with the OTEL_RESOURCE_ATTRIBUTES env var.
Example: OTEL_RESOURCE_ATTRIBUTES='service.name=my_service_name'
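birdwatcher-rs leaves the handling of these variables to the OpenTelemetry SDK. The sketch below only models the behaviour described here, to show the expected shape of the values: a fallback endpoint, and comma-separated `key=value` resource attributes. Both function names are hypothetical, and the parsing is simplified (no percent-decoding).

```rust
/// Returns the OTLP endpoint to use, falling back to the default
/// gRPC endpoint when the environment variable is not set.
pub fn otlp_endpoint(env_value: Option<&str>) -> &str {
    env_value.unwrap_or("http://localhost:4317")
}

/// Parses a `key1=value1,key2=value2` string into pairs.
/// Entries without `=` or with an empty key are skipped.
pub fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|entry| entry.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .filter(|(key, _)| !key.is_empty())
        .collect()
}
```

Several attributes can be given at once by separating them with commas.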
| Name | Type | Unit | Description |
|---|---|---|---|
| birdwatcher_service_up | gauge | | 0 = The service is down. 1 = The service is up. |
| birdwatcher_service_hysteresis_state | gauge | | Like service_up, but more detailed. It aggregates the results of the last function return values. It can take intermediate values between 0 and 1 for a failed service rising, or a successful service failing. |
| birdwatcher_function_return_value | gauge | | Return value of a function. |
Example of metrics produced with example/birdwatcher_random.conf, extracted from the Live debugging view of prometheus.remote_write.local:
metadata: labels={__name__="birdwatcher_service_up"}, type="gauge", unit="", help="0 = The service is down. 1 = The service is up"
sample: ts=1772754547924, labels={__name__="birdwatcher_service_hysteresis_state", job="unknown_service", service="random3_service"}, value=1.000000
metadata: labels={__name__="birdwatcher_service_hysteresis_state"}, type="gauge", unit="", help="Like service_up, but more detailed. It aggregates the result the last function_return value.\n It can take intermediate values between 0 and 1 for a failed service raising, or a successful service failing"
sample: ts=1772754547924, labels={__name__="birdwatcher_function_return_value", job="unknown_service", service="random3_service"}, value=1.000000
metadata: labels={__name__="birdwatcher_function_return_value"}, type="gauge", unit="", help="Return value of a function."
sample: ts=1772754547929, labels={__name__="target_info", job="unknown_service", telemetry_sdk_language="rust", telemetry_sdk_name="opentelemetry", telemetry_sdk_version="0.30.0"}, value=1.000000
Metrics are exported every 60s by default. This can be changed with OTEL_METRIC_EXPORT_INTERVAL.
Both traces and logs can be filtered with the following env vars: BIRDWATCHER_TRACE_LEVEL, BIRDWATCHER_LOG_LEVEL
Level can be: error, warn, info, debug, trace
Levels are defined here: https://docs.rs/tracing-core/0.1.35/tracing_core/metadata/struct.Level.html#implementations
Example: BIRDWATCHER_LOG_LEVEL=info
By default, only logs of level error are shown.
Warning: BIRDWATCHER_LOG_LEVEL sets the log level for birdwatcher-rs and all its dependencies (mainly the telemetry SDK).
If you enable debug or trace with BIRDWATCHER_LOG_LEVEL=debug/trace, you will be flooded with logs generated not by birdwatcher-rs itself, but by its dependencies.
If you only want to control the logs of birdwatcher-rs, use the directive BIRDWATCHER_LOG_LEVEL=birdwatcher_daemon=debug
More info about EnvFilter directive: https://docs.rs/tracing-subscriber/0.3.19/tracing_subscriber/filter/struct.EnvFilter.html#directives .
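To make the effect of a directive such as `birdwatcher_daemon=debug,info` more concrete, here is a simplified model of how it is interpreted. It is not the real `EnvFilter` (which also supports spans and fields); `SimpleFilter` is a hypothetical name, and the sketch assumes that a target with no matching directive and no default level is disabled.

```rust
/// Log levels, ordered from least to most verbose.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Simplified stand-in for an EnvFilter: `target=level` pairs plus
/// an optional bare `level` acting as the default.
#[derive(Debug, Default)]
pub struct SimpleFilter {
    default: Option<Level>,
    targets: Vec<(String, Level)>,
}

impl SimpleFilter {
    /// Parses comma-separated directives; invalid ones are ignored.
    pub fn parse(directives: &str) -> Self {
        let mut filter = SimpleFilter::default();
        for directive in directives.split(',') {
            match directive.split_once('=') {
                Some((target, level)) => {
                    if let Some(level) = parse_level(level) {
                        filter.targets.push((target.trim().to_string(), level));
                    }
                }
                None => {
                    if let Some(level) = parse_level(directive) {
                        filter.default = Some(level);
                    }
                }
            }
        }
        filter
    }

    /// The longest matching target prefix wins, then the default level.
    pub fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self
            .targets
            .iter()
            .filter(|(prefix, _)| target.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, level)| *level)
            .or(self.default);
        match max {
            Some(max) => level <= max,
            None => false,
        }
    }
}
```

With `birdwatcher_daemon=debug,info`, debug events from `birdwatcher_daemon` pass, while dependencies are limited to info and above.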
Small improvements over Skoef's birdwatcher:
- birdwatcher-rs does not link a service with a list of IP addresses like birdwatcher does.
- birdwatcher-rs only generates a function returning a simple true or false, whereas birdwatcher forces the use of a BIRD filter, which might prevent some configurations, and caused some problems. See skoef/birdwatcher#28 and skoef/birdwatcher#25
- I tried to make the birdwatcher-rs code simpler. The only moving part is a list of values representing the service states.
- birdwatcher-rs enables the service if the first check is successful, whatever the rise value is.
- It uses Rust, which I find cleaner than Go, and whose features such as generics and sum types prevent entire classes of bugs, such as null-pointer exceptions.
- birdwatcher-rs supports Nix, for easy and reproducible builds. It includes an integration test running birdwatcher-rs in a VM with the NixOS test framework.
To build from source:
nix build
result/bin/birdwatcher-rs --config my_config.toml
For the daemon and the client to communicate, I need an RPC framework.
I would like it to be, in order of importance:
- Correct: The de/serialization code should be generated. Function names should not be matched as strings, but be part of a Trait. An error in a function name, return type, parameter count or parameter type should be a compile-time error.
- Thread-safe: Although I do not expect many clients to query the daemon, I still want it to be resilient against concurrent queries. It should not be possible for a client to see an answer intended for another client. The framework should link each query/response automatically.
- Maintained: A high number of contributors, frequent releases, and no unanswered issues.
Not very useful:
- High performance: It will be used as a monitoring tool, with at the very most one query per second.
- Inter-language: The client could be written in something other than Rust.
Options:
- tonic: Seems to be the most popular, but requires redefining all structs in protobuf. A one-person project, tonic-rpc, exists to automatically generate the protobuf from the Rust code, but has been unmaintained since February 2023.
- tarpc: Generates messages directly from Rust code, but shows little activity. Weird dependency on opentelemetry.
- jsonrpsee: Functions are not well-typed: the function name is a string, and parameter types and count are checked at runtime.
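As an illustration of the "correct" and "thread-safe" criteria, here is a hand-written sketch of what a well-typed RPC surface could look like. It is not tied to any of the frameworks listed and is not the birdwatcher-rs protocol: every name (`Insight`, `Request`, `Response`, `Envelope`, `StaticDaemon`) is hypothetical. The point is that a typo in a method name or a wrong parameter type fails to compile, and that each response carries the id of its request.

```rust
/// Hypothetical state of one service, as exposed to clients.
#[derive(Clone, Debug, PartialEq)]
pub struct ServiceStatus {
    pub name: String,
    pub up: bool,
}

/// Hypothetical RPC surface: methods are part of a trait, not strings.
pub trait Insight {
    fn status(&self, name: &str) -> Option<ServiceStatus>;
    fn list_services(&self) -> Vec<String>;
}

/// One variant per method; parameters are typed fields.
#[derive(Debug, PartialEq)]
pub enum Request {
    Status { name: String },
    ListServices,
}

#[derive(Debug, PartialEq)]
pub enum Response {
    Status(Option<ServiceStatus>),
    ListServices(Vec<String>),
}

/// Wraps a message with an id so a response can be matched to its query.
#[derive(Debug, PartialEq)]
pub struct Envelope<T> {
    pub id: u64,
    pub body: T,
}

/// The `match` is exhaustive: adding a `Request` variant without
/// handling it here is a compile-time error.
pub fn dispatch(service: &impl Insight, req: Envelope<Request>) -> Envelope<Response> {
    let body = match req.body {
        Request::Status { name } => Response::Status(service.status(&name)),
        Request::ListServices => Response::ListServices(service.list_services()),
    };
    Envelope { id: req.id, body }
}

/// Tiny in-memory implementation used to exercise `dispatch`.
pub struct StaticDaemon(pub Vec<ServiceStatus>);

impl Insight for StaticDaemon {
    fn status(&self, name: &str) -> Option<ServiceStatus> {
        self.0.iter().find(|s| s.name == name).cloned()
    }

    fn list_services(&self) -> Vec<String> {
        self.0.iter().map(|s| s.name.clone()).collect()
    }
}
```

A framework meeting the criteria above would generate the `Request`, `Response` and `dispatch` parts from the trait, instead of them being written by hand.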