update older branch by vince-weka · Pull Request #197 · weka/wekachecker

vince-weka · 2025-09-02T15:21:10Z

No description provided.

Raise a warning if the current NFSW FD usage is >=90% of the configured maximum.
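A minimal sketch of the 90% threshold described above. How the current and maximum FD counts are obtained is not shown in this PR, so they are plain arguments here; the function name is hypothetical.

```python
def fd_usage_warning(used: int, maximum: int, threshold: float = 0.90):
    """Return a warning string if used/maximum is at or above threshold, else None."""
    if maximum <= 0:
        # No usable maximum configured: nothing meaningful to compare against.
        return None
    fraction = used / maximum
    if fraction >= threshold:
        return f"WARN: FD usage {used}/{maximum} ({fraction:.0%}) is at or above {threshold:.0%}"
    return None
```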

- Provide IPv6 support in FIPs sanity, mgmt IP, netmask and SBR checks
- Remove jq dependency in NATS check

IPv6 updates and code cleanup

Query NFSW FD usage

Clusters with 0 hot spares can be configured to allocate and use ~100% of SSD capacity. If a failure domain is lost, they lose a proportion of FS space (known as shrinkage). The proportion lost depends on the number of data disks, and if "too much" is lost then writes can fail with ENOSPC. Essentially, we should warn customers who have no hot spares configured.

Add a very basic capacity check in the case of 0 hot-spare clusters

In https://wekaio.atlassian.net/browse/WEKAPP-482528 we saw that the link speed was lower than expected, but there were no warnings. We should check that. The only plausible way I can find of doing this is by parsing the text-based output of ethtool. Until jq and "ethtool --json" get everywhere, or the kernel interface to ethtool-netlink is exposed in /sys, I can't see any other way of doing it. :(
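A minimal sketch of the text-parsing approach described above. It assumes the usual `ethtool <iface>` layout, with a `Speed: <N>Mb/s` line and a `Supported link modes:` section that may wrap over several lines. The function names are hypothetical and are not taken from wekachecker.

```python
import re


def parse_link_speeds(ethtool_output: str):
    """Return (current_mbps, max_supported_mbps) parsed from `ethtool <iface>` text.

    Either value is None if it cannot be found (e.g. "Speed: Unknown!").
    """
    current = None
    match = re.search(r"^\s*Speed:\s*(\d+)Mb/s", ethtool_output, re.MULTILINE)
    if match:
        current = int(match.group(1))

    # The "Supported link modes:" section continues on indented lines until
    # the next "Key: value" line, so track whether we are still inside it.
    supported = []
    in_section = False
    for line in ethtool_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported link modes:"):
            in_section = True
            stripped = stripped.split(":", 1)[1]
        elif in_section and ":" in stripped:
            in_section = False
        if in_section:
            # "100000baseKR4/Full" -> 100000
            supported += [int(s) for s in re.findall(r"(\d+)base", stripped)]
    return current, (max(supported) if supported else None)


def link_speed_warning(ethtool_output: str):
    """Return a warning string if the current speed is below the supported maximum."""
    current, maximum = parse_link_speeds(ethtool_output)
    if current is None or maximum is None:
        return None
    if current < maximum:
        return f"WARN: link speed {current}Mb/s is below supported maximum {maximum}Mb/s"
    return None
```

In practice the text would come from something like `subprocess.run(["ethtool", iface], capture_output=True, text=True).stdout`; parsing is kept separate here so it can be exercised without a NIC.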

We don't want weka cluster buckets becoming too full or too imbalanced. Ordinarily the RAID stripe allocation takes care of this for us, but in at least one case (https://wekaio.atlassian.net/browse/WEKAPP-488736) network interruptions led to us not being able to find free stripes. This in turn led to buckets becoming full and thus FS writes stalling.

Stupid typo fix

…o proceed

Add a check to examine bucket fill levels

Basic RDMA errors check as per #weka-platform Slack

Basic checker to compare current NIC link speed with maximum

+3 statistics as per internal slack channel

Check to ensure cluster drives have consistent block sizes, and if not, raise a warning.
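A minimal sketch of the consistency check described above, assuming the block sizes have already been collected into a mapping of drive name to block size in bytes. The function name and the choice to report drives that differ from the most common size are illustrative assumptions.

```python
from collections import Counter


def inconsistent_block_sizes(drive_block_sizes: dict):
    """Return the drives whose block size differs from the most common one.

    An empty dict means all drives agree (or there are no drives).
    """
    counts = Counter(drive_block_sizes.values())
    if len(counts) <= 1:
        return {}
    majority_size, _ = counts.most_common(1)[0]
    return {
        drive: size
        for drive, size in drive_block_sizes.items()
        if size != majority_size
    }
```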

NVME block size check

This was RCA'd down to too many connections to Ganesha, so we should start checking this.

In WEKAPP-502848 we saw that NFS service was failing over

Need to iterate over all compute / drive / frontend containers.

Be consistent in the containers that are iterated.

…iners_in_gateway_check Should exclude dataservice containers from these checks

Update Infiniband LID mismatch check

Previously we assumed that 'ip rule' would print the table name at the very end of the line. Depending on how the rule is added it might not be; it could be e.g.

32764: from 10.31.62.119 lookup 100 proto static

Therefore capture "the word after 'lookup'" instead.
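The "word after 'lookup'" idea can be sketched as below. This is an illustration only; the function name is hypothetical and the real check may be implemented differently.

```python
def lookup_table(rule_line: str):
    """Return the routing table named after 'lookup' in an `ip rule` line, or None."""
    words = rule_line.split()
    # Stop one short of the end so words[i + 1] always exists.
    for i, word in enumerate(words[:-1]):
        if word == "lookup":
            return words[i + 1]
    return None
```

With this approach a trailing `proto static` no longer affects which table is reported.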

Measuring standard deviation of percentage thresholds is a bit silly, and at least for things like goodput, we want just a straightforward threshold value.

Allow the lookup table to be anywhere in the output of 'ip rule'

…alues Processes which are disconnected return 0 - there are other checks which are more useful for this, so ignore the 0 figure from stats for this test.

Cst jack update statistical outliers

Added basic check for LeaderIterationTooSlow as per IPM check

Check for consistent hugepages allocation between the same containers.

Container memory consistency check

Additionally check for core allocation consistency

Add WARN verbiage.

Update container resource check

…ng, and it's safer to leave it running

In WEKAPP-524442 we saw a performance problem with Mellanox cards because the ib_uverbs module wasn't loaded due to kernel version changes. This prevented DPDK initialising properly.
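A minimal sketch of a module-loaded test, assuming the check is based on the contents of /proc/modules (the PR does not say how it is done). The function name is hypothetical. Note that modules built into the kernel do not appear in /proc/modules, so a real check may need to look elsewhere as well.

```python
def module_loaded(name: str, proc_modules_text: str) -> bool:
    """True if `name` is the first field of any line of /proc/modules content."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )
```

Typical use would be `module_loaded("ib_uverbs", open("/proc/modules").read())`, run only on hosts where Mellanox cards were found.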

Cst jack is ibuverbs loaded

Perform net port consistency checks for all backend compute / drive / frontend processes.

Basic OBS connectivity checks:
- DNS resolves
- IP/Port endpoint socket reachable
- Outlier RTT endpoints?
- OBS Server error codes past 30 minutes?
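The first two checks in the list above (DNS resolution and socket reachability) could look roughly like this, with the connect time standing in as a crude RTT figure. The function name, the timeout, and the return shape are assumptions for illustration.

```python
import socket
import time


def check_endpoint(host: str, port: int, timeout: float = 3.0):
    """Return (ok, detail): detail is the connect time in seconds, or an error string."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError as exc:
        return False, f"connect failed: {exc}"
```

Collecting the connect times for every OBS endpoint would give the raw data for an outlier comparison; that part is not sketched here.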

In WEKAPP-527332 we saw that a full loopback XFS filesystem at /opt/weka/data/drives0/container/reserved.loop prevented many things (such as the traces, nginx for the API) from starting, with almost *no* error messages whatsoever.
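A minimal sketch of a free-space test for a mounted filesystem, assuming the caller already knows where the loopback filesystem is mounted. The 10% default threshold and the function name are assumptions, not values taken from this PR.

```python
import shutil


def low_free_space(mount_point: str, min_free_fraction: float = 0.10):
    """Return a warning string if the filesystem at mount_point is nearly full."""
    usage = shutil.disk_usage(mount_point)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    if free_fraction < min_free_fraction:
        return f"WARN: {mount_point} has only {free_fraction:.1%} free space"
    return None
```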

Create basic OBS connectivity checks

Ensure sufficient freespace in loopback filesystems exist

…own-issues

In August 2025 we saw that -- for reasons unknown -- docs (and thus TA Tool) recommended arp_ignore=0. This is unwise as it allows the Linux kernel to respond to ARP requests for NIC A via NIC B, which has a habit of getting switches to learn the wrong MAC Address for their CAM tables.
https://wekaio.slack.com/archives/C05QSG8HCF9/p1755274686557269
https://wekaio.slack.com/archives/C03E96QFM33/p1755509369672719
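A minimal sketch of how the current setting could be inspected, assuming the standard sysctl layout under /proc/sys/net/ipv4/conf. The kernel uses the larger of the `all` and per-interface values, which is why both are read. The function names are hypothetical.

```python
from pathlib import Path

DEFAULT_CONF_DIR = "/proc/sys/net/ipv4/conf"


def effective_arp_ignore(iface: str, conf_dir: str = DEFAULT_CONF_DIR) -> int:
    """Return max(conf/all/arp_ignore, conf/<iface>/arp_ignore)."""
    def read(name: str) -> int:
        return int((Path(conf_dir) / name / "arp_ignore").read_text().strip())
    return max(read("all"), read(iface))


def interfaces_answering_for_other_nics(conf_dir: str = DEFAULT_CONF_DIR):
    """List interfaces whose effective arp_ignore is 0."""
    names = sorted(p.parent.name for p in Path(conf_dir).glob("*/arp_ignore"))
    return [
        name
        for name in names
        if name not in ("all", "default")
        and effective_arp_ignore(name, conf_dir) == 0
    ]
```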

Move from arp_ignore=0 to arp_ignore=1

In WEKAPP-501926 we had a repeat Sev1 of WEKAPP-429276 - check for kn…

Update network mode test

vrragosta and others added 30 commits April 15, 2025 18:00

Query NFSW FD usage

aea9f3c

Raise a warning if the current NFSW FD usage is >=90% of the configured maximum.

IPv6 updates and code cleanup

904b177

- Provide IPv6 support in FIPs sanity, mgmt IP, netmask and SBR checks
- Remove jq dependency in NATS check

Minor typo fixes

4301aef

Merge pull request #172 from weka/cst_vragosta_ipv6

3814fb6

IPv6 updates and code cleanup

Merge pull request #167 from weka/cst_vragosta_nfsw_fds

a0591e4

Query NFSW FD usage

Improve wording, fix typos

cadc0d6

Merge pull request #173 from weka/cst_jack_check_hot_spare_capacity

8485bf3

Add a very basic capacity check in the case of 0 hot-spare clusters

Stupid typo fix

19d6534

Merge pull request #175 from weka/cstjack-typo

01f6b25

Stupid typo fix

Clumsily avoid divide by zero risk while still allowing calculation t…

10f2dfe

…o proceed

Merge pull request #176 from weka/cst_jack_check_bucket_fill_levels

400e068

Add a check to examine bucket fill levels

Basic RDMA errors check as per #weka-platform Slack

65b2b4d

Merge pull request #177 from weka/cst_jack_rdma_network_errors

956c368

Basic RDMA errors check as per #weka-platform Slack

Check --stable instead as they're the actually in-use resources

67a8b56

Merge pull request #174 from weka/cst_jack_check_backend_link_speeds

787a6d1

Basic checker to compare current NIC link speed with maximum

+3 statistics as per internal slack channel

943bdbc

Merge pull request #178 from weka/cst_jack_rdma_network_errors

22beea8

+3 statistics as per internal slack channel

NVME block size check

fc70123

Check to ensure cluster drives have consistent block sizes, and if not, raise a warning.

Merge pull request #179 from weka/cst_vragosta_nvme_bs

b47322d

NVME block size check

In WEKAPP-502848 we saw that NFS service was failing over

edcbd94

This was RCA'd down to too many connections to Ganesha, so we should start checking this.

Merge pull request #180 from weka/cst_jack_count_nfs_connections

173f532

In WEKAPP-502848 we saw that NFS service was failing over

Update Infiniband LID mismatch check

97db8c1

Need to iterate over all compute / drive / frontend containers.

Update Infiniband LID mismatch check

d502a81

Be consistent in the containers that are iterated.

Should exclude dataservice containers from these checks

ed96851

Merge pull request #182 from weka/cst_jack_skip_dataserv_and_s3_conta…

500dbcd

…iners_in_gateway_check Should exclude dataservice containers from these checks

Merge pull request #181 from weka/cst_vragosta_infiniband_lid_mismatch

8334ea6

Update Infiniband LID mismatch check

Add a check for OBS targets in default configuration as per IPM request

8a57ff6

jackchallen and others added 27 commits June 25, 2025 10:34

Move the percentage-based threshold tests into their own test

b04580a

Measuring standard deviation of percentage thresholds is a bit silly, and at least for things like goodput, we want just a straightforward threshold value.

Merge pull request #186 from weka/cst_jack_fix_up_sbr_rule_check

cbda3d4

Allow the lookup table to be anywhere in the output of 'ip rule'

Filter disconnected processes from reporting by ignoring zeroed-out v…

2aed212

…alues Processes which are disconnected return 0 - there are other checks which are more useful for this, so ignore the 0 figure from stats for this test.

Merge pull request #185 from weka/cst_jack_update_statistical_outliers

9a24685

Cst jack update statistical outliers

Ignore errors about not being registered, and only check the last day

bdec9dc

Merge pull request #184 from weka/cst_jack_leader_iteration_too_slow

3824f13

Added basic check for LeaderIterationTooSlow as per IPM check

Container memory consistency check

2227f26

Check for consistent hugepages allocation between the same containers.

Merge pull request #187 from weka/cst_vragosta_container_mem_alloc

917b127

Container memory consistency check

Update container resource check

a56333a

Additionally check for core allocation consistency

Update container resource check

0e52fa9

Add WARN verbiage.

Merge pull request #188 from weka/cst_vragosta_container_mem_alloc

82fd90e

Update container resource check

We should run this on every host

02905ca

We should not stop mst, because we don't know if it was already runni…

ba88561

…ng, and it's safer to leave it running

Check for if ib_uverbs is loaded if Mellanox cards are found

e60d1c8

In WEKAPP-524442 we saw a performance problem with Mellanox cards because the ib_uverbs module wasn't loaded due to kernel version changes. This prevented DPDK initialising properly.

Merge pull request #189 from weka/cst_jack_is_ibuverbs_loaded

4f30544

Cst jack is ibuverbs loaded

Update network mode test

e0e80dc

Perform net port consistency checks for all backend compute / drive / frontend processes.

Create basic OBS connectivity checks

d70f611

Basic OBS connectivity checks:
- DNS resolves
- IP/Port endpoint socket reachable
- Outlier RTT endpoints?
- OBS Server error codes past 30 minutes?

Merge pull request #191 from weka/cst_vragosta_obs_con_sanity

350d26b

Create basic OBS connectivity checks

Merge pull request #192 from weka/cst_jack_check_for_full_loopback

8377d78

Ensure sufficient freespace in loopback filesystems exist

In WEKAPP-501926 we had a repeat Sev1 of WEKAPP-429276 - check for kn…

51d2c2b

…own-issues

Merge pull request #195 from weka/cst/jack/2025-08-18_fix_sbr_arp_ignore

f2f4f9c

Move from arp_ignore=0 to arp_ignore=1

Merge pull request #193 from weka/cst/jack/empty_acl_segfault

7efc0be

In WEKAPP-501926 we had a repeat Sev1 of WEKAPP-429276 - check for kn…

add 460_ to default

cffbc67

Merge pull request #190 from weka/cst_vragosta_netmode

21c374e

Update network mode test

vince-weka closed this Sep 2, 2025

vince-weka reopened this Sep 2, 2025

vince-weka merged commit de2cd06 into vince/2025-04-25 Sep 2, 2025
5 checks passed

vince-weka merged 62 commits into vince/2025-04-25 from master

vince-weka commented Sep 2, 2025


3 participants
