
update older branch #197

Merged
vince-weka merged 62 commits into vince/2025-04-25 from master
Sep 2, 2025

Conversation

@vince-weka
Contributor

No description provided.

vrragosta and others added 30 commits April 15, 2025 18:00
Raise a warning if the current NFSW FD usage is >=90% of the configured maximum.
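A minimal sketch of the 90% threshold test described above. How the open and maximum FD counts are collected is not shown in this PR, so `open_fds` and `max_fds` are assumed inputs, and the function name is hypothetical.

```python
from typing import Optional


def fd_usage_warning(open_fds: int, max_fds: int,
                     threshold_pct: float = 90.0) -> Optional[str]:
    """Return a warning when FD usage is >= threshold_pct of max_fds."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    usage_pct = 100.0 * open_fds / max_fds
    if usage_pct >= threshold_pct:
        return (f"WARN: FD usage {usage_pct:.1f}% "
                f"({open_fds}/{max_fds}) >= {threshold_pct:.0f}%")
    return None  # usage is below the threshold
```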
- Provide IPv6 support in FIPs sanity, mgmt IP, netmask and SBR checks

- Remove jq dependency in NATS check
IPv6 updates and code cleanup
Clusters with 0 hot-spare can potentially be configured to allocate
and use ~100% of SSD capacity. In the case of loss of a failure
domain they'll lose a proportion of FS space (known as shrinkage).
The proportion lost is dependent on the number of data disks, and
if "too much" is lost then writes can fail with ENOSPC.

Essentially, we should warn customers without hot spares configured
Add a very basic capacity check in the case of 0 hot-spare clusters
In https://wekaio.atlassian.net/browse/WEKAPP-482528 we saw that
the link speed was lower than expected, but there were no warnings.
We should check that.

The only plausible way I can find of doing this is by parsing the
text-based output of ethtool. Until jq and "ethtool --json" are
available everywhere, or the kernel interface to ethtool-netlink is
exposed in /sys, I can't see any other way of doing it. :(
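A rough sketch of parsing the text output, written in Python for illustration rather than as the check in this PR. It assumes the usual `ethtool <iface>` layout: a "Supported link modes:" block with continuation lines, and a "Speed: NNNNMb/s" line. `SAMPLE` is made-up output and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative output only; real output varies by driver and ethtool version.
SAMPLE = """Settings for eth0:
    Supported ports: [ FIBRE ]
    Supported link modes:   10000baseT/Full
                            25000baseCR/Full
                            100000baseCR4/Full
    Supported pause frame use: Symmetric
    Speed: 25000Mb/s
    Duplex: Full
"""


def parse_ethtool(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (current_speed, max_supported_speed) in Mb/s, None if unknown."""
    current = None
    supported = []
    in_supported = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported link modes:"):
            in_supported = True
            stripped = stripped.split(":", 1)[1]
        elif ":" in stripped:
            in_supported = False  # any other "Key: value" line ends the block
        if in_supported:
            supported += [int(s) for s in re.findall(r"(\d+)base", stripped)]
        match = re.match(r"Speed:\s*(\d+)Mb/s", stripped)
        if match:
            current = int(match.group(1))
    return current, (max(supported) if supported else None)


def link_speed_warning(text: str) -> Optional[str]:
    current, maximum = parse_ethtool(text)
    if current is None or maximum is None:
        return "WARN: could not determine link speed"
    if current < maximum:
        return f"WARN: link speed {current}Mb/s is below maximum {maximum}Mb/s"
    return None
```

A "Speed: Unknown!" line (link down) does not match the pattern, so it falls into the "could not determine" branch.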
We don't want weka cluster buckets becoming too full, or too
imbalanced. Ordinarily the RAID stripe allocation takes care of
this for us, but in at least https://wekaio.atlassian.net/browse/WEKAPP-488736
(for example) network interruptions led to us not being able to
find free stripes. This in turn led to buckets becoming full and
thus FS writes stalling.
Add a check to examine bucket fill levels
Basic RDMA errors check as per #weka-platform Slack
Basic checker to compare current NIC link speed with maximum
+3 statistics as per internal Slack channel
Check to ensure cluster drives have consistent block sizes, and if not, raise a warning.
This was RCA'd down to too many connections to Ganesha, so we
should start checking this.
In WEKAPP-502848 we saw that NFS service was failing over
Need to iterate over all compute / drive / frontend containers.
Be consistent in the containers that are iterated.
…iners_in_gateway_check

Should exclude dataservice containers from these checks
jackchallen and others added 27 commits June 25, 2025 10:34
Previously we assumed that 'ip rule' would output with the table
name at the very end of the line. Depending on how that rule is
added it might not be; it could be e.g.

32764:  from 10.31.62.119 lookup 100 proto static

Therefore capture "the word after 'lookup'" instead.
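The "word after 'lookup'" idea can be sketched with a regular expression. This is an illustration in Python, not the code from the commit; `lookup_table` is a hypothetical name.

```python
import re
from typing import Optional


def lookup_table(rule_line: str) -> Optional[str]:
    """Return the token following 'lookup' in one line of 'ip rule' output."""
    match = re.search(r"\blookup\s+(\S+)", rule_line)
    return match.group(1) if match else None
```

For the example line above this yields `100`, regardless of the trailing `proto static`.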
Measuring standard deviation of percentage thresholds is a bit silly,
and at least for things like goodput, we want just a straightforward
threshold value.
Allow the lookup table to be anywhere in the output of 'ip rule'
…alues

Processes which are disconnected return 0 - there are other checks which
are more useful for this, so ignore the 0 figure from stats for this test.
Added basic check for LeaderIterationTooSlow as per IPM check
Check for consistent hugepages allocation between the same containers.
Additionally check for core allocation consistency
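One way such a consistency check could look, assuming "the same containers" means containers sharing a name across hosts. The `(container, host, value)` row shape is a hypothetical input; the PR does not show how the real check gathers hugepages or core allocations.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def inconsistent_allocations(
    rows: Iterable[Tuple[str, str, int]]
) -> Dict[str, Set[int]]:
    """Group values by container name; return names with more than one value."""
    seen: Dict[str, Set[int]] = defaultdict(set)
    for container, _host, value in rows:
        seen[container].add(value)
    return {name: values for name, values in seen.items() if len(values) > 1}
```

The same function works for hugepages or core counts, since it only compares values within each container name.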
In WEKAPP-524442 we saw a performance problem with Mellanox cards
because the ib_uverbs module wasn't loaded due to kernel version
changes. This prevented DPDK initialising properly.
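A minimal sketch of a "module is loaded" test, assuming the standard `/proc/modules` format in which the module name is the first space-separated field. The function name is hypothetical and the check in this PR may be implemented differently.

```python
def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` appears as a module name in /proc/modules text."""
    return any(
        line.split(" ", 1)[0] == name
        for line in proc_modules_text.splitlines()
        if line
    )
```

On a live host this could be called as `module_loaded("ib_uverbs", open("/proc/modules").read())`.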
Perform net port consistency checks for all backend compute / drive / frontend processes.
Basic OBS connectivity checks:
- DNS resolves
- IP/Port endpoint socket reachable
- Outlier RTT endpoints?
- OBS Server error codes past 30 minutes?
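The first two items in the list above (DNS resolution and socket reachability) could be sketched as follows. This is a Python illustration with hypothetical function names; the connect time it returns is only a crude RTT proxy, and the outlier and error-code items are not covered.

```python
import socket
import time
from typing import List, Optional


def resolve(host: str, port: int) -> List[str]:
    """Return the sorted unique addresses for host, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def connect_rtt(ip: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return TCP connect time in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Checking every resolved address, rather than only the first, is what would make a later per-endpoint RTT comparison possible.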
In WEKAPP-527332 we saw that a full loopback XFS filesystem at
 /opt/weka/data/drives0/container/reserved.loop
prevented many things (such as the traces, nginx for the API) from
starting, with almost *no* error messages whatsoever.
Create basic OBS connectivity checks
Ensure sufficient free space exists in loopback filesystems
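A sketch of a free-space test for a mounted filesystem, using Python's `shutil.disk_usage`. The 10% default threshold and the function names are assumptions for illustration; the PR does not state what threshold the real check uses or where the loopback filesystem is mounted.

```python
import shutil
from typing import Optional


def free_pct(path: str) -> float:
    """Percentage of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # treat a zero-sized filesystem as having no free space
    return 100.0 * usage.free / usage.total


def low_space_warning(path: str, min_free_pct: float = 10.0) -> Optional[str]:
    """Return a warning if free space at `path` is below min_free_pct."""
    pct = free_pct(path)
    if pct < min_free_pct:
        return f"WARN: {path} has only {pct:.1f}% free (< {min_free_pct:.0f}%)"
    return None
```

`path` would be the mount point of the loopback filesystem, which has to be discovered separately.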
In August 2025 we saw that -- for reasons unknown -- docs (and
thus TA Tool) recommended arp_ignore=0. This is unwise as it
allows the Linux kernel to respond to ARP requests for NIC A via
NIC B, which has a habit of getting switches to learn the wrong
MAC Address for their CAM tables.

https://wekaio.slack.com/archives/C05QSG8HCF9/p1755274686557269
https://wekaio.slack.com/archives/C03E96QFM33/p1755509369672719
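A sketch of how interfaces with an effective arp_ignore of 0 might be flagged. It relies on the kernel's documented behaviour that the value used for an interface is the maximum of `conf/all/arp_ignore` and `conf/<iface>/arp_ignore`. The function names are hypothetical, and the sketch makes no claim about which non-zero value the real check expects.

```python
from pathlib import Path
from typing import Dict, List


def read_arp_ignore(conf_dir: str = "/proc/sys/net/ipv4/conf") -> Dict[str, int]:
    """Read arp_ignore for every entry under conf_dir (including 'all')."""
    return {
        entry.name: int((entry / "arp_ignore").read_text().strip())
        for entry in Path(conf_dir).iterdir()
        if (entry / "arp_ignore").is_file()
    }


def interfaces_with_arp_ignore_zero(values: Dict[str, int]) -> List[str]:
    """Return interfaces whose effective arp_ignore, max(all, iface), is 0."""
    all_value = values.get("all", 0)
    return sorted(
        name for name, value in values.items()
        if name not in ("all", "default") and max(all_value, value) == 0
    )
```

On a live host: `interfaces_with_arp_ignore_zero(read_arp_ignore())`.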
In WEKAPP-501926 we had a repeat Sev1 of WEKAPP-429276 - check for kn…
@vince-weka vince-weka closed this Sep 2, 2025
@vince-weka vince-weka reopened this Sep 2, 2025
@vince-weka vince-weka merged commit de2cd06 into vince/2025-04-25 Sep 2, 2025
5 checks passed
3 participants