
Ut#3

Open
asn1809 wants to merge 57 commits into main from ut

Conversation

@asn1809
Owner

@asn1809 asn1809 commented May 2, 2024

No description provided.

Annaraya-Narasagond and others added 30 commits April 10, 2024 12:28
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Cleanup protectedPVCs that are stale
Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
Fixes: RamenDR#1200
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
Also bump github.com/golang/protobuf to 1.5.4

To address related security vulnerability, see:
- https://www.cve.org/CVERecord?id=CVE-2024-24786
- https://groups.google.com/g/golang-announce/c/ArQ6CDgtEjY/m/oLMrdq_GBQAJ

Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
By default, clusteradm installs the latest release. Extract a
BUNDLE_VERSION constant to allow specifying a specific ocm version.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
To test ocm changes, we need to build and push the images to a private
image repository. When deploying we can use the new IMAGE_REGISTRY=
constant to specify the image registry.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
This version[1] pulls ocm 0.13.1[2], fixing auto approval failures after
joining the hub.

[1] https://github.com/open-cluster-management-io/clusteradm/releases/tag/v0.8.1
[2] https://github.com/open-cluster-management-io/ocm/releases/tag/v0.13.1

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Ignore some flake8 rules conflicting with black code style so we can use
automatic formatting.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Use the same configuration used in the full environment. This should
make testing the minimal environment closer to the full one.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
The Rook quick start guide recommends checking `ceph status` using the
toolbox after deploying the cluster[1]. Move the rook ceph toolbox
before the rook pool, so we can validate the cluster status.

[1] https://rook.io/docs/rook/latest/Getting-Started/quickstart/#create-a-ceph-cluster

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
To make sure that we wait correctly for the cluster. On the next failure
the cluster status will be logged.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
This may help debug issues with ceph, and also validates that the
toolbox works.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Hopefully this will make it easier to debug random failures in
rbd-mirror.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We add a cephrbdmirror resource, but we don't wait until it is reconciled
and becomes ready. Wait and log the resource status to make debugging
easier on the next random timeout.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Hopefully this will help to debug issues when we have the next random
timeout in rbd-mirror test.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We have a random failure timing out waiting for rbd-mirror. One possible
reason may be bad ceph blocklist blocking rbd-daemon. Log the ceph osd
blocklist before we wait for rbd daemon.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Previously we tried to wait for all deployments when starting a running
minikube profile. This works most of the time, but fails if a deployment
is in a failed state (Progressing=False).

Fix by restarting all the failed deployments. We don't wait until they
are rolled out again, since the addons already wait for the deployments.

I could reproduce the issue once with rook-ceph-operator, and restarting
the deployment fixed it.
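
A minimal sketch of how the failed deployments could be selected, assuming the input is the parsed output of `kubectl get deploy -A -o json`. The helper name `failed_deployments` is hypothetical and is not taken from the patch.

```python
def failed_deployments(deploy_list):
    """Return (namespace, name) pairs of deployments with Progressing=False."""
    failed = []
    for deploy in deploy_list.get("items", []):
        conditions = deploy.get("status", {}).get("conditions", [])
        for cond in conditions:
            # Condition status values are the strings "True", "False", "Unknown".
            if cond.get("type") == "Progressing" and cond.get("status") == "False":
                meta = deploy["metadata"]
                failed.append((meta["namespace"], meta["name"]))
                break
    return failed
```

Each returned pair could then be restarted with `kubectl rollout restart deploy/NAME -n NAMESPACE`, without waiting for the rollout.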

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Abhijeet Shakya <abhijeetshakya21@gmail.com>
Previously limited to 1 worker per cluster due to various issues. Since
the issues are fixed now, we can remove this limit.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Create the test structure upfront instead of building it in
write_output. This will make it easier to add more info.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When comparing runs we want to make sure we tested the same code.
Normally you don't update the code during a run, so we can get the git
commit and branch once at the start.

Example:

    $ head -5 out/test.json
    {
      "git": {
        "commit": "9bb63eb6a7e0dfec1bc20144f81f84f4ed1540fb",
        "branch": "stress-git-info"
      },
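
One way the git info could be collected once at the start of a run is sketched below. The function names `git_info` and `_git` and the injectable `run` argument are assumptions made for illustration; only the `commit` and `branch` keys come from the example output above.

```python
import subprocess


def _git(*args):
    """Run a git command and return its stripped output."""
    return subprocess.check_output(["git", *args], text=True).strip()


def git_info(run=_git):
    """Collect the current commit and branch once, at the start of a run."""
    return {
        "commit": run("rev-parse", "HEAD"),
        "branch": run("rev-parse", "--abbrev-ref", "HEAD"),
    }
```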

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
kubectl.label() has confusing argument names. The first argument is the
resource, and the second is the label (key=value or key-). Rename to
make this clearer.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs and others added 27 commits May 2, 2024 07:45
The test helper "worked" since after starting the test cluster the
default context is updated by minikube. But if you start another
environment or change the default context manually, the test would fail
trying to access the wrong cluster, or worse, succeed silently
while modifying the wrong cluster.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When gathering logs from pods, we read a large amount of data from
kubectl, and write it to a file. Decoding every line on read and
encoding on write is wasteful. Add keepends= and decode= arguments to
commands.watch() so it can be used for gathering logs.
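
The effect of the two arguments can be illustrated with a small sketch over any binary stream. The function `iter_lines` is hypothetical and only mirrors the `keepends=` and `decode=` names from this message; it is not the actual `commands.watch()` code.

```python
def iter_lines(stream, keepends=False, decode=True):
    """Yield lines from a binary stream, optionally raw and undecoded."""
    for raw in stream:
        line = raw if keepends else raw.rstrip(b"\n")
        # With decode=False the bytes can be written to a file as-is.
        yield line.decode() if decode else line
```

With `keepends=True, decode=False` each line can be passed straight to a file opened in binary mode, which avoids the decode and encode round trip.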

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Sometimes we want to stop watching a command early and terminate the
command, ignoring the exit code.

An example is watching a resource status:

    $ kubectl get foo/bar -o jsonpath='{.status}{"\n"}' --watch
    {"phase": "Create"}
    {"phase": "Create"}
    {"phase": "Ready"}

We want to stop watching when "phase" is "Ready". This change adds this
capability by handling the GeneratorExit exception raised inside
commands.watch() when you close the return value. When closed, we kill
the watched process and return, ignoring the exit code.

Example usage:

    # Keep the generator object.
    watcher = commands.watch("kubectl", "get", "foo/bar", "-o", 'jsonpath={.status}{"\\n"}', "--watch")

    # Iterate over it...
    for line in watcher:
        status = json.loads(line)
        if status["phase"] == "Ready":
            # We are done!
            watcher.close()

With this we can watch resources efficiently without polling. We could
do this with kubectl.wait(), but now we can detect a timeout, and we can
implement complex waiting logic not possible using jsonpath.
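
The mechanism described above can be sketched with a plain generator around `subprocess.Popen`. This is a simplified illustration under stated assumptions, not the code in this change: the standalone `watch` function below handles only stdout and has none of the other options of `commands.watch()`.

```python
import subprocess


def watch(*args):
    """Yield output lines of a command; kill it if the generator is closed."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    except GeneratorExit:
        # The caller called close(): stop the command and ignore its exit code.
        proc.kill()
        return
    finally:
        proc.stdout.close()
        proc.wait()
    # Reached only when the command terminated by itself.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, args)
```

Calling `close()` raises `GeneratorExit` at the paused `yield`, so the cleanup runs inside the generator and the caller needs no extra bookkeeping.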

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When running a command that does not support timeout, we can implement
the timeout on our side. This change adds a timeout argument to
commands.watch(). If the watched command does not terminate within the
specified timeout, we kill it and raise a commands.Timeout exception.
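
One possible way to implement such a timeout is a timer thread that kills the child process. The sketch below is an assumption about the approach, not the code in this change; the local `Timeout` class and `watch` function only stand in for the names in `commands`.

```python
import subprocess
import threading


class Timeout(Exception):
    """Raised when the watched command does not terminate in time."""


def watch(*args, timeout=None):
    """Yield output lines of a command, killing it after timeout seconds."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, text=True)
    timed_out = threading.Event()

    def on_timeout():
        timed_out.set()
        proc.kill()

    timer = None
    if timeout is not None:
        timer = threading.Timer(timeout, on_timeout)
        timer.start()
    try:
        # Killing the process closes its stdout, which ends this loop.
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        if timer is not None:
            timer.cancel()
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
        proc.wait()
    if timed_out.is_set():
        raise Timeout(f"Timeout after {timeout} seconds running {args}")
```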

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
This is a higher level helper for using `kubectl get --watch`. We
support only jsonpath output since we must have one line per event.

Because kubectl returns a raw value for leaf nodes ({.status.phase} ->
Ready) instead of a json value ({.status.phase} -> "Ready"), we
cannot parse the json value in the helper.

Example usage - watching status changes:

    for line in kubectl.watch(
        "deploy/example",
        jsonpath="{.status}",
        context=context,
    ):
        status = json.loads(line)
        print(status)
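
As a rough illustration of what such a helper has to build, the sketch below assembles the `kubectl get --watch` command line. The function name `watch_args` is hypothetical; appending `{"\n"}` to the jsonpath expression follows the earlier command line example and ensures one line per event.

```python
def watch_args(resource, jsonpath, context=None):
    """Build a kubectl command line emitting one jsonpath line per event."""
    args = [
        "kubectl",
        "get",
        resource,
        "--watch",
        "-o",
        # kubectl expands {"\n"} in the template to a newline.
        "jsonpath=" + jsonpath + '{"\\n"}',
    ]
    if context:
        args.extend(["--context", context])
    return args
```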

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Based on ocs-operator commit[1], recommended by Ilya Dryomov[2]. We use
static configuration for simplicity, using rook-ceph-override
configmap[3].

We don't enable logging to file so we can get the logs via kubectl and
not use minikube specific code.

[1] red-hat-storage/ocs-operator@e39bb41
[2] https://tracker.ceph.com/issues/65487#note-4
[3] https://rook.io/docs/rook/latest-release/Storage-Configuration/Advanced/ceph-configuration/#example

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
With this configuration the rbd-mirror daemon logs also to
/data/rook/rook-ceph/log/ceph-client.rbd-mirror.a.log

This log file is pretty big, growing to 90 MiB in 12 hours on an idle
system, so I hope we can revert this change soon. Keeping this as a
separate commit to make it easy to revert.

To copy the entire logs you need to use minikube specific code:

    minikube cp -p dr1 dr1:/data/rook/rook-ceph/log/ceph-client.rbd-mirror.a.log $PWD

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We have a random issue when rbd-mirror cannot connect to the remote
peer, and we time out waiting for daemon health after 600 seconds.

When this happens, we see ERROR status in rbd mirror pool status:

    $ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
    health: ERROR
    daemon health: ERROR
    image health: OK
    images: 0 total

    DAEMONS
    service 4361:
      instance_id: 4408
      client_id: a
      hostname: dr2
      version: 18.2.2
      leader: true
      health: ERROR
      callouts: unable to connect to remote cluster

In rbd-mirror log we can see:

    8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin

    debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0  0 rbd::mirror::PoolReplayer: 0x5589c90dc000
    init_rados: reverting global config option override: mon_host:
    [v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->

    unable to get monitor info from DNS SRV with service name: ceph-mon

    debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp

    debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot
    identify monitors to contact

After restarting the daemon it works normally.

Add a workaround restarting the rbd-mirror daemon if mirroring health is
not OK after 180 seconds.  We try this 3 times, and fail if mirroring
health is still not OK after the last attempt.
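
The retry logic described above could be sketched as follows. The function name and the two injected callables are hypothetical; the sketch assumes `wait_for_health` raises `TimeoutError` when health is not OK within the timeout.

```python
def wait_for_health_with_restart(wait_for_health, restart_daemon, attempts=3, timeout=180):
    """Wait for mirroring health, restarting the daemon between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return wait_for_health(timeout=timeout)
        except TimeoutError:
            # Fail after the last attempt; restarting again would not help.
            if attempt == attempts:
                raise
            restart_daemon()
```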

Example log showing the workaround in action:

1. Attempt 1 times out

    2024-04-09 15:31:37,070 DEBUG   [rdr/0] Waiting for mirroring health in cluster 'dr1' (1/3)
    2024-04-09 15:31:37,259 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'UNKNOWN', 'health': 'UNKNOWN', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:31:40,845 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'UNKNOWN', 'health': 'UNKNOWN', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:32:18,270 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:32:37,404 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:33:18,557 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:33:37,561 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:34:19,089 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:34:37,226 DEBUG   [rdr/0] Timeout waiting for mirroring health in cluster 'dr1'

2. Restarting the rbd-mirror daemon

    2024-04-09 15:34:37,226 DEBUG   [rdr/0] Restarting deploy/rook-ceph-rbd-mirror-a in cluster 'dr1'
    2024-04-09 15:34:37,391 DEBUG   [rdr/0] deployment.apps/rook-ceph-rbd-mirror-a restarted
    2024-04-09 15:34:37,395 DEBUG   [rdr/0] Waiting until deploy/rook-ceph-rbd-mirror-a is rolled out in cluster 'dr1'
    2024-04-09 15:34:37,597 DEBUG   [rdr/0] Waiting for deployment "rook-ceph-rbd-mirror-a" rollout to finish: 0 out of 1 new replicas have been updated...
    2024-04-09 15:34:37,622 DEBUG   [rdr/0] Waiting for deployment "rook-ceph-rbd-mirror-a" rollout to finish: 1 old replicas are pending termination...
    2024-04-09 15:34:41,475 DEBUG   [rdr/0] Waiting for deployment "rook-ceph-rbd-mirror-a" rollout to finish: 1 old replicas are pending termination...
    2024-04-09 15:34:41,562 DEBUG   [rdr/0] deployment "rook-ceph-rbd-mirror-a" successfully rolled out

3. Attempt 2 succeeds

    2024-04-09 15:34:41,568 DEBUG   [rdr/0] Waiting for mirroring health in cluster 'dr1' (2/3)
    2024-04-09 15:34:41,742 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:35:19,509 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {}}
    2024-04-09 15:35:19,510 DEBUG   [rdr/0] Cluster 'dr1' mirroring healthy in 37.94 seconds

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
It is a list/array of elements and as per Google style guide it must be
plural.

Refer to https://google.github.io/styleguide/jsoncstyleguide.xml?showone=Singular_vs_Plural_Property_Names#Singular_vs_Plural_Property_Names

Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
Signed-off-by: youhangwang <youhangwang@foxmail.com>
Signed-off-by: youhangwang <youhangwang@foxmail.com>
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
* add e2e framework

Signed-off-by: jacklu <jilu@redhat.com>
Signed-off-by: Annaraya Narasagond <annaraya.narasagond@ibm.com>
10 participants