patch: advanced rolling ops using etcd by patriciareinoso · Pull Request #364 · canonical/charmlibs

patriciareinoso · 2026-03-16T20:00:25Z

Context

All the code on this PR is new

This implementation is based on DA241 Spec

The content of charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py belongs to another library that is currently being migrated to charmlibs so you can ignore it for now.

Summary

This PR is the first part of the implementation of advanced rolling ops at cluster level.

This PR includes:

Management of the client certificate used to connect to etcd
- The leader unit creates a self-signed certificate with live of 50 years
- Share the certificate with the other units using a peer relation
Implementation of the integration with etcd
- Share the mtls certificate
- Observe the resource_created event
- Observe the endpoints_changed event
Management of a env file needed to connecto etcd via etcdctl

This PR does not implement the locking mechanism. In here we only test that we can connect to etcd from each unit.

Current workflow:

The unit make a request
A new background process is spawn
The background process dispatches a Juju hook
The unit observes that hook
The unit writes and read a value in etcd
If the unit was able to connect to etcd, it executes the "restart" function.

This is a very simplified workflow to be able to test that the units from different apps can reach etcd.

To do

Implement the actual locking mechanism
Figure out how to properly install etcdctl

advanced-rollingops/tests/integration/test_etcd_rolling_ops.py

james-garner-canonical · 2026-03-16T20:11:43Z

Hi @patriciareinoso thanks for opening this PR. I haven't looked through the code yet, but I wanted to start with a request for some additional context first.

Please update the PR to describe whether this is a migration of an existing codebase or should be treated as all new. If it's a migration, please link to the existing library, and describe what's changing/open to change. If there was a spec involved in the design of this new lib, please also link to it.

You've probably already seen it, but I just want to also drop a link to our migration guide https://documentation.ubuntu.com/charmlibs/how-to/migrate/

If you'd like to talk about any of this synchronously, feel free to set up a meeting, my availability is in my calendar. Otherwise we'll proceed with asynchronous review when we have the context.

PS: If the PR isn't ready for review yet, you can mark it as a draft and manually re-request review when ready.

Gu1nness

Mostly questions around security and try/catch errors.
Excellent work!

advanced-rollingops/src/charmlibs/advanced_rollingops/__init__.py

rollingops/src/charmlibs/rollingops/_certificates.py

Gu1nness · 2026-03-17T13:16:26Z

rollingops/src/charmlibs/rollingops/_certificates.py

+
+        ca_key = PrivateKey.generate(key_size=4096)
+        ca_attributes = CertificateRequestAttributes(
+            common_name='rollingops-client-ca', is_ca=True


Shouldn't that be common_name ? Or something decided by the charm integrating ?
Or is it because it can't be chosen in a distributed way ?

In addition to Neha's comment on the common_name, it might be worth adding add_unique_id_to_subject_name=False to the attributes.

rollingops/src/charmlibs/rollingops/_certificates.py

Gu1nness · 2026-03-17T14:38:10Z

rollingops/src/charmlibs/rollingops/_manager.py

Pinning to this file but it's a more generic question: Is is possible for a unit to request a lock for another unit by "hacking" the etcdctl calls that request a lock ? What would be the consequences ?
Is there a way to protect from that ?

Yes, a unit can hack the etcdctl to request a lock for another unit. In the app-level solution we have isolation in databag (a unit cannot write in another's unit databag) but here we do not have this protection.

The only protection we have is: a unit cannot request a lock for a unit in another cluster ID which is done using prefixes and is automatically handled by etcd charm.

It would be nice to have an extra prefix assignment to a given unit, so that this cannot happen. But I don't think there is a straightforward (already implemented) way to do it.

if we are able to get the certificate and connect to etcd, any unit can make another unit execute an operation known by the attacked unit which would not be very nice.

Then, two things:

I'm afraid this will get abused by developers to trigger restarts/actions in other units/apps. And as of now, I don't see an easy way of getting around this either (I have some ideas using asymmetric cryptography but that would be a significant effort to bring).

We must ensure that the ids are secure enough and cannot be guessed because that could be a real issue: one compromised application/unit in a model could lead to a full DoS of the model (and we really don't want that).

Another question:
What would happen if I where to connect to etcd with my client app, requesting the prefix /rollingops.
Would I be able to list and export all locks that are requested by every application ?
It's probably something that etcd has to take care about, but I'm curious 👀

Gu1nness · 2026-03-17T14:43:09Z

rollingops/src/charmlibs/rollingops/_etcdctl.py

+        os.chmod(cls.ENV_FILE, 0o600)
+
+    @classmethod
+    def load_env(cls) -> dict[str, str]:


That function scares me.
I trust that it probably works but it does scare me that we're trying to parse some exports that way.
Why do we need to go through a dedicated file to do that ?
It seems to me that we're writing to the file just to read it as an environment provider for the command runner.

Isn't there another way to do that?

The parameters that are passed as env=cls.load_env() to the subprocess command could instead be appended to the command itself (example).

This would eliminate the need to write and load an env file, but would require access to relation data at runtime.

We will be running etcd commands from the background process, meaning that we won't have access to the relation data. That's why we need to write this information in a file.
I changed and now we write just the configuration in json format

rollingops/src/charmlibs/rollingops/_manager.py

advanced-rollingops/src/charmlibs/advanced_rollingops/_worker.py

Gu1nness · 2026-03-17T14:52:41Z

rollingops/src/charmlibs/rollingops/_manager.py

+        """Stop the etcd worker process in the current unit."""
+        unit = event.departing_unit
+        if unit == self.model.unit:
+            self.worker.stop()


So to be sure:

worker spawns one process and dies

Process triggers one dispatch and dies

Dispatch from the process prevents this event (relation-departed) from running (because only one event at a time).
Is it to cover race conditions ? Or are the workers incomplete for now ?

The worker is incomplete for now. It may run for a "long time" waiting for the lock to be granted. So when the etcd relation is removed we should stop it.

Gu1nness · 2026-03-17T15:15:36Z

Also I think you need to add a changelog file (cf https://documentation.ubuntu.com/charmlibs/explanation/charmlibs-publishing/ )

#364 has revealed that CI breaks with packages with a `requires-python` lower bound that's higher than 3.10 (it has `requires-python>=3.12`). This is because we would run `just integration-k8s` and `just integration-machine` in the `init` step to check if they have any tests, and that uses our default `justfile` Python version of 3.10. This PR fixes this by recording the minimum supported Python version when we calculate the Python versions to test the library with, and using that version in the failing step.

james-garner-canonical · 2026-03-17T23:23:34Z

CI failure should be fixed in #366

CODEOWNERS

advanced-rollingops/pyproject.toml

rollingops/src/charmlibs/rollingops/_manager.py

…E-9349-rolling-ops

james-garner-canonical · 2026-03-18T12:22:53Z

Integration tests should be fixed for real now with #373 merged, please merge main to pull in the fix.

reneradoi

In general the implementation looks good for the scope of the PR. I've left a few comments, mostly about error handling, and a few questions.

reneradoi · 2026-03-18T10:27:45Z

rollingops/src/charmlibs/rollingops/_certificates.py

+
+        ca_key = PrivateKey.generate(key_size=4096)
+        ca_attributes = CertificateRequestAttributes(
+            common_name='rollingops-client-ca', is_ca=True


In addition to Neha's comment on the common_name, it might be worth adding add_unique_id_to_subject_name=False to the attributes.

rollingops/src/charmlibs/rollingops/_certificates.py

reneradoi · 2026-03-18T11:37:02Z

rollingops/src/charmlibs/rollingops/_etcdctl.py

+        os.chmod(cls.ENV_FILE, 0o600)
+
+    @classmethod
+    def load_env(cls) -> dict[str, str]:


The parameters that are passed as env=cls.load_env() to the subprocess command could instead be appended to the command itself (example).

This would eliminate the need to write and load an env file, but would require access to relation data at runtime.

rollingops/src/charmlibs/rollingops/_etcdctl.py

rollingops/src/charmlibs/rollingops/_worker.py

rollingops/src/charmlibs/rollingops/_relations.py

rollingops/src/charmlibs/rollingops/_etcdctl.py

dimaqq · 2026-03-19T04:35:16Z

rollingops/src/charmlibs/rollingops/_dp_interfaces_v1.py

+try:
+    import psycopg2
+except ImportError:
+    psycopg2 = None


I wonder about vendoring this library as is.

From what I can see, postgres is not used by the rolling ops functionality, and this import is referenced in a method that's not ultimately used.

I guess this is intended to be temporary. From the PR description:

To do

Replace the content of charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py with the the actual library once it is migrated to charmlibs

Yes, this is just while we take _dp_interfaces_v1 out in its own standalone pip package.

this file was removed and now all the functions are in dpcharmlibs-interfaces dependency

Mehdi-Bendriss

Thanks Patricia, good work!

My main concerns are with regards: to exception handling and clear failure recovery mechanisms as well as making this lib substrate-agnostic.

(We should add more inline comments in the methods to better describe the flow as well as todos)

Mehdi-Bendriss · 2026-03-22T17:28:08Z

rollingops/src/charmlibs/rollingops/_certificates.py

+import os
+from datetime import timedelta
+from pathlib import Path


Please use pathops across the board for all file related operations, this way this will work for both VM and k8s substrates and the code won't diverge much from what you already have.

rollingops/src/charmlibs/rollingops/_certificates.py

rollingops/src/charmlibs/rollingops/_etcdctl.py

Mehdi-Bendriss · 2026-03-22T18:02:19Z

rollingops/src/charmlibs/rollingops/_models.py

+    ROOT: ClassVar[str] = '/rollingops'
+
+    cluster_id: str
+    owner: str


Let's add a new key lock_name defaulting to default or similar, to allow for extensibility of multiple locks later on.
And we do not expose it as property on the RollingOpsManager just yet.

rollingops/src/charmlibs/rollingops/_models.py

Mehdi-Bendriss · 2026-03-22T18:30:51Z

rollingops/src/charmlibs/rollingops/_manager.py

+        etcd_relation = self.model.get_relation(self.etcd_relation_name)
+        if not etcd_relation:
+            raise RollingOpsNoEtcdRelationError
+
+        EtcdCtl.ensure_initialized()
+
+        self.worker.start()


@patriciareinoso can you add TODOs in those missing parts?

Mehdi-Bendriss · 2026-03-22T18:32:02Z

rollingops/src/charmlibs/rollingops/_manager.py

+        proc = EtcdCtl.run(['get', self.keys.lock_key, '--print-value-only'], check=False)
+
+        if proc.returncode != 0:


we need to log the error here - and how do we recover? it seems like we just give up on the callback?

this is a dummy function just to test the connection to etcd. All the logic here will change in the next PR

Mehdi-Bendriss · 2026-03-22T18:33:02Z

rollingops/src/charmlibs/rollingops/_manager.py

+
+        value = proc.stdout.strip()
+        if value != self.keys.owner:
+            logger.info('Callback not executed.')


I'm not sure I understand this log line, since the callback invocation happens later

yeah I forgot a return there!

rollingops/src/charmlibs/rollingops/_manager.py

Mehdi-Bendriss · 2026-03-22T18:34:46Z

rollingops/src/charmlibs/rollingops/_manager.py

+        if not etcd_relation:
+            raise RollingOpsNoEtcdRelationError
+
+        EtcdCtl.ensure_initialized()


can you explain the expected failure and recovery mechanism?

In this case we have 4 possible exception raised.

RollingOpsNoEtcdRelationError, RollingOpsEtcdNotConfiguredError mean that the user is trying to request a lock before it is suposed to: relation is not there or etcd is not yet configured.

RollingOpsInvalidLockRequestError means that the lock request is invalid.

The 3 above means that it's a user mistake and they should handle those errors.

The 4th exception is PebbleConnectionError that can happen when trying to read the config file. For that case, we have a tenacity ensure_initialized -> on with_pebble_retry and after we give up, the exception is bubble up to the user.

james-garner-canonical

I just saw the version bump to 1.0.0.dev0. Based on what I'm seeing in the PR description about the completeness of the implementation and the comments about deferring work to a future PR, I think a 0.X working version is reasonable right now. Consider an initial zero version release to signal that the library may not yet be in a stable or fully usable state, but is ready for some use.

Do note that this initial library PR is when Charm Tech will be providing our main review, so please make sure that any deferred work is clearly designed and scoped for review. Charmlibs aims to provide a high quality collection of libraries that charmers can use with confidence, so we're not keen on landing WIP content in main without a clear plan for its pathway to a full release.

I haven't taken the time to review the code and design thoroughly yet due to the nature of the ongoing activity in the PR, but we will review before merging. Please ping me or re-request review when ready.

Gu1nness · 2026-03-27T12:44:38Z

rollingops/pyproject.toml

+]
+dynamic = ["version"]
+dependencies = [
+    # "ops",


Why is this commented ?

Gu1nness · 2026-03-27T13:41:22Z

advanced-rollingops/src/charmlibs/advanced_rollingops/_manager.py

-        Args:
-            callback_id: Identifier of the registered callback to execute when
-                the lock is granted.
-            kwargs: Optional keyword arguments passed to the callback when
-                executed. Must be JSON-serializable.
-            max_retry: Maximum number of retries for the operation.
-                - None: retry indefinitely
-                - 0: do not retry on failure


Maybe the args doc deserves to stay?

Gu1nness · 2026-03-27T13:44:00Z

rollingops/src/charmlibs/rollingops/_models.py

+
+
+@retry(
+    retry=retry_if_exception_type(PebbleConnectionError),


I think there are a few more exceptions you could catch here: APIError and ChangeError that can happen transiently.

Gu1nness · 2026-03-27T13:45:34Z

rollingops/src/charmlibs/rollingops/_models.py

+class SharedCertificate(NamedTuple):
+    """Represent the certificates shared within units of an app to connect to etcd."""
+
+    certificate: str
+    key: str
+    ca: str


Would this deserve a bit more structure? I think about the PrivateKey for example from charmlibs-interfaces-tls-certficates or something from cryptography?

Gu1nness · 2026-03-27T14:02:59Z

rollingops/src/charmlibs/rollingops/_worker.py

                / 'site-packages'
            )
            if path.stem == 'lib':
                new_env['PYTHONPATH'] = f'{venv_path.resolve()}:{new_env["PYTHONPATH"]}'


Maybe check that venv_path exists ?
This ensures we would not raise an OSError

Gu1nness · 2026-03-27T14:04:20Z

rollingops/src/charmlibs/rollingops/_worker.py

        worker = (
            self._charm_dir
            / 'venv'
            / 'lib'
            / f'python{version_info.major}.{version_info.minor}'
            / 'site-packages'
            / 'charmlibs'
-            / 'advanced_rollingops'
+            / 'rollingops'
            / '_etcd_rollingops.py'
        )


Same, maybe ensure this path exists ? Or do we consider that it should exist.

Gu1nness · 2026-03-27T14:07:53Z

rollingops/src/charmlibs/rollingops/_etcdctl.py

+    try:
+        with_pebble_retry(lambda: BASE_DIR.mkdir(parents=True, exist_ok=True))
+        with_pebble_retry(
+            lambda: CONFIG_FILE_PATH.write_text(json.dumps(asdict(config), indent=2), mode=0o600)
+        )
+    except (FileNotFoundError, LookupError, NotADirectoryError, PermissionError) as e:
+        raise RollingOpsFileSystemError('Failed to persist etcd config file.') from e


Maybe we should catch the RetryError from tenacity here as well ?

Gu1nness · 2026-03-27T14:11:34Z

rollingops/src/charmlibs/rollingops/_etcdctl.py

+
+def is_etcdctl_installed() -> bool:
+    """Return whether the snap-provided etcdctl command is available."""
+    return shutil.which(ETCDCTL_CMD) is not None


Could it happen that this call is executed multiple times in the same event?
If yes, we probably want to cache that result.

Gu1nness · 2026-03-27T14:15:03Z

rollingops/src/charmlibs/rollingops/_etcdctl.py

+        raise RollingOpsFileSystemError('Failed to remove etcd config file and CA.') from e
+
+
+def run(args: list[str]) -> str | None:


Easier and more idiomatic

def run(*args: str) -> str | None:

Gu1nness · 2026-03-27T14:19:44Z

rollingops/tests/integration/conftest.py

Does those changes mean that the template is generating something that doesn't make sense?
If yes, it's worth opening an issue

Gu1nness

Amazing work, nearly there for me!
Mostly a few more questions.
The only blocking point for me we need to discuss/fix is the security part to prevent DoS

patriciareinoso added 2 commits March 16, 2026 15:39

patch: etcd rolling ops version

70a4cca

first working version

b32878c

patriciareinoso requested a review from a team as a code owner March 16, 2026 20:00

patriciareinoso requested a review from james-garner-canonical March 16, 2026 20:00

patriciareinoso commented Mar 16, 2026

View reviewed changes

advanced-rollingops/tests/integration/test_etcd_rolling_ops.py Outdated Show resolved Hide resolved

patriciareinoso commented Mar 16, 2026

View reviewed changes

advanced-rollingops/tests/integration/test_etcd_rolling_ops.py Outdated Show resolved Hide resolved

patriciareinoso added 4 commits March 17, 2026 09:09

fix format

bf58a2d

fix linting

4d99740

add tenacity to integration test

a29b85d

remove unnecessary logs

de57674

Gu1nness requested changes Mar 17, 2026

View reviewed changes

add dataplatform as reviewes

effafe4

james-garner-canonical mentioned this pull request Mar 17, 2026

ci: support libs with requires-python higher than 3.10 #366

Merged

james-garner-canonical reviewed Mar 18, 2026

View reviewed changes

CODEOWNERS Outdated Show resolved Hide resolved

james-garner-canonical reviewed Mar 18, 2026

View reviewed changes

advanced-rollingops/pyproject.toml Outdated Show resolved Hide resolved

patriciareinoso requested review from Mehdi-Bendriss, reneradoi and skourta March 18, 2026 09:03

rename and add integration tests

3948b94

patriciareinoso commented Mar 18, 2026

View reviewed changes

rollingops/src/charmlibs/rollingops/_manager.py Outdated Show resolved Hide resolved

patriciareinoso added 2 commits March 18, 2026 12:04

Merge branch 'main' of https://github.com/canonical/charmlibs into DP…

1726070

…E-9349-rolling-ops

linting and rebase

b4f700c

james-garner-canonical mentioned this pull request Mar 18, 2026

ci: fix min Python version in integration tests #373

Merged

reneradoi reviewed Mar 18, 2026

View reviewed changes

dimaqq reviewed Mar 19, 2026

View reviewed changes

Mehdi-Bendriss requested changes Mar 22, 2026

View reviewed changes

patriciareinoso added 6 commits March 25, 2026 17:36

first part of comments

e352833

more comments answered

fab0563

more comments answered

7f91b87

fix linting job

d5e2d3b

fix UT

9d072e5

mark tests as only k8s

008c057

james-garner-canonical reviewed Mar 26, 2026

View reviewed changes

patriciareinoso added 3 commits March 27, 2026 10:47

fix integration tests

7777e42

use charmlibs apt

4ad3d7c

remove sans dns

c014ffd

Gu1nness reviewed Mar 27, 2026

View reviewed changes

rollingops/pyproject.toml Outdated

]

dynamic = ["version"]

dependencies = [

# "ops",

Copy link
Copy Markdown

Contributor

Gu1nness Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented ?

patriciareinoso added 3 commits March 27, 2026 13:47

add dependencies to .toml

5d91c44

add uv lock

4e2f31d

add wait in itnegration tests

e170323

Gu1nness reviewed Mar 27, 2026

View reviewed changes

patriciareinoso added 3 commits March 27, 2026 15:25

increate timeout

b24c103

increase log count

ec901c4

unlimited debug-log

5adbdca

		proc = EtcdCtl.run(['get', self.keys.lock_key, '--print-value-only'], check=False)

		if proc.returncode != 0:



		@retry(
		retry=retry_if_exception_type(PebbleConnectionError),

		raise RollingOpsFileSystemError('Failed to remove etcd config file and CA.') from e


		def run(args: list[str]) -> str \| None:

Conversation

patriciareinoso commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Summary

Current workflow:

To do

Uh oh!

Uh oh!

Uh oh!

james-garner-canonical commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Gu1nness left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

patriciareinoso Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Gu1nness commented Mar 17, 2026

Uh oh!

james-garner-canonical commented Mar 17, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

james-garner-canonical commented Mar 18, 2026

Uh oh!

reneradoi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

To do

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Mehdi-Bendriss left a comment

patriciareinoso commented Mar 16, 2026 •

edited

Loading

james-garner-canonical commented Mar 16, 2026 •

edited

Loading

patriciareinoso Mar 25, 2026 •

edited

Loading