fix: Sort rsync include/exclude options according to specificity (#561, #1420) by DerekVeit · Pull Request #1895 · bit-team/backintime

DerekVeit · 2024-10-04T18:59:33Z

Also see follow-up issue #2355

This branch is to address issues #561 and #1420 and possibly other circumstances that might arise where the existing strategy for ordering the --include= and --exclude= options to rsync don't quite work.

The existing code has ordered the options in groups. There has been some trial-and-error trying to get that ordering correct where fixing it for one case can easily break another. This proposed fix is to solve it with a more general strategy.

The general principle is that the user would expect more-specific configuration to supersede less-specific. For example, if one rule (include or exclude) applies to "/home" and another rule contradicts that for "/home/user", it is straightforward to assume that the user wants the second rule to apply to "/home/user" and the first rule to apply to anything else in "/home". The rsync command follows its include/exclude options as first-match-wins, so adding the options in order from most to least specific, irrespective of whether they are including or excluding, should give us the correct result. This is accomplished by reverse sorting the paths plus some special treatment for glob patterns and relative paths.

The only change is in Snapshots.rsyncSuffix, and the sorting is done in a new method, Snapshots.pathSelection. (It looks like I should rename this new method in snake case for the project's movement toward PEP 8.)

There are integration-level tests (test_rsync_selections.py) that check various cases, including for #561 and #1420. They call the Snapshots.backup method. For now, the revised rsyncSuffix still uses the original strategy unless a SELECTIONS_MODE attribute of the Config object exists and is set to "sorted". The tests use this to test cases with both strategies, original and sorted.

To make it easy to test various cases of existing file structures and include/exclude configurations, there is a test/filetree.py module for describing file structures with a string, e.g.:

"""
    var/
        log/
            dmesg
            syslog
"""

And the test cases are specified in a text format that uses those strings, e.g. in test/selection_cases. This is to make it easy to add more cases and also to make it very clear what each case is testing.

:exclude-name
    includes
        /var
    excludes
        log
    files_tree
        var/
            lib/
                foo
                log
            log/
                dmesg
                syslog
    expected_tree
        var/
            lib/
                foo

The tests use pytest for parametrize and fixtures, and there are fixtures defined in test/conftest.py for the Config object and the Snapshots object.

For some verbose logging from the test functions there is a simple log function in test/logging.py.

The new code does not yet include handling the EncFS feature. This might be as simple as adding encode = self.config.ENCODE and applying that in the right places as in rsyncInclude and rsyncExclude, but I haven't looked at that closely enough to see what I need for testing it, so I have left it alone so far. In this respect, the new code is definitely not ready. (I know the consideration of removing EncFS is only a maybe-sometime idea for now.)

I'm interested in BiT maintainers' thoughts, questions, and opinions.

SELECTION_MODE is probably only for temporary development testing purposes and can be eliminated, and rsyncSuffix might be simplified. The note I put in the docstring might be addressed or removed.

Based on re-reading CONTRIBUTING.md, I will convert the new tests from pytest to unittest. Let me know if you would prefer that I not add the ddt library for the parameterizing.

The pathSelections method is more sprawling than I would like.

The log function is not essential to anything. I find it useful to have easy, verbose logging from my tests, especially while working on them, but it can be removed.

I have a note in pathSelections suggesting that the GUI somehow alert the user if they include and exclude the exact same path. I think the GUI should refuse to accept the configuration update until the conflict is resolved too.

I thought I had read the CONTRIBUTING guidelines already, but I see some items in those for me to address.

Ironic that I have always used mostly single quotes and just earlier this year forced myself to start using mostly double quotes to be more aligned with what I started to think was a bit more generally favored, e.g. with black. Easy fix, though. I do use black on my new code, interactively, not blindly. It does have a setting to not force the double quotes.

…cases.

).

…for_cases.

…skipping.

…matting.

buhtz

Hello,
much time has passed. Sorry. Before opening something new I thought it is much better to finish some open PRs. I re-read the whole conversation, now with a bit more understanding on my side. 😆

I am not an expert when it comes to rsync. When I understand you and your code correct, you do not modify the order of the "option groups" (location of include and exclude parameters) but you sort that groups each one in itself?

Seems to make sense to me. But I wonder if an rsync expert might see a problem? Would you mind to describe your approach and the problem behind at rsync mailing list and ask for feedback?

I take this serious and I am a bit scared if this might have a negative impact on backup behavior. It might cause unexpected behavior in some use cases.

Can you imagine any use case where your approach might be a problem? Or a use case where an existing backup profile behave different with your approach without the user acknowledge?

From your initial description of the problem and your solution I sense you as an expert (more than me). So I trust your expertise in the end.

Some details:

logging.py : Yes, please remove this. Python has its own logging module, no matter that BIT isn't using it yet. And there is also the good old print() for debugging and temporary output.
That filetree.py and the ddt pyyaml dependencies somehow bothering me. I don't understand it. And I need to invest a lot of time to understand it. Others need to do that, too. I good test need to be understandable in itself, in best cases. But I am open minded. I still don't understand what the purpose is. You define in the yaml file what files should be created? Why not using regular Python code for this?
It is my strong opinion that a test do not need to be easy to write, but easy to read. It need to be self explaining. But separating code in extra modules (filetree.py) and yaml files does not help with this point.

It would help me to understand your productive code If I would see tests for it. But the current tests are not understandable for outsiders.

What do you think? How can we proceed further?

buhtz · 2025-04-27T18:55:11Z

common/snapshots.py


        return ret

+    def pathSelections(self):


Just based on a gut feeling, I would say this method can be split into several methods.
Can the sorting part be separated from "self.config" and any other BIT-specific code? And than make that a module function instead of a method.

buhtz · 2025-04-30T12:27:20Z

Just an idea? Is there a good reason against using rsync --dry-run in the unit tests? OK, the files and dirs need to be created in the real file system. But then we can parse rsync's dry-run output without wasting resources on copy something.

DerekVeit · 2025-04-30T13:46:50Z

That's worth considering, but I think that given

the need to write real files to begin with
small number of files and minimal amount of data
efficiency of Linux filesystem cache
speed of contemporary drives, i.e. SSD

that the test time saved is probably not worth complicating the tests with the dry-run output handling. I don't mean the difficulty or work of parsing it, which would be trivial, but that that output would be a proxy for correct behavior rather than real thing, which adds another layer of consideration (opportunity for error) when making sure that the tests are testing what we think they're testing, especially around special cases. Really copying the files has the simplicity that the resulting files should be the same as the source files, so that there is an apples-to-apples comparison. And no special knowledge about dry-run behavior is introduced. Also, the dry-run strategy would increase how much the tested code is being run in a "test mode," which I think is best to minimize.

I want to address your comments of a few days ago but will have to get back to you maybe later this week. I do want to finish up this PR too.

buhtz · 2025-04-30T14:30:30Z

Agreed. 😄

buhtz · 2026-01-11T08:19:00Z

Hello Derek,
I need to reject this specific PR. But I don't reject the idea itself.

The PR is to big and complex.
The style of tests is to complex, hard to understand and maintain.
To much risk of introducing new issues. We need more and more clear information and tests to judge the impact of potential solutions.
To old and lack of response and resources by contributor.
Currently no clear concept about an end-user friendly solution. I am not convinced that BIT users should even be bothered with ordering of rsync rules. This hell should be hidden from them.

I opened a meta issue #2355 to keep track of the situation.

buhtz · 2026-01-12T08:12:07Z

Re-open discussion: #2355 (comment)

DerekVeit · 2026-01-12T17:50:06Z

The change to the production code is just at the Snapshots.rsyncSuffix method and a helper method added to it. But testing this aspect of the application is something needed independently of that change, so I would like to move it to a separate PR that comes first.

I have used the parametrzied/ddt-style of tests quite a lot, so it seems natural to me, but I can imagine how it feels like a big complication to introduce. I am removing the ddt and the YAML.

The SELECTIONS_MODE was just for temporary use in development, to allow running the tests with both the new and original implementations of rsyncSuffix. Making the testing part a separate PR will achieve mostly the same thing unless we want to keep fallback capability.

buhtz · 2026-01-12T19:43:56Z

Hello Derek

The change to the production code is just ...

I understood that and can see it in the PR's diff.

I have used the parametrzied/ddt-style of tests quite a lot

I am aware of the concept of parametrized and data driven tests. I am using that too where ever it makes sense. It also make sense in this case.

But the implementation in this case is the problem. Two new tools and one new language (yaml). This need to be understood by others, not only me but also other contributors and future maintainers. No one will or want to read extra documentation just to understand a test. Imagine such a test fails one day. No one will understand it. No one can fix it. They will just silence the failing test.

There is no problem with using DDT style tests. But please use vanilla Python wherever it is possible. Might look "ugly" (not pythonic) and might be much more lines of code. But I don't care about things like this in case of test code when it helps with understanding that code.

BIT as a project is FOSS and tries to be accessible for low-level programmers. Imagine a first semester computer science student or a 15 year old school teen.

Oh, and I just saw that you using pytest-style tests. The only thing I love about pytest is the colorful output of the test runner (the cli tool). Please don't use pytest-style tests in BIT. IMHO pytest-style tests are against the concept of "valuable tests" (see Khorikov (2020) Unit Testing - Principles, Practices, and Patterns). It is a big topic I could write a lot about. 😄 In short: Such tests are ugly, to implicit and hard to understand by outsiders. And the pytest developers do disqualify themself on every congress and youtube video because they make it the first and biggest "advantage" of pytest, that you can write tests with less lines of code compared to vanilla unittest module. Again: Number of lines do not count when it comes to unit tests.

Sorry, it is kind of an religious topic. 🤣

You can keep YAML if it spares some time and resources. But write it into the test method, not somewhere else and especially not in an extern file.

What is about split_indent()? Does it manipulate the indent in the YAML content? Looks like a workaround because YAML can't handle the problem itself?

And even from a technical point of view I still do not understand how all these components (ddt binary, yaml, non-test files, ...) are working together while the tests are running. When is what executed by what etc pp?

The topic of this issue/PR is complex, so the tests will also be complex and not easy to understand. They won't be perfect. But they don't need to be extra complex and difficult. 😄 🤟

I suggest your next PR has just one or two tests. And then lets see If I can get it into my stubborn head.

The SELECTIONS_MODE was just for temporary use in development,

Ok.

buhtz · 2026-01-12T20:03:51Z

Why not write file system structures using dict instead of multiline strings? Easy to read, understand and syntax highlighted.

fs = {
    "var": {
        "log": {
            "bar": {
                "bar.log": None,
            },
            "syslog": None,
        }
    }
}

# or lists for files
alternative = {
    "var": {
        "log": {
            "bar": ["bar.log", "foo.log"],
            "syslog_is_empty": [],
        }
    }
}

DerekVeit · 2026-01-12T20:06:07Z

Above you can see that I converted the tests from pytest to unittest back in October 2024.

Don't worry. I agree completely with your opinion about the value of making the code as obvious as possible -- especially for tests and especially for an open source project -- even if my code hasn't all been the greatest demonstration of it. Your feedback has been good.

DerekVeit · 2026-01-12T21:07:20Z

Why not write file system structures using dict instead of multiline strings?

I find that much less readable, both visually and for how to understand or employ its conventions. It would still need functions to convert it to and from files. And the string representation allows using assertEqual on the strings, which gives a very simple and readable diff when there is a failure.

buhtz · 2026-01-13T14:06:15Z

I converted the tests from pytest to unittest back in October 2024.

I can't see. Aren't the tests in test_filetree.py and test_logging.py still pytest-style tests?

Why not write file system structures using dict instead of multiline strings?

I find that much less readable, both visually and for how to understand or employ its conventions. It would still need functions to convert it to and from files.

OK, you got me. Stick to the multiline strings.

gives a very simple and readable diff when there is a failure.

I do ❤️ that.

DerekVeit · 2026-01-13T14:40:27Z

Yes, I see now that I overlooked those tests of the testing helpers when I converted the tests to unittest. And they also have parametrized tests. I will convert those tests of the test helpers too.

Derek Veit added 30 commits October 3, 2024 16:30

Add test.logging with log function.

ffda525

Add bit_config pytest fixture.

550b786

Add bit_snapshot pytest fixture.

6d1594e

Add test.filetree.

3d1453e

Add test_rsync_selections.

1109687

Make-adjusted python paths.

0e90172

Snapshots.rsyncSuffix: docstring note about extra parameters.

71dedd0

conftest.py: add a blank line.

2908bcd

filetree.py: add docstrings.

1af7596

filetree.py: wrap 2 long lines.

76a0b63

test.logging: add docstring.

fc560c0

test.logging: add horizontal lines.

139f6f7

test_rsync_selections: load the cases in the params functions.

d8bed28

test_rsync_selections: params_for_cases: add docstring and comments.

d786ad6

test_rsync_selections: combine params_for_cases and params_for_raise_…

36a5c50

…cases.

conftest: update for the refactoring of Config.setSnapshotsPath (67e93bd

fbd89a1

).

test_rsync_selections: specify selection mode(s) when calling params_…

25d4151

…for_cases.

selection_raise_cases: remove "both-implied-dir" case.

5217805

selection_cases: only use absolute path in excludes.

a1552a3

selection_cases: update cases with original failing.

c15ecae

selection_cases: add "various" case.

b1e03f7

snapshots.py: add Snapshots.pathSelections method.

0140b22

snapshots.py: use pathSelections if SELECTIONS_MODE == "sorted".

61285c2

snapshots.py: Snapshots.takeSnapshot: save cmd.

f635129

test_rsync_selections: params_for_cases: allow some conditional case …

c45370f

…skipping.

test_rsync_selections: add update_config.

625cedd

test_rsync_selections: use update_config.

50dacc9

test_rsync_selections: remove add_includes.

d5c5671

test_rsync_selections: use a subdirectory for the sample files.

ae4dc36

test_rsync_selections: test_rsyncSuffix__raises: some log message for…

b04df4c

…matting.

buhtz added 2 commits April 26, 2025 15:25

Merge branch 'dev' into path_selections

7f6e91d

Merge branch 'dev' into path_selections

8d51323

buhtz requested changes Apr 27, 2025

View reviewed changes

buhtz added the Discussion decision or consensus needed label Apr 27, 2025

buhtz modified the milestones: 3nd release from now, 2nd release from now Apr 27, 2025

buhtz added 4 commits June 3, 2025 09:38

Merge branch 'dev' into path_selections

42680ad

Merge branch 'dev' into path_selections

9770456

Merge branch 'dev' into path_selections

8000672

Merge branch 'dev' into path_selections

70e673a

buhtz mentioned this pull request Jan 11, 2026

Rule ordering (--exclude and --include) in Rsync #2355

Open

buhtz closed this Jan 11, 2026

buhtz added the PR: Unlikely to merge PR is unlikely to be merged due to project goals, technical limitations, or other constraints. label Jan 11, 2026

buhtz reopened this Jan 12, 2026

buhtz removed the PR: Unlikely to merge PR is unlikely to be merged due to project goals, technical limitations, or other constraints. label Jan 12, 2026

bit-team deleted a comment from DJ-wackong Feb 11, 2026

buhtz modified the milestones: 2nd release from now (maybe 2.1.*), Upcoming next (2.0.0) Feb 20, 2026

Conversation

DerekVeit commented Oct 4, 2024 • edited by buhtz Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

buhtz left a comment

Choose a reason for hiding this comment

Uh oh!

buhtz Apr 27, 2025

Choose a reason for hiding this comment

Uh oh!

buhtz commented Apr 30, 2025

Uh oh!

DerekVeit commented Apr 30, 2025

Uh oh!

buhtz commented Apr 30, 2025

Uh oh!

buhtz commented Jan 11, 2026

Uh oh!

buhtz commented Jan 12, 2026

Uh oh!

DerekVeit commented Jan 12, 2026

Uh oh!

buhtz commented Jan 12, 2026

Uh oh!

buhtz commented Jan 12, 2026

Uh oh!

DerekVeit commented Jan 12, 2026

Uh oh!

DerekVeit commented Jan 12, 2026

Uh oh!

buhtz commented Jan 13, 2026

Uh oh!

DerekVeit commented Jan 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DerekVeit commented Oct 4, 2024 •

edited by buhtz

Loading