
Add GitLab and GitHub CI tests#133

Open
rpgoldman wants to merge 5 commits into logic-and-learning-lab:main from rpgoldman:main-with-tests

Conversation

@rpgoldman
Contributor

These two tests aim to do the same thing. The GitLab actions run on SIFT's local GitLab repo, and the GitHub actions are intended to run on GitHub.

The tests use SIFT's prt regression test framework and run on the examples distributed with Popper. prt is included in the Docker image. The Docker image is built from the popper-prt repo, whose source we will make available.

@rpgoldman rpgoldman force-pushed the main-with-tests branch 2 times, most recently from 31d060c to 598a84b on January 6, 2026 at 04:26
@andrewcropper
Collaborator

Is this different from #126?

@rpgoldman
Contributor Author

This is the better one. I need to check and see why adding setuptools as a dependency caused a test breakage here.

[Notes mostly for myself:]

Here's the log with error and backtrace:
synthesis-length.log

Failure is here:

                    if size >= settings.max_literals:
                        assert(False)

I'm not sure that this is deterministically repeatable.

@rpgoldman
Contributor Author

I reran the tests and now they all pass. That implies there's something non-deterministic here. The open question is: "did I break something and make it non-deterministic, or was it always non-deterministic?"

If this is non-deterministic, and there's an error condition here, does that mean that this condition (the size being too large) should be handled differently? I.e., should Popper treat this as a normal kind of learning failure, backtrack, and try a different hypothesis (possibly with new constraints), or should Popper be tweaked so that this error condition cannot happen?
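The first option above (treating the size overflow as an ordinary learning failure rather than asserting) could look something like the following sketch. This is purely illustrative, not Popper's actual code: the exception class, `check_size`, `search`, and the `(name, size)` candidate tuples are all hypothetical names; only `max_literals` comes from the snippet above.

```python
# Hypothetical sketch (NOT Popper's implementation): replace the bare
# `assert(False)` with a recoverable signal that the search loop catches
# and treats like any other failed hypothesis.

class MaxLiteralsExceeded(Exception):
    """Raised when a candidate hypothesis grows past the size bound."""

def check_size(size, max_literals):
    # The recoverable analogue of `if size >= settings.max_literals: assert(False)`.
    if size >= max_literals:
        raise MaxLiteralsExceeded(f"size {size} >= max_literals {max_literals}")

def search(candidates, max_literals):
    # Treat the overflow as a normal learning failure: skip the offending
    # hypothesis and keep searching, rather than crashing the whole run.
    accepted = []
    for name, size in candidates:
        try:
            check_size(size, max_literals)
        except MaxLiteralsExceeded:
            continue  # backtrack / try a different hypothesis
        accepted.append(name)
    return accepted
```

Whether this is the right design is exactly the judgment call posed above; the sketch only shows that the change is mechanically small.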

The secondary question is whether this behavior invalidates the test structure here. I.e., do you want to go ahead and merge this, and perhaps add a separate issue for fixing the size overflow? It's possible that we should treat this test as an expected failure, until that can be fixed (since we would like to be able to use the tests as a way to vet Popper changes).

You are obviously the person to make these judgment calls.

@andrewcropper
Collaborator

The open question is: "did I break something and make it non-deterministic, or was it always non-deterministic?"

I do not know, but I have not seen this error before. And the code should never reach that condition.

I think some of your code has changed.

Looking at the log, when Popper generates this hypothesis:

2.8 s INFO: f(V0,V1):- zero(V1),empty(V0).
2.8 s INFO: f(V0,V1):- tail(V0,V3),f(V3,V2),succ(V2,V1).

Then it should have reached this code point and returned:
https://github.com/rpgoldman/Popper/blob/598a84b9567008da6eaf13fabd433741023d4274/popper/loop.py#L203

I do not see how any non-determinism could prevent that.

I just tried your code but I cannot replicate the crash.

@rpgoldman
Contributor Author

@andrewcropper I've just run this 20 times locally, and the test on synthesis-length passes every time. I have just rebased this PR onto your updated main. The only differences between this PR and main now are in the github and gitlab actions and setup.py. So I cannot have done anything to break Popper itself.

I note that your fixes to alan.pl were not in place when the test failed. Perhaps that was related? At any rate, this looks safe to merge (see the "Files changed" to verify).

@rpgoldman rpgoldman marked this pull request as ready for review February 4, 2026 01:11
@rpgoldman
Contributor Author

OK, I just ran all the tests locally and everything looks fine. But I got a different failure on GitHub. I wonder if this has to do with running the tests in parallel on the GH runner? I will see if tweaking the way things run helps. So for now, I am marking this as draft again.

@rpgoldman rpgoldman marked this pull request as draft February 4, 2026 02:30
@andrewcropper
Collaborator

I have no idea and I do not know what CI/GH runner etc are, but I am confident that the bug is not a standalone Popper one.

@rpgoldman
Contributor Author

I have no idea and I do not know what CI/GH runner etc are, but I am confident that the bug is not a standalone Popper one.

I agree — I’m not seeing this issue locally or in the GitLab CI.

GitHub runner has only 4 cores, so lower maxparallel from 16 to 4.
@rpgoldman
Contributor Author

@andrewcropper I think that I have fixed the problem on GitHub: the problem appears to have been related to the fact that I was trying to run 16 concurrent jobs on a machine (the GitHub "runner") that has only 4 cores.

I dropped the number of cores and now the tests pass reliably.
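The fix described above amounts to never asking for more parallel jobs than the machine has cores. A minimal sketch of that rule, assuming a hypothetical `requested` value coming from the test harness's configuration (the commit above lowered maxparallel from 16 to 4 by hand):

```python
# Illustrative sketch: cap the number of parallel test jobs at the number
# of cores actually available, since GitHub-hosted runners expose only a few.
import os

def effective_parallelism(requested):
    # os.cpu_count() can return None in exotic environments; fall back to 1.
    cores = os.cpu_count() or 1
    return min(requested, cores)
```

Computing the cap at runtime would avoid hard-coding a per-runner value, at the cost of less reproducible CI timing.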

I hope that this will help you in maintaining and improving Popper, because it provides a way to detect changes that degrade the behavior of Popper. Of course, this kind of end-to-end test can only do so much, but it should make it easier to move 3 steps forward without moving 2 steps back!

@andrewcropper andrewcropper marked this pull request as ready for review February 23, 2026 09:07
@andrewcropper
Collaborator

Super basic questions, which are related to this message:

The tests use SIFT's prt regression test framework and run on the examples distributed with Popper. prt is included in the Docker image. The Docker image is built from the popper-prt repo, whose source we will make available.

Q1. Where are the actual tests? In a docker image that is not yet available?
Q2. How are the tests actually run?
Q3. How will this all be maintained? Does the docker image need revising when Popper changes?

@rpgoldman
Contributor Author

Super basic questions, which are related to this message:

The tests use SIFT's prt regression test framework and run on the examples distributed with Popper. prt is included in the Docker image. The Docker image is built from the popper-prt repo, whose source we will make available.

Q1. Where are the actual tests? In a docker image that is not yet available?

The actual tests are defined in the popper-prt system. They are included in the Docker image, but can be inspected in that repo (and a new Docker image can easily be built from it).

More specifically, the *.prt files have the test definitions. There is one of these for each of the directories in Popper's examples. The test definitions employ the definitions in defaults.pl, which they include. I have just added more information to the popper-prt repo.

Q2. How are the tests actually run?

GitHub itself will run the tests after every push. It will also check merge requests and complain if the code in the corresponding branch does not pass the tests. This is the primary use case for these tests.

The README.md for popper-prt explains how to use the docker image to run the tests against a local copy of Popper.

The Docker image makes things easier to use, but also overcomes a minor organizational issue. My company is making the prt library available as open source, but this requires someone to do the drudge work of getting the right license text spattered everywhere before the source repository can be put on GitHub. I continue to push for this to happen, but, unsurprisingly, there are few volunteers for the task. The Docker image allows us to stop waiting...

prt is essentially a set of relatively simple Perl scripts that run one or more programs for you, store the results, and then use regular-expression-filtered matching to compare those results against previous results. So most of the tests here check the True Positive, False Positive, True Negative, and False Negative output lines against previous results. The checks for the noisy tests are slightly more complex, because they match modulo some tolerance.
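The comparison style described above can be sketched in a few lines. This is not prt (which is a set of Perl scripts) and the output format, function names, and rubric shape here are invented for illustration; it only shows the regex-extract-then-compare-with-tolerance idea:

```python
# Illustrative sketch of regex-filtered result matching (not prt itself):
# pull the TP/FP/TN/FN counts out of a program's output with a regular
# expression, then compare them to stored rubric values within a tolerance.
import re

METRIC_RE = re.compile(
    r"(True Positive|False Positive|True Negative|False Negative)s?:\s*(\d+)"
)

def extract_metrics(output):
    # e.g. "True Positives: 10" -> {"True Positive": 10}
    return {name: int(value) for name, value in METRIC_RE.findall(output)}

def matches_rubric(output, rubric, tolerance=0):
    # tolerance=0 is an exact match; noisy tests would pass a small slack.
    got = extract_metrics(output)
    return all(abs(got.get(k, 0) - v) <= tolerance for k, v in rubric.items())
```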

Q3. How will this all be maintained? Does the docker image need revising when Popper changes?

The tests need no maintenance until there is a version of Popper that provides improved performance on one or more of the existing examples, or there are new examples to add to the test suite. In the former case, updated rubric file(s) would need to be stored in the popper-prt repository (which you could fork or, for that matter, which we could move into the logic-and-learning-lab GitHub group). If there are new examples, one would need to add a new prt test file (you will see that these are very simple: most are only 7 lines long) and a new rubric directory. A rubric directory is created by saving the result of a prt run: before there is a rubric to match, you run the test, it fails, and you move the corresponding directory of results into rubrics.

The main idea is to make it possible to improve Popper while being confident that one is not breaking anything.

If there are any problems, you can post them here or as issues on the popper-prt repo. That would include questions about the documentation.
