Smartly Apply Constraints During Cartesian Product by Scienfitz · Pull Request #773 · emdgroup/baybe

Scienfitz · 2026-03-31T00:40:44Z

This PR implements a more optimized Cartesian product creation in the presence of cosntraints which can result in memory and time gains of many order of magnitudes (see mini benchmark below).

Rationale

Currently constraint filtering is done only after the entire search space has been created. This means the memory needed for the intermediate df is potentially huge even if the final df is tiny. In practice this had led to many problems when working with slot based mixtures, even if the optimized from_simplex constructor was used
Instead, any given cosntraint can be applied early during the parameter-by-parameter cross join operations. There are three tiers of applying this:
1. As soon as possible filter: A constraint can be applied as soon as all of its affected parameters are in the current crossjoin-df. After this application the constraint is fully ensured and does not have to be applied again. If the order in which cross join goes over the parameters is optimized this would already lead to an improvement as subsequent operations "see" much smaller left-dataframes.
2. Partial/early filter:
- Some constraints can be applied even if not all affected parameters are present yet.
- Example: no label duplicates - even if there are just 2 out of 7 parameters present, we can remove the rows that have duplicates in the 2 parameters.
- This filter has to be repeated in every loop iteration until it ran with all affected parameters present. However, the cost of multiple filter applications are dwarfed by the savings from smaller cross join operations.
- Whether a constraint supports early filtering might depend on its configuration, example: exlclude constraint with combiner (early filter supported for OR, not for AND or XOR)
  3 Look ahead: Some constraints can look ahead based on the possible parameter values that might be incoming and recognize that constraints cannot be fulfilled even in future crossjoin iterations.
- This is essentially what from_simplex implements for the very special case of 1 global sum constraint and 1 cardinality constraint. If we ever implement look-ahead filters for all constraints the from_simplex constructor might become obsolete
- For this we would need access to the parameter values inside the constraint logic, which might be easier to implement once the constraints have been refactored Refactor General Constraint Interface #517
This PR implements smart filtering for tiers 1 and 2. I left IMPROVE notes to remember about tier 3. To achieve this
- Constraint.get_invalid was extended to handle situations where not all parameters are in the df to be filtered. The constraint can the decide whether it can apply early filtering or returns the new UnsupportedEarlyFilteringError if it needs all parameters present
- The crossjoin is done in a custom loop inside parameter_cartesian_prod_pandas_constrained which itself performs the process described above after deciding on a smart parameter order for the crossjoin

Good To Know

new Constraint method has_polars_implementation, discussion here
new Constraint property _filtering_parameters, discussion here
strange appearance of DiscreteNoLabelDuplicatesConstraint in DiscretePermutationInvarianceConstraint .get_invalid explained here

Mini Benchmark:

Scenario 1: 7 categoricals with 8 values each and a no label dupe constraint
Scenario 2: complex slot-based mixture with 6 slots, 3 subgroups, sum constraints and additional product parameters
tested on (old main vs this branch) x (polars on/off)
30min as max runtime for a quick test

Scenario	Polars	main	feature	Speedup	Memory reduction
1: `from_product`, 7×8 cat, NoLabelDuplicates (2M→40K rows)	OFF	61.4s / 636 MB	6.6s / 48 MB	9.3x	13.2x
	ON	1.8s / 73 MB	1.6s / 47 MB	1.1x	1.6x
2: `from_simplex`, 6-slot mixture + 3 extras (~12B→22K rows)	OFF	>30min	0.5s / 17 MB	>3600x	—
	ON	>30min	0.5s / 17 MB	>3600x	—

…duct

Scienfitz · 2026-04-02T00:21:17Z

baybe/constraints/base.py

+        return set(self.parameters)
+
+    @property
+    def has_polars_implementation(self) -> bool:


this is a variant of giving such a property to constriants

alternatives would be:

mixin class: downside of complicated inheritance

class variables: manually maintained and not that pretty

Scienfitz · 2026-04-02T00:22:44Z

baybe/constraints/base.py

        # TODO: Should switch backends (pandas/polars/...) behind the scenes

+    @property
+    def _required_filtering_parameters(self) -> set[str]:


this is a private helper that might become obsolete in refactoring: For most constraints the parameters they operate on are simply self.parameters. There are 2 exceptions, which simply override this property

This method will likely be removed if there is ever a refactored smarter interface already

Scienfitz · 2026-04-02T00:25:14Z

baybe/constraints/discrete.py

+        # label-dedup part (which is always safe incrementally) is applied.
+        if self.dependencies:
+            if not self._required_filtering_parameters <= cols:
+                return DiscreteNoLabelDuplicatesConstraint(


this occurrence of DiscreteNoLabelDuplicatesConstraint might seem very random here

It is here because DiscretePermutationInvarianceConstraint includes tha auto-applciation of the label deduplication. In #626 I wrote a new example which made me aware that this is wrong. That PR also removes it. However, in this PR its still included and for consistency this is added here. Will be consolidated and likely removed when both are merged.

Copilot

Pull request overview

This PR optimizes discrete search space construction by applying discrete constraints incrementally during Cartesian product generation (including improved Polars/Pandas interop), aiming to reduce intermediate memory use and runtime for highly constrained spaces.

Changes:

Added baybe.searchspace.utils with shared Cartesian product helpers and a new incremental constrained-product builder.
Extended discrete constraint interfaces to support (or explicitly refuse) early filtering via UnsupportedEarlyFilteringError, plus a has_polars_implementation capability flag.
Updated discrete search space constructors and tests to use the new incremental filtering path (and added parity tests vs the naive approach).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`baybe/searchspace/utils.py`	New utilities: parameter ordering, pandas/polars cartesian product, and incremental constrained cartesian product builder.
`baybe/searchspace/discrete.py`	Switches discrete space construction to incremental filtering; Polars path builds partial product and merges remainder via pandas. Adds new `from_simplex` validation.
`baybe/constraints/base.py`	Adds `_required_filtering_parameters` and `has_polars_implementation`; updates docs for partial-dataframe filtering semantics.
`baybe/constraints/discrete.py`	Updates discrete constraints to support early/partial filtering and to raise `UnsupportedEarlyFilteringError` when unsupported.
`baybe/exceptions.py`	Adds `UnsupportedEarlyFilteringError`.
`tests/constraints/test_constrained_cartesian_product.py`	New test ensuring naive vs incremental constrained product results match across several scenarios.
`tests/constraints/test_constraints_polars.py`	Updates imports for moved cartesian product helpers.
`tests/test_searchspace.py`	Updates imports for moved cartesian product helpers.
`tests/hypothesis_strategies/alternative_creation/test_searchspace.py`	Adjusts simplex-related tests to reflect new `from_simplex` constraints.
`CHANGELOG.md`	Documents incremental filtering and new constraint capability/exception additions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-02T00:33:48Z

baybe/searchspace/utils.py

+    for param in ordered_params:
+        param_df = pd.DataFrame({param.name: param.active_values})
+        if df.empty:
+            df = param_df
+        else:
+            df = pd.merge(df, param_df, how="cross")
+


parameter_cartesian_prod_pandas_constrained uses if df.empty: df = param_df to decide whether to start or to cross-merge. If initial_df is provided but has 0 rows (or 0 columns), df.empty is True and this path incorrectly replaces the initial dataframe instead of performing a cross join (which should stay empty). This can resurrect filtered-out rows or raise a KeyError later when reindexing to original_columns. Consider tracking initialization separately (e.g., start with df=None and check df is None, or check df.shape[1] == 0 and initial_df is None) and always pd.merge(..., how='cross') when initial_df is provided (even if empty).

Copilot · 2026-04-02T00:33:48Z

baybe/searchspace/utils.py

+
+    # Initialize the dataframe
+    if initial_df is not None:
+        df = initial_df


This function assigns df = initial_df and then mutates df in-place via drop(..., inplace=True), which will also mutate the caller-provided initial_df. If callers reuse that dataframe (or if this helper becomes public), this can cause unexpected side effects. Safer option: copy initial_df at the start (df = initial_df.copy(deep=False) or deep copy if needed).

Suggested change

df = initial_df

# Work on a copy to avoid mutating the caller-provided initial_df

df = initial_df.copy(deep=False)

Copilot · 2026-04-02T00:33:49Z

baybe/searchspace/discrete.py

+        # Validate minimum number of simplex parameters
+        if len(simplex_parameters) < 2:
+            raise ValueError(
+                f"'{cls.from_simplex.__name__}' requires at least 2 simplex "
+                f"parameters but got {len(simplex_parameters)}."
+            )


from_simplex now raises when len(simplex_parameters) < 2. This is a behavioral change versus allowing simplex_parameters=[] (product-only) or a single simplex parameter. If this is intentional, please document it as a breaking change (and consider updating validate_simplex_subspace_from_config to mirror this validation so config validation and construction behave consistently). If it’s not intentional, consider restoring support for the degenerate cases (e.g., delegating to from_product when simplex_parameters is empty).

Suggested change

# Validate minimum number of simplex parameters

if len(simplex_parameters) < 2:

raise ValueError(

f"'{cls.from_simplex.__name__}' requires at least 2 simplex "

f"parameters but got {len(simplex_parameters)}."

)

# Handle degenerate cases for backwards compatibility

# - No simplex parameters: fall back to a pure product space.

# - Single simplex parameter: treat it as a regular product parameter.

if len(simplex_parameters) == 0:

return cls.from_product(

product_parameters=product_parameters,

constraints=constraints,

)

if len(simplex_parameters) == 1:

merged_product_parameters = [*simplex_parameters, *product_parameters]

return cls.from_product(

product_parameters=merged_product_parameters,

constraints=constraints,

)

Copilot · 2026-04-02T00:33:49Z

baybe/constraints/discrete.py

+        if len(params) < 2:
+            raise UnsupportedEarlyFilteringError(
+                f"'{self.__class__.__name__}' requires at least 2 available "
+                f"parameters for filtering but got {len(params)}."
+            )


DiscreteNoLabelDuplicatesConstraint.get_invalid raises UnsupportedEarlyFilteringError when fewer than 2 parameters are available. However, if this constraint is configured with only a single parameter, this will raise even during full evaluation (breaking the previous behavior where it was trivially always-valid). Consider either validating at construction time that at least 2 parameters are required (raising ValueError), or treating the 1-parameter case as a no-op (return empty index) so full filtering paths don’t error.

Suggested change

if len(params) < 2:

raise UnsupportedEarlyFilteringError(

f"'{self.__class__.__name__}' requires at least 2 available "

f"parameters for filtering but got {len(params)}."

)

if not params:

raise UnsupportedEarlyFilteringError(

f"'{self.__class__.__name__}' has no available parameters "

f"for filtering."

)

if len(params) == 1:

# With only a single available parameter, there can be no duplicates

# across parameters, so this constraint is trivially satisfied.

return data.index[:0]

Copilot · 2026-04-02T00:33:49Z

baybe/constraints/discrete.py

    @override
    def get_invalid(self, data: pd.DataFrame) -> pd.Index:
-        mask_bad = data[self.parameters].nunique(axis=1) != 1
+        params = [p for p in self.parameters if p in set(data.columns)]
+        if len(params) < 2:
+            raise UnsupportedEarlyFilteringError(
+                f"'{self.__class__.__name__}' requires at least 2 available "
+                f"parameters for filtering but got {len(params)}."
+            )
+        mask_bad = data[params].nunique(axis=1) != 1


Same issue as the no-label-duplicates constraint: if DiscreteLinkedParametersConstraint is configured with a single parameter, the constraint is trivially satisfied but this implementation raises UnsupportedEarlyFilteringError even when all parameters are present. Consider validating min_len(2) for parameters or returning an empty index for the single-parameter case.

Copilot · 2026-04-02T00:33:49Z

CHANGELOG.md

+### Changed
+- Discrete search space construction now applies constraints incrementally during
+  Cartesian product building, significantly reducing memory usage and construction
+  time for constrained spaces
+- Polars path in discrete search space construction now builds the Cartesian product
+  only for parameters involved in Polars-capable constraints, merging the rest
+  incrementally via pandas


The changelog mentions incremental constraint application and the new UnsupportedEarlyFilteringError, but it doesn’t mention the new from_simplex restriction requiring at least 2 simplex parameters. If this restriction is intended/user-facing, it should be noted under “Changed” or “Breaking Changes” to avoid surprising downstream users.

Copilot · 2026-04-02T00:33:50Z

tests/constraints/test_constrained_cartesian_product.py

+def test_constrained_cartesian_product(scenario):
+    """Verify incremental and naive product construction produce identical results."""
+    parameters, constraints = scenario()
+
+    # Naive approach: full product then filter
+    df_naive = parameter_cartesian_prod_pandas(parameters)
+    _apply_constraint_filter_pandas(df_naive, constraints)
+
+    # Incremental approach
+    df_incremental = parameter_cartesian_prod_pandas_constrained(
+        parameters, constraints
+    )
+


Consider adding a regression test for the initial_df-empty case in parameter_cartesian_prod_pandas_constrained (e.g., initial_df with correct columns but 0 rows, then merging additional parameters should keep 0 rows). This would have caught the current df.empty initialization bug and protects future refactors.

Scienfitz self-assigned this Mar 31, 2026

Scienfitz added the enhancement Expand / change existing functionality label Mar 31, 2026

Scienfitz added this to the 0.15.0 milestone Mar 31, 2026

Scienfitz added 2 commits April 2, 2026 00:58

Add UnsupportedEarlyFilteringError exception

8193e48

Add incremental filtering interface to DiscreteConstraint

ec36345

Scienfitz force-pushed the feature/smart_cartesian_product_constraints branch from 46e3acc to 6a33c52 Compare April 1, 2026 23:11

Scienfitz added 6 commits April 2, 2026 01:44

Add incremental filtering to discrete constraint implementations

5f57d36

Add searchspace utility functions

57fd580

Rework from_product and from_simplex to use constrained Cartesian pro…

a8f627c

…duct

Update test imports for moved utility functions

bd64014

Add tests for constrained Cartesian product

70713c5

Update CHANGELOG

503fef9

Scienfitz force-pushed the feature/smart_cartesian_product_constraints branch from 6a33c52 to 503fef9 Compare April 1, 2026 23:48

Scienfitz commented Apr 2, 2026

View reviewed changes

Scienfitz marked this pull request as ready for review April 2, 2026 00:27

Scienfitz requested review from AVHopp and AdrianSosic as code owners April 2, 2026 00:27

Copilot AI review requested due to automatic review settings April 2, 2026 00:27

Copilot started reviewing on behalf of Scienfitz April 2, 2026 00:28 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Smartly Apply Constraints During Cartesian Product#773

Smartly Apply Constraints During Cartesian Product#773
Scienfitz wants to merge 8 commits intomainfrom
feature/smart_cartesian_product_constraints

Scienfitz commented Mar 31, 2026 •

edited

Loading

Uh oh!

Scienfitz Apr 2, 2026

Uh oh!

Scienfitz Apr 2, 2026

Uh oh!

Scienfitz Apr 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	df = initial_df
	# Work on a copy to avoid mutating the caller-provided initial_df
	df = initial_df.copy(deep=False)

-        # Validate minimum number of simplex parameters
-        if len(simplex_parameters) < 2:
-            raise ValueError(
-                f"'{cls.from_simplex.__name__}' requires at least 2 simplex "
-                f"parameters but got {len(simplex_parameters)}."
-            )
+        # Handle degenerate cases for backwards compatibility
+        # - No simplex parameters: fall back to a pure product space.
+        # - Single simplex parameter: treat it as a regular product parameter.
+        if len(simplex_parameters) == 0:
+            return cls.from_product(
+                product_parameters=product_parameters,
+                constraints=constraints,
+            )
+        if len(simplex_parameters) == 1:
+            merged_product_parameters = [*simplex_parameters, *product_parameters]
+            return cls.from_product(
+                product_parameters=merged_product_parameters,
+                constraints=constraints,
+            )

-        if len(params) < 2:
-            raise UnsupportedEarlyFilteringError(
-                f"'{self.__class__.__name__}' requires at least 2 available "
-                f"parameters for filtering but got {len(params)}."
-            )
+        if not params:
+            raise UnsupportedEarlyFilteringError(
+                f"'{self.__class__.__name__}' has no available parameters "
+                f"for filtering."
+            )
+        if len(params) == 1:
+            # With only a single available parameter, there can be no duplicates
+            # across parameters, so this constraint is trivially satisfied.
+            return data.index[:0]

Conversation

Scienfitz commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale

Good To Know

Mini Benchmark:

Uh oh!

Scienfitz Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Scienfitz Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Scienfitz Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Scienfitz commented Mar 31, 2026 •

edited

Loading