✨ Add stepwise intermediate reward for RL#526
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an approximate reward module and integrates a dual-path reward computation into the RL environment with device-average caching and reward shaping; updates PropertySet-driven postprocessing and action typings; small ML/utils and changelog edits; accompanying test updates. Changes
Sequence Diagram(s)sequenceDiagram
participant Agent as RL Agent
participant Env as PredictorEnv
participant Reward as calculate_reward()
participant Approx as approx_reward module
participant Cache as Device Averages Cache
participant Device as Target Device
Agent->>Env: step(action)
Env->>Env: _apply_and_update(action) -> altered_qc
Env->>Reward: calculate_reward(pre_qc, mode="auto")
Reward-->>Env: reward_pre, kind_pre
Env->>Reward: calculate_reward(altered_qc, mode="auto")
alt exact path (native & mapped)
Reward->>Device: query exact metrics / simulation
Device-->>Reward: exact evaluation
else approx path
Reward->>Cache: _ensure_device_averages_cached()
Cache->>Device: read per-gate errors & durations
Device-->>Cache: per-gate maps
Reward->>Approx: approx_estimated_success_probability(...)
Approx-->>Reward: approx reward
end
Reward-->>Env: reward_post, kind_post
Env->>Env: compute delta, apply reward_scale & no_effect_penalty
Env-->>Agent: observation, shaped_reward, done, info
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 14
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (13)
CHANGELOG.md(3 hunks)docs/setup.md(1 hunks)noxfile.py(3 hunks)pyproject.toml(1 hunks)src/mqt/predictor/ml/predictor.py(5 hunks)src/mqt/predictor/reward.py(1 hunks)src/mqt/predictor/rl/cost_model.py(1 hunks)src/mqt/predictor/rl/parsing.py(1 hunks)src/mqt/predictor/rl/predictorenv.py(10 hunks)src/mqt/predictor/utils.py(1 hunks)tests/compilation/test_predictor_rl.py(5 hunks)tests/device_selection/test_predictor_ml.py(3 hunks)tests/hellinger_distance/test_estimated_hellinger_distance.py(1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-10-14T14:37:38.047Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.
Applied to files:
CHANGELOG.md
📚 Learning: 2025-11-05T09:23:46.540Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1287
File: src/qdmi/dd/Device.cpp:492-521
Timestamp: 2025-11-05T09:23:46.540Z
Learning: In the munich-quantum-toolkit/core repository, the `qasm3::Importer::imports()` function is backwards compatible with OpenQASM 2 programs. Therefore, it can be used to parse both QASM2 and QASM3 program formats without requiring separate importers for each version.
Applied to files:
src/mqt/predictor/rl/parsing.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.
Applied to files:
src/mqt/predictor/rl/parsing.pysrc/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-10-10T08:09:54.528Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: bindings/CMakeLists.txt:0-0
Timestamp: 2025-10-10T08:09:54.528Z
Learning: In the Munich Quantum Toolkit (MQT) Core project, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which automatically prefixes all CMake `DESTINATION` paths with `mqt/core/` during wheel installation. Therefore, CMake install destinations are relative to the `mqt/core` package namespace, not `site-packages`.
Applied to files:
pyproject.toml
📚 Learning: 2025-10-10T08:10:16.394Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: test/python/na/test_na_fomac.py:35-0
Timestamp: 2025-10-10T08:10:16.394Z
Learning: In the munich-quantum-toolkit/core repository, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which means CMake `install()` commands with `DESTINATION <path>` install files relative to `mqt/core/` in the wheel, making them accessible via `files("mqt.core").joinpath("<path>")`.
Applied to files:
pyproject.toml
📚 Learning: 2025-12-02T10:08:36.022Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: .github/workflows/cd.yml:40-48
Timestamp: 2025-12-02T10:08:36.022Z
Learning: In munich-quantum-toolkit workflows, the reusable-python-packaging-sdist.yml workflow uploads the sdist artifact with the name `cibw-sdist` (or `dev-cibw-sdist` for pull requests), which means it is covered by the `cibw-*` glob pattern when downloading artifacts for deployment.
Applied to files:
pyproject.toml
📚 Learning: 2025-11-27T16:58:08.564Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/ddsim PR: 720
File: noxfile.py:57-60
Timestamp: 2025-11-27T16:58:08.564Z
Learning: In Nox sessions, when using `session.install()` to install a tool and then `session.run(..., external=True)` to execute it, this pattern is valid because the session's virtual environment is activated and its bin directory is on PATH. The `external=True` flag allows calling executables on PATH, which includes the activated venv's bin directory, so it will find tools installed via `session.install()`.
Applied to files:
noxfile.py
🧬 Code graph analysis (3)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
src/mqt/predictor/ml/predictor.py (2)
compile_training_circuits(206-240)generate_training_data(242-300)
src/mqt/predictor/reward.py (2)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
device(48-50)tests/compilation/test_reward.py (1)
device(38-40)
tests/compilation/test_predictor_rl.py (2)
src/mqt/predictor/rl/cost_model.py (2)
canonical_cost(161-173)get_cost_table(140-158)src/mqt/predictor/rl/predictorenv.py (1)
calculate_reward(272-355)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py
210-210: Logging statement uses f-string
(G004)
241-241: Logging statement uses f-string
(G004)
269-269: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
666-668: try-except-continue detected, consider logging the exception
(S112)
666-666: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (19)
noxfile.py (2)
21-21: LGTM!The
pathlibimport is necessary for the filesystem operations in the new_cleanupfunction.
112-113: LGTM!The conditional cleanup call is appropriately placed and only runs in CI environments to save disk space. The pattern is consistent with other CI checks in the file.
CHANGELOG.md (1)
14-15: LGTM! Changelog entries follow the project template.The changelog entries, PR links, and contributor reference are correctly formatted and consistent with the project's changelog template. Based on learnings, the format matches the expected template.
Also applies to: 49-50, 68-68
pyproject.toml (1)
97-98: LGTM! Packaging configuration correctly added.The
check-sdistconfiguration correctly specifies that_version.pyshould only be present in source distributions (sdist), which is the expected behavior for version files generated byhatch-vcsduring the build process.src/mqt/predictor/utils.py (1)
54-54: LGTM! Appropriate logging level for platform limitation.The change from
warnings.warnwithRuntimeWarningtologger.infois appropriate. Since running without timeout on Windows is a known platform limitation rather than an error condition, an informational log message is more suitable than a warning.tests/device_selection/test_predictor_ml.py (1)
46-46: LGTM! Test benchmark level updated consistently.The change from
BenchmarkLevel.ALGtoBenchmarkLevel.INDEPis consistent with the PR objective and changelog entry about updating the test circuit level for the RL predictor. All test cases in this file have been updated consistently.Also applies to: 66-66, 96-96
src/mqt/predictor/ml/predictor.py (4)
64-64: LGTM! Platform-specific workaround for multiprocessing issue.The
NO_PARALLELconstant correctly identifies the Windows + Python 3.13+ combination where joblib's "loky" backend is broken. This is a well-documented issue with the missing_posixsubprocessmodule.
232-235: LGTM! Clear documentation and appropriate fallback.The comment clearly explains the Windows/Python 3.13 multiprocessing limitation, and the fallback to single-threaded execution (
num_jobs=1) is the appropriate workaround for this platform issue.
275-276: LGTM! Consistent application of multiprocessing workaround.The same
NO_PARALLELconditional is correctly applied to thegenerate_training_datamethod, ensuring consistent behavior across all parallel execution paths.
415-418: LGTM! Multiprocessing workaround applied to ML training.The
NO_PARALLELconditional is correctly applied to the GridSearchCV training, ensuring the RandomForest model training also respects the Windows/Python 3.13 limitation.docs/setup.md (1)
112-112: LGTM! Documentation updated to reflect new benchmark level.The documentation example has been updated from
BenchmarkLevel.ALGtoBenchmarkLevel.INDEP, which is consistent with the changes in test files and the changelog entry about updating the test circuit level for the RL predictor.tests/compilation/test_predictor_rl.py (3)
46-54: LGTM!The switch from
BenchmarkLevel.ALGtoBenchmarkLevel.INDEPaligns with the PR's focus on device-independent circuits for RL training. The test logic remains correct.
108-115: LGTM!Good addition of mapping verification. The test now validates both native gates and proper coupling map adherence, which is essential for the new dual-path reward system.
156-167: LGTM!Comprehensive test coverage for cost model fallback behavior. The identity check (
is TORINO_CANONICAL_COSTS) correctly verifies the exact table is returned rather than a copy.src/mqt/predictor/rl/cost_model.py (2)
1-35: LGTM!Well-documented module header and type definitions. The
CanonicalCostTabletype alias is appropriately defined as aMappingfor read-only access.
211-237: LGTM!The approximate fidelity calculation is correct. The defensive
max(n_1q, 0)guards and final clamping to[0, 1]ensure numerical robustness.src/mqt/predictor/rl/predictorenv.py (3)
179-188: LGTM!Good addition of instance variables for the dual-path reward system. The device averages cache (
_p1_avg,_p2_avg, etc.) and reward tracking variables (prev_reward,prev_reward_kind) are appropriately initialized.
731-740: LGTM!The method correctly checks both native gate basis and coupling map compliance. The logic is clear and matches the determine_valid_actions_for_state pattern.
458-464: Undefined methodfom_aware_compileand attributemax_iter.The code references
self.fom_aware_compileandself.max_iter, but their definitions cannot be verified without access to the codebase. If these are not defined in the class or inherited from a parent class, this will raise anAttributeErrorat runtime when a stochastic action is applied.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/mqt/predictor/rl/predictorenv.py (2)
269-269: Remove unusednoqadirective.Static analysis indicates
SLF001is not enabled, making this directive unnecessary.- self.state._layout = self.layout # noqa: SLF001 + self.state._layout = self.layout
256-264: Clarify symmetric reward scaling intent.Both positive and negative delta branches multiply by
reward_scale, whileno_effect_penaltyis only used for zero delta. The naming suggests asymmetric behavior but implementation is symmetric. Consider renaming for clarity:- self.reward_scale = 1 - self.no_effect_penalty = 0.001 + self.delta_scale = 1 + self.no_progress_penalty = 0.001And update usage accordingly. This makes the symmetric scaling intent explicit.
src/mqt/predictor/rl/cost_model.py (1)
300-301: Remove redundant initialization oft_hat.Line 300 initializes
t_hat = 0.0but it's immediately overwritten on line 301. This is dead code.- t_hat = 0.0 t_hat = (n_1q * tau1_avg + n_2q * tau2_avg) / k_eff
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
src/mqt/predictor/reward.py(1 hunks)src/mqt/predictor/rl/cost_model.py(1 hunks)src/mqt/predictor/rl/predictorenv.py(10 hunks)tests/compilation/test_predictor_rl.py(5 hunks)tests/hellinger_distance/test_estimated_hellinger_distance.py(1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-11-01T15:57:31.153Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.
Applied to files:
src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-10-09T13:20:11.483Z
Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.
Applied to files:
src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (4)
tests/hellinger_distance/test_estimated_hellinger_distance.py (2)
src/mqt/predictor/ml/predictor.py (2)
compile_training_circuits(206-240)generate_training_data(242-300)tests/device_selection/test_predictor_ml.py (2)
path_uncompiled_circuits(27-29)path_compiled_circuits(33-35)
tests/compilation/test_predictor_rl.py (2)
src/mqt/predictor/rl/cost_model.py (2)
canonical_cost(171-183)get_cost_table(150-168)src/mqt/predictor/rl/predictorenv.py (1)
calculate_reward(272-356)
src/mqt/predictor/rl/predictorenv.py (5)
src/mqt/predictor/rl/actions.py (4)
CompilationOrigin(92-98)DeviceDependentAction(136-146)PassType(101-110)Action(114-127)src/mqt/predictor/rl/cost_model.py (2)
approx_estimated_success_probability(250-308)approx_expected_fidelity(221-247)src/mqt/predictor/rl/helper.py (1)
create_feature_dict(72-94)src/mqt/predictor/rl/parsing.py (1)
postprocess_vf2postlayout(229-250)src/mqt/predictor/utils.py (1)
calc_supermarq_features(93-145)
src/mqt/predictor/reward.py (2)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
device(48-50)tests/compilation/test_reward.py (1)
device(38-40)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py
210-210: Logging statement uses f-string
(G004)
241-241: Logging statement uses f-string
(G004)
269-269: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
🔇 Additional comments (13)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
219-226: LGTM! API calls correctly updated to use keyword arguments.The
compile_training_circuitsandgenerate_training_datacalls now properly use keyword arguments, aligning with the updated function signatures insrc/mqt/predictor/ml/predictor.py. The removal ofnum_workersis consistent with the newNO_PARALLELconstant handling parallelism internally.src/mqt/predictor/rl/cost_model.py (4)
41-147: Well-structured canonical cost tables with clear documentation.The per-device cost tables are well-organized with clear comments explaining the decomposition rationale. The registry pattern with
DEVICE_CANONICAL_COSTSprovides clean extensibility.
150-168: Safe fallback behavior with appropriate warnings.The unknown-device fallback to Torino with both a warning and log message ensures visibility while maintaining functionality. The
stacklevel=3is appropriate for the call chain.
186-218: Robust gate cost estimation with conservative fallbacks.The iteration pattern correctly uses the modern
CircuitInstructionaccess. The conservative fallback for unknown gates (3×1q or 1×2q+4×1q) is reasonable for approximation purposes.
221-247: Clean implementation of approximate fidelity and ESP calculations.Both functions properly clamp outputs to [0, 1] and handle edge cases. The ESP model's idle-time penalty with liveness modulation is a reasonable approximation approach.
Also applies to: 250-308
src/mqt/predictor/reward.py (1)
206-212: Proper handling of discrete time unit with safe fallback.The
device.dt or 1.0fallback correctly handles backends without discrete-time calibration, preventingTypeErrorwhendevice.dtisNone. The dt-scaled duration is appropriately used in the exponential decay calculation.tests/compilation/test_predictor_rl.py (3)
156-167: Good coverage for unknown-device fallback and unknown-gate defaults.The test properly verifies that unknown devices trigger a
UserWarningand fall back toTORINO_CANONICAL_COSTS, and that unknown gates return(0, 0). This aligns with the behavior defined inget_cost_tableandcanonical_cost.
170-219: Comprehensive test coverage for dual-path reward calculation.The test covers all three modes (
exact,approx,auto) for ESP and verifies thatcritical_depthalways returns exact results. The addition of the auto-exact path test (lines 197-200) addresses the previous review suggestion. The monkeypatch approach for forcing the approx path is appropriate.
108-115: Good addition of mapping validation.The
CheckMapassertion ensures the compiled circuit respects the device's coupling map, complementing the existing native-gates check.src/mqt/predictor/rl/predictorenv.py (4)
272-356: Well-designed dual-path reward calculation with proper mode handling.The
calculate_rewardmethod cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto mode correctly uses_is_native_and_mappedto determine the appropriate path. Thedevice_idparameter is now correctly passed to bothapprox_expected_fidelityandapprox_estimated_success_probability.
608-730: Thorough device calibration caching with proper error handling.The
_ensure_device_averages_cachedmethod properly aggregates 1q/2q error rates, durations, and coherence times. Key strengths:
- Handles
dtscaling for durations correctly- Tries flipped orientation for uni-directional couplings (lines 690-691)
- Uses specific exceptions
(KeyError, AttributeError)per past review feedback- Provides informative error messages when calibration data is missing
732-741: Clean helper for native/mapped circuit detection.The
_is_native_and_mappedmethod properly combines both checks usingGatesInBasisandCheckMappasses, returning a boolean suitable for auto-mode reward selection.
190-270: Solid step-wise reward shaping implementation.The refactored
step()method implements the shaped reward signal as described in the PR objectives:
- Pre-action reward evaluation establishes baseline
- Post-action delta computation drives intermediate rewards
- The
prev_kind == "approx" and new_kind == "exact"check (line 253-254) correctly handles regime transitions- Terminal actions use exact metrics for final reward
One note: The logging statements at lines 210 and 241 use f-strings (flagged by Ruff G004), but this is a stylistic preference and doesn't affect functionality.
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/mqt/predictor/rl/predictorenv.py (2)
269-269: Remove unusednoqadirective.Static analysis indicates this directive is unnecessary since
SLF001is not enabled in the linter configuration.- self.state._layout = self.layout # noqa: SLF001 + self.state._layout = self.layout
458-458: Type annotation mismatch persists.The variable is annotated as
PropertySet | Nonebut initialized to{}. SincePropertySetis a Qiskit dict-like class, consider initializing toNoneand updating the downstream logic, or change the annotation todict[str, Any] | Noneto match the runtime type.- pm_property_set: PropertySet | None = {} + pm_property_set: dict[str, Any] | None = NoneThis would require updating line 477/484 to handle the
Noneinitialization:pm_property_set = dict(pm.property_set) if hasattr(pm, "property_set") else None
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
src/mqt/predictor/rl/predictorenv.py(10 hunks)tests/compilation/test_predictor_rl.py(5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (6)
src/mqt/predictor/reward.py (4)
crit_depth(40-43)esp_data_available(229-324)estimated_success_probability(102-226)expected_fidelity(46-74)src/mqt/predictor/rl/actions.py (5)
CompilationOrigin(92-98)DeviceDependentAction(136-146)PassType(101-110)get_actions_by_pass_type(462-467)Action(114-127)src/mqt/predictor/rl/cost_model.py (2)
approx_estimated_success_probability(250-308)approx_expected_fidelity(221-247)src/mqt/predictor/rl/helper.py (2)
create_feature_dict(72-94)get_path_training_circuits(107-109)src/mqt/predictor/rl/parsing.py (1)
postprocess_vf2postlayout(229-250)src/mqt/predictor/utils.py (1)
calc_supermarq_features(93-145)
🪛 Ruff (0.14.7)
tests/compilation/test_predictor_rl.py
227-227: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
src/mqt/predictor/rl/predictorenv.py
210-210: Logging statement uses f-string
(G004)
241-241: Logging statement uses f-string
(G004)
269-269: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
🔇 Additional comments (9)
src/mqt/predictor/rl/predictorenv.py (5)
179-188: LGTM!The renamed variables (
reward_scale,no_effect_penalty) are clearer than the previous names. The device-average caching variables are well-structured for the approximate reward path.
272-356: LGTM!The dual-path reward calculation is well-structured with clear mode selection logic. The
device_idis now correctly passed to the approximate reward functions, and the fallback behavior for unsupported reward functions is appropriately logged.
495-526: LGTM!The refactored method properly accepts the property set as a parameter, improving testability and decoupling from the PassManager. The early return for falsy
pm_property_sethandles edge cases appropriately.
732-741: LGTM!Clean helper method that encapsulates the native-and-mapped check logic. The use of Qiskit's analysis passes (
GatesInBasis,CheckMap) is appropriate.
253-254: Important regime-switch handling.Setting
delta_reward = 0.0when transitioning from approximate to exact mode prevents numerical artifacts from comparing metrics computed with different methods. This is a sound design choice.tests/compilation/test_predictor_rl.py (4)
35-44: LGTM!The new imports support testing the cost model fallback behavior and the dual-path reward computation. The
TYPE_CHECKINGpattern forMonkeyPatchis correctly used.
105-115: LGTM!Adding explicit validation via
GatesInBasisandCheckMapstrengthens the test by verifying the compiled circuit meets device constraints. The descriptive assertion messages aid debugging.
156-167: LGTM!Good coverage of cost model edge cases: unknown device fallback and unknown gate default. The test properly verifies both the warning emission and the returned values.
170-235: LGTM!Comprehensive test coverage for the dual-path reward computation:
- ESP exact/approx/auto modes
- Auto mode selecting exact for native+mapped circuits (addresses previous review feedback)
- RuntimeError on broken Target API
- Critical depth always returning exact
The monkeypatch usage for forcing the approximate path is appropriate.
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (2)
src/mqt/predictor/rl/predictorenv.py (2)
269-269: Unusednoqadirective.Static analysis confirms
SLF001is not enabled. Remove the directive.- self.state._layout = self.layout # noqa: SLF001 + self.state._layout = self.layout
458-458: Type annotation mismatch.The variable is annotated as
PropertySet | Nonebut initialized to an emptydict. Either initialize asNoneor change the annotation todict[str, Any]:- pm_property_set: PropertySet | None = {} + pm_property_set: dict[str, Any] | None = NoneThen adjust the stochastic branch return type expectation accordingly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
src/mqt/predictor/rl/predictorenv.py(10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (4)
src/mqt/predictor/reward.py (1)
crit_depth(40-43)src/mqt/predictor/rl/helper.py (1)
get_path_training_circuits(107-109)src/mqt/predictor/rl/parsing.py (2)
final_layout_bqskit_to_qiskit(186-226)postprocess_vf2postlayout(229-250)src/mqt/predictor/utils.py (1)
calc_supermarq_features(93-145)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py
210-210: Logging statement uses f-string
(G004)
241-241: Logging statement uses f-string
(G004)
269-269: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
🔇 Additional comments (3)
src/mqt/predictor/rl/predictorenv.py (3)
45-75: LGTM on import organization.The imports are well-organized, with appropriate separation of type-checking imports and runtime imports. The new imports for
approx_expected_fidelity,approx_estimated_success_probability, andcalc_supermarq_featuresalign with the dual-path reward computation feature.
272-356: Well-structured dual-path reward computation.The
calculate_rewardmethod cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto-mode logic for selecting the appropriate path based on circuit state is sound. Previous feedback about missingdevice_idhas been addressed.
495-526: LGTM on layout postprocessing refactor.The method now accepts
pm_property_setdirectly as a dict, which is cleaner than passing a PassManager. The early return on line 501-502 properly handles the None/empty case.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (5)
src/mqt/predictor/rl/predictorenv.py (5)
17-18: Empty conditional block should be removed.This empty block with just
passserves no purpose. If it was a placeholder for future type-only imports on Python 3.11+, add a comment explaining the intent; otherwise remove the block entirely.-if sys.version_info >= (3, 11) and TYPE_CHECKING: # pragma: no cover - pass
274-274: Remove unusednoqadirective.Static analysis flags this
noqa: SLF001as unused. The directive can be removed.- self.state._layout = self.layout # noqa: SLF001 + self.state._layout = self.layout
463-463: Type annotation mismatch persists.The variable is annotated as
PropertySet | Nonebut initialized as{}(empty dict). Later it's assigneddict(pm.property_set). For type consistency, consider:- pm_property_set: PropertySet | None = {} + pm_property_set: dict[str, Any] | None = NoneThen ensure the downstream code handles
Noneappropriately (which it already does at line 506).
711-713: Consider logging a warning for partial calibration data.The check raises
RuntimeErrorwhen bothp1andp2are empty, which is correct. However, if only one list is populated (e.g., a device with only single-qubit gates), a warning would be helpful for debugging.if not p1 and not p2: msg = "No valid 1q/2q calibration data found in Target, cannot compute approximate reward." raise RuntimeError(msg) + if not p1: + logger.warning("No single-qubit calibration data found; _p1_avg will be 0.0") + elif not p2: + logger.warning("No two-qubit calibration data found; _p2_avg will be 0.0")
737-746: Consider caching analysis pass instances for performance.
_is_native_and_mappedcreates newGatesInBasisandCheckMapinstances on every call. During RL training, this method is invoked frequently viacalculate_reward. Caching these passes would improve performance.Initialize cached passes in
__init__:# In __init__: self._gates_in_basis_check: GatesInBasis | None = None self._check_map: CheckMap | None = NoneThen lazily initialize in
_is_native_and_mapped:def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool: - check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names) + if self._gates_in_basis_check is None: + self._gates_in_basis_check = GatesInBasis(basis_gates=self.device.operation_names) + check_nat_gates = self._gates_in_basis_check check_nat_gates(qc) only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"] - check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map())) + if self._check_map is None: + self._check_map = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map())) + check_mapping = self._check_map check_mapping(qc) mapped = check_mapping.property_set["is_swap_mapped"] return bool(only_nat_gates and mapped)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
src/mqt/predictor/rl/predictorenv.py(11 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (5)
src/mqt/predictor/reward.py (3)
crit_depth(40-43)estimated_success_probability(102-226)expected_fidelity(46-74)src/mqt/predictor/rl/cost_model.py (2)
approx_estimated_success_probability(250-308)approx_expected_fidelity(221-247)src/mqt/predictor/rl/helper.py (2)
create_feature_dict(72-94)get_path_training_circuits(107-109)src/mqt/predictor/rl/parsing.py (3)
final_layout_bqskit_to_qiskit(186-226)final_layout_pytket_to_qiskit(167-183)postprocess_vf2postlayout(229-250)src/mqt/predictor/utils.py (1)
calc_supermarq_features(93-145)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py
214-214: Logging statement uses f-string
(G004)
245-245: Logging statement uses f-string
(G004)
274-274: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
🔇 Additional comments (6)
src/mqt/predictor/rl/predictorenv.py (6)
88-89: LGTM: Penalty sign corrected and parameters made configurable.The
no_effect_penaltyis now correctly initialized as a negative value (-0.001), and bothreward_scaleandno_effect_penaltyare exposed as constructor parameters for tuning during RL experimentation. This addresses previous review feedback.
257-259: Good handling of regime transition.The comment clearly explains why
delta_reward = 0.0when transitioning from approximate to exact metrics—the two aren't directly comparable. This is the correct approach to avoid misleading reward signals.
277-319: Well-structured dual-path reward computation.The
calculate_rewardmethod cleanly implements the exact vs. approximate reward paths with a clear mode selection mechanism. The fallback for unsupported reward functions with a warning is good defensive coding.
500-531: LGTM: Layout postprocessing signature updated correctly.The method signature now accepts
dict[str, Any] | Nonewhich aligns with how it's called (passingdict(pm.property_set)). The early return for falsypm_property_setis a good guard.
229-231: Good defensive normalization for high-level gates.Decomposing
unitaryandcliffordgates before reward computation ensures the reward functions can properly analyze the circuit. This prevents errors from gates that don't have calibration data.
130-133: Verify whether this action registration order change breaks pre-trained model compatibility.Registering OPT actions before LAYOUT actions changes the indices in
action_set. While web search suggests PassType operates as a flexible tagging system (with the RL agent dynamically determining execution order), the action indices themselves may still be significant if:
- Pre-trained models encode action indices as state features
- Saved experience or replay buffers reference specific action indices
- Model weights learned associations with particular action index ranges
Confirm whether existing trained agents or datasets depend on action registration indices, or if this reordering is safe.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
268-269: Remove duplicate[tool.check-sdist]configuration.This section duplicates lines 97-98. TOML does not allow duplicate table headers, and parsers will reject this file, causing build failures.
Apply this diff to remove the duplicate:
-[tool.check-sdist] -sdist-only = ["src/mqt/predictor/_version.py"]
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
CHANGELOG.md(2 hunks)pyproject.toml(2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-10-10T08:09:54.528Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: bindings/CMakeLists.txt:0-0
Timestamp: 2025-10-10T08:09:54.528Z
Learning: In the Munich Quantum Toolkit (MQT) Core project, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which automatically prefixes all CMake `DESTINATION` paths with `mqt/core/` during wheel installation. Therefore, CMake install destinations are relative to the `mqt/core` package namespace, not `site-packages`.
Applied to files:
pyproject.toml
📚 Learning: 2025-10-10T08:10:16.394Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: test/python/na/test_na_fomac.py:35-0
Timestamp: 2025-10-10T08:10:16.394Z
Learning: In the munich-quantum-toolkit/core repository, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which means CMake `install()` commands with `DESTINATION <path>` install files relative to `mqt/core/` in the wheel, making them accessible via `files("mqt.core").joinpath("<path>")`.
Applied to files:
pyproject.toml
📚 Learning: 2025-12-02T10:08:36.022Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: .github/workflows/cd.yml:40-48
Timestamp: 2025-12-02T10:08:36.022Z
Learning: In munich-quantum-toolkit workflows, the reusable-python-packaging-sdist.yml workflow uploads the sdist artifact with the name `cibw-sdist` (or `dev-cibw-sdist` for pull requests), which means it is covered by the `cibw-*` glob pattern when downloading artifacts for deployment.
Applied to files:
pyproject.toml
📚 Learning: 2025-10-14T14:37:38.047Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.
Applied to files:
CHANGELOG.md
📚 Learning: 2025-12-05T17:45:37.602Z
Learnt from: denialhaag
Repo: munich-quantum-toolkit/core PR: 1360
File: .github/workflows/reusable-mlir-tests.yml:40-43
Timestamp: 2025-12-05T17:45:37.602Z
Learning: In the munich-quantum-toolkit/core repository, patch releases of LLVM dependencies don't require documentation updates, changelog entries, or additional tests beyond what's validated by passing CI checks.
Applied to files:
CHANGELOG.md
📚 Learning: 2025-12-01T11:00:40.342Z
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: CHANGELOG.md:18-18
Timestamp: 2025-12-01T11:00:40.342Z
Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the CHANGELOG.md intentionally references the parent MQT Core repository's release notes (https://github.com/munich-quantum-toolkit/core/releases) because the plugin repository is based on work performed in the parent repository.
Applied to files:
CHANGELOG.md
🔇 Additional comments (3)
pyproject.toml (2)
97-98: LGTM!The
check-sdistconfiguration correctly ensures the hatch-vcs generated version file is included in source distributions.
130-130: LGTM!The warning filter appropriately suppresses expected warnings when tests exercise devices without defined canonical cost tables, consistent with the fallback behavior introduced in this PR.
CHANGELOG.md (1)
50-51: Changelog entries and PR links look good.Both new changelog entries are properly added to the Unreleased section, and the corresponding PR link references are correctly defined in the PR links section. The entries follow the existing format and conventions used elsewhere in the file. Once the formatting issue on line 15 is fixed, this will be ready.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
462-498: Consider clarifying the type annotation forpm_property_set.Line 463 declares
pm_property_set: PropertySet | Nonebut initializes it to{}(an empty dict). The downstream code (lines 482, 489) convertspm.property_settodict[str, Any], and line 503 checks for truthiness rather than None. For consistency:- pm_property_set: PropertySet | None = {} + pm_property_set: dict[str, Any] | None = NoneThen ensure downstream code handles
Noneappropriately (which it appears to do at line 506).
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
CHANGELOG.md(2 hunks)pyproject.toml(1 hunks)src/mqt/predictor/rl/predictorenv.py(11 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-10-14T14:37:38.047Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.
Applied to files:
CHANGELOG.md
📚 Learning: 2025-12-05T17:45:37.602Z
Learnt from: denialhaag
Repo: munich-quantum-toolkit/core PR: 1360
File: .github/workflows/reusable-mlir-tests.yml:40-43
Timestamp: 2025-12-05T17:45:37.602Z
Learning: In the munich-quantum-toolkit/core repository, patch releases of LLVM dependencies don't require documentation updates, changelog entries, or additional tests beyond what's validated by passing CI checks.
Applied to files:
CHANGELOG.md
📚 Learning: 2025-12-01T11:00:40.342Z
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: CHANGELOG.md:18-18
Timestamp: 2025-12-01T11:00:40.342Z
Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the CHANGELOG.md intentionally references the parent MQT Core repository's release notes (https://github.com/munich-quantum-toolkit/core/releases) because the plugin repository is based on work performed in the parent repository.
Applied to files:
CHANGELOG.md
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.
Applied to files:
src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (6)
src/mqt/predictor/reward.py (3)
crit_depth(40-43)estimated_success_probability(102-226)expected_fidelity(46-74)src/mqt/predictor/rl/actions.py (3)
CompilationOrigin(92-98)PassType(101-110)Action(114-127)src/mqt/predictor/rl/cost_model.py (2)
approx_estimated_success_probability(250-308)approx_expected_fidelity(221-247)src/mqt/predictor/rl/helper.py (1)
create_feature_dict(72-94)src/mqt/predictor/rl/parsing.py (1)
final_layout_bqskit_to_qiskit(186-226)src/mqt/predictor/utils.py (1)
calc_supermarq_features(93-145)
🪛 Ruff (0.14.8)
src/mqt/predictor/rl/predictorenv.py
214-214: Logging statement uses f-string
(G004)
245-245: Logging statement uses f-string
(G004)
274-274: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
🔇 Additional comments (8)
pyproject.toml (1)
127-128: LGTM! Test warning filter appropriately scoped.The new warning filter suppresses expected UserWarning messages from the canonical cost model when testing with devices that lack predefined cost tables. This is a reasonable approach to reduce test noise while still catching unexpected warnings.
CHANGELOG.md (2)
14-15: LGTM! Changelog entries follow project conventions.Both entries are properly formatted with the correct template structure. The previous formatting issue has been resolved.
50-51: LGTM! PR link references are correct.The PR links are properly formatted and correspond to the new changelog entries.
src/mqt/predictor/rl/predictorenv.py (5)
25-26: LGTM! Import additions support new dual-path reward functionality.The new imports (PropertySet, CheckMap, GatesInBasis, cost_model functions, calc_supermarq_features) are necessary for the dual-path reward computation and device-averages caching introduced in this PR.
Also applies to: 45-48, 58-69, 75-75
88-89: LGTM! Reward shaping parameters are now configurable.The new
reward_scaleandno_effect_penaltyparameters allow tuning reward behavior during RL training. The negative default forno_effect_penaltycorrectly penalizes actions that produce no improvement. The instance variables properly support the dual-path reward computation and device-averages caching.Also applies to: 97-98, 183-192
194-276: LGTM! Dual-path reward computation is well-structured.The rewritten
step()method correctly implements shaped intermediate rewards with proper handling of:
- Exact vs. approximate reward paths
- Regime transitions (zeroing delta when switching from approx to exact)
- Symmetric scaling for positive/negative deltas
- No-effect penalty for actions that don't change the circuit
The logic is clear and the approach aligns with the PR objectives.
Minor: Remove unused
noqadirective and use lazy logging.Static analysis flags three minor issues:
- Line 274: The
# noqa: SLF001directive is unused and should be removed.- Lines 214, 245: f-string logging should use lazy
%formatting for performance (e.g.,logger.info("Applying %s", self.action_set[action].name)).Apply this diff:
- logger.info(f"Applying {self.action_set[action].name}") + logger.info("Applying %s", self.action_set[action].name) # ... (code omitted) - logger.info(f"Final reward ({final_kind}): {final_val}") + logger.info("Final reward (%s): %s", final_kind, final_val) # ... (code omitted) - self.state._layout = self.layout # noqa: SLF001 + self.state._layout = self.layout
277-361: LGTM! Dual-path reward calculation is well-implemented.The
calculate_reward()method correctly implements the dual-path (exact vs. approximate) reward computation:
- The
modeparameter allows flexible control over which path to use- The "auto" mode intelligently selects based on whether the circuit is native and mapped
- Device-specific cost tables are correctly selected via
device_idparameter- Unsupported reward functions gracefully fall back to exact computation with a warning
The implementation aligns well with the PR objectives for shaped intermediate rewards.
500-531: LGTM! Layout postprocessing refactor aligns with property set handling.The signature change to accept
dict[str, Any] | Noneinstead of a PassManager is a cleaner design that decouples layout extraction from the PassManager object. The logic correctly handles missing property sets and properly constructs the TranspileLayout.
There was a problem hiding this comment.
Hey @Shaobo-Zhou once again thank you very much for your continued efforts - LGTM! There is really only a small nitpick from my end left.
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
src/mqt/predictor/rl/cost_model.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-09T13:20:11.483Z
Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.
Applied to files:
src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-11-01T15:57:31.153Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.
Applied to files:
src/mqt/predictor/rl/cost_model.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
🔇 Additional comments (5)
src/mqt/predictor/rl/cost_model.py (5)
1-36: LGTM: Clean module structure and imports.The module-level organization is solid: clear documentation, appropriate use of TYPE_CHECKING for forward references, and a logger for diagnostics. The
CanonicalCostTabletype alias usingMappingcorrectly conveys immutability.
42-148: LGTM: Device-specific cost tables are well-documented.The canonical cost tables correctly reflect different native gate sets across devices (CZ for Torino, iSWAP for Ankaa3, CZ for Emerald). The decomposition comments (e.g., CX → H•CZ•H ⇒ 6×1q + 1×2q) provide clear rationale. Gate coverage differences are intentional and align with each device's architecture.
151-169: LGTM: Safe fallback with appropriate warnings.The unknown-device fallback to Torino is well-communicated via both
warnings.warnand logger. The dual notification ensures visibility in different deployment contexts.
172-185: LGTM: Simple and correct delegation.The function correctly wraps
get_cost_tableand provides the documented (0, 0) fallback for unknown gates.
223-249: LGTM: Correct fidelity approximation with defensive bounds.The multiplicative fidelity model F = (1−p₁)^n₁ · (1−p₂)^n₂ is standard for independent gate errors. Defensive clamping and max() calls ensure numerical robustness.
There was a problem hiding this comment.
@Shaobo-Zhou just did a very quick review here in order to not hold this up for too long.
I think this will probably need one more round of iteration until it is ready to be merged. At the moment it feels like it's cutting a few corners too many.
Edit: There were also quite some CodeRabbit comments left, which should ideally be resolved.
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Around line 55-58: The CHANGELOG.md contains a link definition for [`#449`] that
is not referenced; either remove the unused link definition "[`#449`]:
https://github.com/munich-quantum-toolkit/predictor/pull/449" or add a
corresponding entry in the Unreleased changelog that references [`#449`] so
markdownlint no longer flags it as unused—update the Unreleased section to
include a line like "PR [`#449`] - <short description>" or delete the [`#449`] link
definition entirely.
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 171-172: The current code uses QuantumCircuit.size() and writes
cost_table[...] = (qc_trans.size(), 0), which incorrectly treats all transpiled
basis gates as 1-qubit; instead iterate qc_trans.data (or
qc_trans.count_ops()/instructions) to compute separate 1-qubit and 2-qubit
counts and store them as the tuple (n_1q, n_2q) in cost_table. Locate the
transpile call and the assignment to cost_table (qc_trans and cost_table
entries) and replace the single size() value with explicit counting by
instruction arity (e.g., count len(instr.qargs) == 1 vs == 2) so downstream
fidelity/ESP uses both 1q and 2q contributions.
- Around line 150-167: The add_gate_to_cost_table function is double-prefixing
gates like "crx"/"cu3"/"ccx" causing invalid method names; fix by detecting when
the supplied gate string already encodes control prefixes and avoid prepending
"c"*controls in that case (e.g., if gate.startswith("c") then use gate as-is or
strip/interpret its inherent control count rather than adding extra "c"s), and
for the multi-control group (entries like "ccx","c3x","c4x") either replace them
with their base gate names plus the correct controls (e.g., "x" with
controls=2/3/4) or list each gate with the accurate controls count so
add_gate_to_cost_table uses the correct method name (refer to
add_gate_to_cost_table, the gate list entries and the logic building gate_name).
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 312-321: calculate_reward currently falls through to the fallback
for the "estimated_hellinger_distance" mode because there is no explicit branch;
add a branch inside the same block (where crit_depth(qc) is handled) that checks
if self.reward_function == "estimated_hellinger_distance" and returns the
computed Hellinger reward and an appropriate tag (e.g.,
(estimated_hellinger_distance(qc), "exact") or call the existing helper used
elsewhere if named differently), so the estimator is used instead of returning
0.0 and logging a warning.
- Around line 484-489: The routing branch currently assumes pm_property_set
contains "final_layout" and directly assigns self.layout.final_layout =
pm_property_set["final_layout"], which can raise KeyError because
pm_property_set may be an empty dict; update the guard to verify the key exists
(e.g., use pm_property_set.get("final_layout") or an explicit 'if "final_layout"
in pm_property_set') before assigning, and only set self.layout.final_layout
when a non-None value is returned; target the branch that checks action_index in
self.actions_routing_indices and uses self.layout and pm_property_set.
- Around line 222-230: The current termination check "if not altered_qc" treats
empty QuantumCircuit objects as falsy and prematurely ends episodes; change the
check to explicitly test for None (i.e., "if altered_qc is None") so only
intentional termination from apply_action triggers episode end, leaving empty
but valid circuits to continue; update the branch that returns
create_feature_dict(self.state), 0.0, True, False, {} to run only when
altered_qc is None and keep using apply_action and create_feature_dict as
before.
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/reward.py (1)
207-212:⚠️ Potential issue | 🟠 MajorAddress incorrect exponential decay calculation when
device.dtis unavailable.The fallback
device.dt or 1.0is semantically incorrect. In Qiskit,instruction.durationis measured in time steps (units ofdt), anddevice.dtconverts those steps to seconds. Using1.0assumes each time step equals 1 second, but typicaldevice.dtvalues are ~2.22e-10 seconds (nanoseconds). This causes the decay calculation to be off by ~10 orders of magnitude whendtis unavailable.Qiskit's
Target.dtis legitimatelyNonefor some backends (not a rare edge case). Either:
- Skip the exponential decay term for backends without calibrated
dt- Log a warning and use a more conservative estimate
- Propagate
Noneand handle it at a higher level
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Line 119: The logging call in logger.warning currently uses an f-string which
forces formatting even when the warning level may be disabled; update the call
in the cost generation code (reference: the logger.warning line in
src/mqt/predictor/rl/cost_model.py that mentions "No basis for device") to use
lazy %-style or parameterized logging (e.g., logger.warning("No basis for device
'%s', using minimal universal basis for cost generation.", device_id)) so the
string is only formatted when the message will actually be emitted.
- Around line 167-168: Replace the bare "except Exception:" in
src/mqt/predictor/rl/cost_model.py with a specific exception catch for missing
gate methods (e.g., "except AttributeError:"), or a small explicit tuple like
"except (AttributeError, NotImplementedError):" so you only silence expected
missing-method errors while allowing other bugs (TypeError, ValueError, etc.) to
surface; update the except block that currently returns on error to use the
narrowed exception types.
- Around line 105-107: The if-check that tests "if transpile is None or
QuantumCircuit is None" is unreachable because those symbols are imported at
module level; remove that conditional and the accompanying ImportError so the
function no longer checks for None, or if you intended lazy importing, instead
move the Qiskit imports (transpile, QuantumCircuit) into the function and wrap
them in a try/except ImportError that raises the same message—update the code
around the function that references transpile/QuantumCircuit accordingly
(referenced symbols: transpile, QuantumCircuit in cost_model.py).
- Around line 170-171: The cost_table currently stores a flat total with a zero
second element; update generate_cost_table to compute and store the tuple (n_1q,
n_2q) by using the transpiled circuit qc_trans and the existing
estimate_basis_gate_counts utility and TWO_Q_GATES set: call
estimate_basis_gate_counts(qc_trans, basis_gates) (or the appropriate signature)
to get per-basis counts, sum counts for gates in TWO_Q_GATES to produce n_2q and
sum the remaining counts for n_1q, then assign cost_table[gate if controls == 0
else ("c" * controls) + gate] = (n_1q, n_2q) instead of (qc_trans.size(), 0).
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 621-622: The getattr call checking "_dev_avgs_cached" is redundant
because __init__ already sets self._dev_avgs_cached = False; replace
getattr(self, "_dev_avgs_cached", False) with a direct attribute access
(self._dev_avgs_cached) in the code path shown, so the condition becomes if
self._dev_avgs_cached: return; keep the existing __init__ initialization as the
source of truth and remove the defensive default.
- Line 250: Replace the eager f-string logging call logger.info(f"Final reward
({self.prev_reward_kind}): {self.prev_reward}") with lazy logging formatting to
avoid unnecessary string interpolation; use logger.info("Final reward (%s): %s",
self.prev_reward_kind, self.prev_reward) so that self.prev_reward_kind and
self.prev_reward are passed as parameters and only formatted if the log level is
enabled.
- Line 221: Replace the f-string logging call with lazy logging formatting to
avoid eager interpolation: in the code that currently calls
logger.info(f"Applying {self.action_set[action].name}") (within PredictorEnv /
predictorenv.py, using the logger, self.action_set and action variables), call
logger.info with a format string and pass self.action_set[action].name as an
argument (e.g., "Applying %s", self.action_set[action].name) so interpolation is
deferred until the log is emitted.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/utils.py (1)
49-55:⚠️ Potential issue | 🟡 MinorUpdate docstring: RuntimeWarning is no longer raised.
The Windows branch now logs and proceeds without raising; the Raises section is out of sync and should be corrected.
✏️ Suggested docstring update
- Raises: - RuntimeWarning: If the timeout is not supported on the current platform (e.g., Windows). - TimeoutExceptionError: If the function call exceeds the timeout limit. + Raises: + TimeoutExceptionError: If the function call exceeds the timeout limit (non-Windows).
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 33-58: Define the missing TORINO_CANONICAL_COSTS constant and
implement a canonical_cost function and unknown-device fallback used by
get_cost_table: add a TORINO_CANONICAL_COSTS mapping containing the canonical
cost table for the Torino device, implement canonical_cost(device_name: str) ->
CanonicalCostTable to return a known mapping for recognized names and otherwise
log a warning and return TORINO_CANONICAL_COSTS as the default; modify
get_cost_table to call canonical_cost when the device is not in
DEVICE_COST_CACHE/KNOWN_DEVICE_BASES instead of generating a new random table,
and ensure TORINO_CANONICAL_COSTS, canonical_cost, and get_cost_table are
exported/defined in the module so tests can import them.
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/approx_reward.py`:
- Around line 62-63: The code repeatedly calls estimate_basis_gate_counts (which
internally transpiles) for the same circuit state; modify calculate_reward
(and/or predictorenv.py's step) to memoize the basis-gate counts keyed by a
canonical representation of the quantum circuit (e.g., serialized QASM or a hash
of qc.data + device target), reuse the cached result for subsequent calls, and
fall back to calling get_basis_gates_from_target and estimate_basis_gate_counts
only on cache misses; ensure the cache key and lookup are used wherever basis =
get_basis_gates_from_target(device) and counts = estimate_basis_gate_counts(qc,
basis_gates=basis) are invoked so duplicate transpilation is avoided.
- Around line 30-40: The loop in estimate_basis_gate_counts uses deprecated
tuple unpacking of qc_t.data; replace it by iterating over CircuitInstruction
objects and access attributes directly (operation/qubits/clbits). Concretely, in
estimate_basis_gate_counts change the loop to iterate like "for circ_instr in
qc_t.data:", extract the gate with "instr = circ_instr.operation" (and use
circ_instr.qubits / circ_instr.clbits if needed), then use instr.name and the
existing BLACKLIST/counts logic.
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 259-261: The code only suppresses delta_reward when transitioning
from "approx"→"exact"; change the logic in the reward computation (referencing
prev_reward_kind, new_kind, and delta_reward in predictorenv.py) to suppress the
delta for any regime change (i.e., if prev_reward_kind != new_kind then set
delta_reward = 0.0) and ensure prev_reward_kind is still updated after this
check so later steps see the new regime; adjust the conditional that currently
checks (self.prev_reward_kind == "approx" and new_kind == "exact") to a generic
regime-change check and keep existing comments about metrics comparability.
- Around line 623-624: Replace the local duplicate gate_blacklist in
_ensure_device_averages_cached with the shared BLACKLIST constant from
approx_reward to avoid divergence; import BLACKLIST from the approx_reward
module and use it when building basis_ops (i.e., change basis_ops = [name for
name in op_names if name not in gate_blacklist] to use BLACKLIST), removing the
local {"measure","reset","delay","barrier"} definition.
- Around line 263-268: The two branches computing reward_val in predictorenv.py
are identical; replace the if/elif block that compares delta_reward with a
single assignment using the existing symbols: set reward_val = self.reward_scale
* delta_reward (preserving delta_reward == 0.0 result) inside the same function
where delta_reward and reward_val are used so you remove the redundant
conditional logic.
- Around line 218-229: Calculate and assign prev_reward and prev_reward_kind
only when the chosen action is non-terminal: move the call to calculate_reward
from before apply_action into the branch that handles non-terminal actions
(i.e., the branch that proceeds after altered_qc is not None and not the
terminate action), so terminal/terminate actions skip the expensive pre-action
evaluation; ensure apply_action, calculate_reward, and the used_actions logic
remain unchanged. Also initialize prev_reward and prev_reward_kind in reset()
(or add a guard before use) so the first step has defined values when no prior
reward exists.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 268-352: Update the calculate_reward signature to narrow the mode
parameter from plain str to a Literal type (Literal["auto", "exact", "approx"])
to catch invalid modes at type-check time; add the appropriate import for
Literal (from typing or typing_extensions depending on project compatibility)
and update any affected type hints/exports so static checkers (mypy/pyright)
pick it up—no logic changes needed inside calculate_reward or its use of
self._is_native_and_mapped, expected_fidelity,
approx_estimated_success_probability, etc.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
104-104: 🧹 Nitpick | 🔵 TrivialUse lazy logging formatting for consistency.
String concatenation in the log call is inconsistent with the
%s-style formatting used elsewhere in this file (e.g., lines 217, 271). Ruff rule G003 discourages eager string building in logging calls.♻️ Suggested fix
- logger.info("Init env: " + reward_function) + logger.info("Init env: %s", reward_function)
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 244-245: The call to calculate_reward at the start of each step is
redundant because self.prev_reward and self.prev_reward_kind already hold the
post-action reward from the previous step; remove the redundant invocation in
the step logic (the line calling calculate_reward that sets
self.prev_reward/self.prev_reward_kind) and instead initialize/seed
self.prev_reward and self.prev_reward_kind during reset() by calling
calculate_reward() once there, ensuring the end-of-step code that updates
prev_reward/prev_reward_kind (lines around where post-action reward is set)
continues to overwrite for the next step; adjust any references in step() and
reset() accordingly to reuse the carried-over prev_reward/prev_reward_kind
without recomputing on each new step.
- Around line 726-735: _determine_valid_actions_for_state and
_is_native_and_mapped duplicate expensive GatesInBasis/CheckMap work; cache the
pass instances and share computed results to avoid repeated checks. Create
cached pass instances on self (e.g., self._gates_in_basis and self._check_map)
initialized once and reused in both determine_valid_actions_for_state and
_is_native_and_mapped, and have determine_valid_actions_for_state store the last
computed flags (e.g., self._last_only_nat_gates and self._last_mapped) after
running the checks so _is_native_and_mapped can return those cached booleans if
available instead of re-running the passes; fall back to running the cached pass
instances if no cached results exist and update the cache.
- Around line 218-242: The estimated_hellinger_distance branch duplicates the
apply→decompose→update-state→check-valid-actions logic found later; extract that
shared sequence into a helper (e.g. _apply_and_update or
_apply_action_and_update_state) which calls apply_action, performs the gate-type
decompose loop, sets self.state and self.state._layout, increments
self.num_steps, updates self.valid_actions via determine_valid_actions_for_state
and raises RuntimeError if empty; then replace the duplicated block in the
reward_function == "estimated_hellinger_distance" branch to call the helper and
only keep the branch-specific reward logic (calculate_reward(mode="exact") vs
no_effect_penalty) before returning create_feature_dict(self.state), reward,
done, False, {}.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/compilation/test_predictor_rl.py`:
- Around line 149-204: Tests test_approx_reward_ef and test_approx_reward_esp
duplicate setup and assertions; replace them with a single parametrized test
using pytest.mark.parametrize over figure_of_merit values ("expected_fidelity"
and "estimated_success_probability") that creates Predictor(figure_of_merit=...)
and monkeypatches predictor.env._is_native_and_mapped the same way, runs val,
kind = predictor.env.calculate_reward(...), asserts common conditions (kind ==
"approx", 0<=val<=1, predictor.env._dev_avgs_cached, isinstance checks and
len(predictor.env._err_by_gate)>0, len(predictor.env._dur_by_gate)>0), and then
include the ESP-specific assertions (assert len(..._dur_by_gate)>0 and assert
predictor.env._tbar is None or predictor.env._tbar > 0.0) only when the
parameter == "estimated_success_probability"; keep references to Predictor,
predictor.env.calculate_reward, _is_native_and_mapped, _err_by_gate,
_dur_by_gate, and _tbar to locate the code.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
562-576: 🧹 Nitpick | 🔵 Trivial
determine_valid_actions_for_statecreates fresh pass instances while_native_and_mappedreuses cached ones.Both methods perform the same GatesInBasis + CheckMap checks.
determine_valid_actions_for_statestill instantiates new objects each call (lines 564, 574), while_native_and_mappeduses the cached instances from__init__. Consider reusing the cached instances here too, or calling_native_and_mappedinternally.
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 705-717: The cache in _native_and_mapped uses id(qc) which can be
recycled and cause stale hits; either remove the cache entirely and always run
_gates_in_basis_check and _check_map, or replace the id-based key with a safe
version counter: add a _state_version integer on the class (initialize in
__init__), increment _state_version whenever self.state is reassigned, store and
compare _last_state_version (instead of _last_qc_id) alongside
_last_native_mapped in _native_and_mapped, and update that version when you set
self.state so the cached (only_native, mapped) is only reused for the current
state.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 175-178: The condition is checking the imported type alias
figure_of_merit instead of the test parameter fom, so the ESP-specific
assertions (predictor.env._dur_by_gate and predictor.env._tbar) never run;
update the conditional to use the test parameter (if fom ==
"estimated_success_probability") so the assertions execute for the ESP reward
path and validate duration/tbar caching for predictor.env._dur_by_gate and
predictor.env._tbar.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 636-639: The suppress(KeyError, AttributeError, TypeError) around
the basis_ops loop is too broad and can hide legitimate TypeError bugs when
calling target.operation_from_name or reading op.num_qubits; change the
exception list to (KeyError, AttributeError) (i.e., remove TypeError) and
optionally add a debug/log line inside the except path that records the gate
name and the caught exception so skipped gates are visible; locate the loop
using basis_ops, target.operation_from_name, arity_by_name and op.num_qubits to
make the change.
- Around line 238-243: The code path for reward_function ==
"estimated_hellinger_distance" only gives the episode terminal step a real
reward and assigns no_effect_penalty for every non-terminal step, which leaves
the agent without shaping signal; update the branch in predictorenv.py (the
block checking self.reward_function == "estimated_hellinger_distance") to either
implement a non-terminal shaping heuristic or, if this is intentional, add a
concise explanatory comment above the block referencing that choice and the
involved symbols (self.reward_function, "estimated_hellinger_distance",
self.no_effect_penalty, self.calculate_reward, and self.action_terminate_index)
so readers know why calculate_reward is only called at done and why non-terminal
steps receive no signal.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 165-170: The monkeypatch targets a non-existent method
`_is_native_and_mapped` and has no effect; update the test to force the
approximate path by calling calculate_reward with mode="approx" instead of
relying on device mapping heuristics. Locate the call to
predictor.env.calculate_reward(qc=qc, mode="auto") in the test and change the
mode argument to "approx" (remove the monkeypatch line that sets
`_is_native_and_mapped`), ensuring the test exercises the approximate reward
branch in calculate_reward rather than depending on GatesInBasis/CheckMap
behavior.
| done = action == self.action_terminate_index | ||
|
|
||
| if self.reward_function == "estimated_hellinger_distance": | ||
| reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty | ||
| self.state._layout = self.layout # noqa: SLF001 | ||
| return create_feature_dict(self.state), reward_val, done, False, {} |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Hellinger path provides no shaping signal — non-terminal steps always get a flat penalty.
This is a design choice, but it means the RL agent receives zero useful gradient signal during an episode for estimated_hellinger_distance — reward information only comes at termination. If this is intentional (e.g., because there's no approximate Hellinger model), a brief comment explaining the rationale would help future readers.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 242-242: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In `@src/mqt/predictor/rl/predictorenv.py` around lines 238 - 243, The code path
for reward_function == "estimated_hellinger_distance" only gives the episode
terminal step a real reward and assigns no_effect_penalty for every non-terminal
step, which leaves the agent without shaping signal; update the branch in
predictorenv.py (the block checking self.reward_function ==
"estimated_hellinger_distance") to either implement a non-terminal shaping
heuristic or, if this is intentional, add a concise explanatory comment above
the block referencing that choice and the involved symbols
(self.reward_function, "estimated_hellinger_distance", self.no_effect_penalty,
self.calculate_reward, and self.action_terminate_index) so readers know why
calculate_reward is only called at done and why non-terminal steps receive no
signal.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
@flowerthrower You're absolutely right — this does look like a design oversight rather than an intentional choice.
The current code assigns no_effect_penalty to every non-terminal step when using estimated_hellinger_distance, regardless of whether the circuit actually improved or degraded. This is inconsistent with how the environment handles expected_fidelity and estimated_success_probability below (lines 244+), where:
- The previous reward is cached
- A new reward is computed after each action
- The step reward is the scaled delta (
reward_scale * (new_val - prev_reward)) - Only when
delta == 0.0does it fall back tono_effect_penalty
The absence of an approximate Hellinger model simply means you can't use the transpile-based proxy — but you can still compute the exact Hellinger distance at each step via self.calculate_reward(mode="exact") and shape rewards based on the delta, just like EF/ESP do.
Suggested fix:
Remove the special-case early return for Hellinger and let it flow through the same delta-based shaping logic below, but always call calculate_reward(mode="exact") (since mode="auto" would also choose "exact" anyway, as there's no approximate path). That would give the agent meaningful gradient signal during the episode.
✏️ Learnings added
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/predictor PR: 526
File: src/mqt/predictor/rl/predictorenv.py:238-243
Timestamp: 2026-02-09T13:15:50.387Z
Learning: When using reward_function == "estimated_hellinger_distance" in PredictorEnv (src/mqt/predictor/rl/predictorenv.py), the current implementation only computes the exact reward at episode termination and assigns no_effect_penalty to all non-terminal steps, which prevents intermediate shaping signal. This is likely unintentional; the Hellinger distance can be computed exactly at each step and used for delta-based reward shaping, similar to expected_fidelity and estimated_success_probability.
</add_learning>
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
flowerthrower
left a comment
There was a problem hiding this comment.
Hi there, great work so far. Only a few comments left from my end.
| done = action == self.action_terminate_index | ||
|
|
||
| if self.reward_function == "estimated_hellinger_distance": | ||
| reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty | ||
| self.state._layout = self.layout # noqa: SLF001 | ||
| return create_feature_dict(self.state), reward_val, done, False, {} |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| new_val, new_kind = self.calculate_reward(mode="auto") | ||
| delta_reward = new_val - self.prev_reward | ||
|
|
||
| if self.prev_reward_kind != new_kind: |
There was a problem hiding this comment.
Why do we apply a no-effect penalty here? Doesn't the change of the reward_kind imply that the action DID have an effect?
There was a problem hiding this comment.
This was marked unresolved but is still the case? @Shaobo-Zhou
|
@coderabbitai provide a detailed summary of the structural and methodical changes in this PR |
|
Tip For best results, initiate chat on the files or code changes.
🏗️ Structural ChangesNew Module
Core Environment Refactoring (
|
🧪 CI InsightsHere's what we observed from your CI run for 6e5e9b3. 🟢 All jobs passed!But CI Insights is watching 👀 |
flowerthrower
left a comment
There was a problem hiding this comment.
Hey Shaobo, I didn't quite finish the review yet but wanted to share some initial thoughts already. No hurry, but it would be nice to either address these issues or discuss them next week. Anyway I am happy to see this progressing already!👍
| new_val, new_kind = self.calculate_reward(mode="auto") | ||
| delta_reward = new_val - self.prev_reward | ||
|
|
||
| if self.prev_reward_kind != new_kind: |
There was a problem hiding this comment.
This was marked unresolved but is still the case? @Shaobo-Zhou
f628b73
into
munich-quantum-toolkit:qce-experiments
Description
This PR introduces a shaped, step-wise reward signal for the RL-based compiler.
For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:
Exact regime (native + mapped circuits)
If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.
Approximate regime (non-native / unmapped circuits)
If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.
Checklist: