Skip to content

✨ Add stepwise intermediate reward for RL#526

Merged
flowerthrower merged 13 commits into
munich-quantum-toolkit:qce-experimentsfrom
Shaobo-Zhou:new_RL
Mar 18, 2026
Merged

✨ Add stepwise intermediate reward for RL#526
flowerthrower merged 13 commits into
munich-quantum-toolkit:qce-experimentsfrom
Shaobo-Zhou:new_RL

Conversation

@Shaobo-Zhou
Copy link
Copy Markdown
Collaborator

@Shaobo-Zhou Shaobo-Zhou commented Nov 26, 2025

Description

This PR introduces a shaped, step-wise reward signal for the RL-based compiler.

For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:

  1. Exact regime (native + mapped circuits)
    If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.

  2. Approximate regime (non-native / unmapped circuits)
    If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.

Checklist:

  • The pull request only contains commits that are focused and relevant to this change.
  • I have added appropriate tests that cover the new/changed functionality.
  • I have updated the documentation to reflect these changes.
  • I have added entries to the changelog for any noteworthy additions, changes, fixes, or removals.
  • I have added migration instructions to the upgrade guide (if needed).
  • The changes follow the project's style guidelines and introduce no new warnings.
  • The changes are fully tested and pass the CI checks.
  • I have reviewed my own code changes.

@Shaobo-Zhou Shaobo-Zhou changed the title Add step-wise intermediate reward Add stepwise intermediate reward Nov 26, 2025
@Shaobo-Zhou Shaobo-Zhou changed the title Add stepwise intermediate reward Add stepwise intermediate reward for RL Nov 26, 2025
@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 93.22917% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/mqt/predictor/rl/approx_reward.py 93.7% 7 Missing ⚠️
src/mqt/predictor/rl/predictorenv.py 92.5% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

@Shaobo-Zhou Shaobo-Zhou marked this pull request as ready for review December 3, 2025 22:07
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 3, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an approximate reward module and integrates a dual-path reward computation into the RL environment with device-average caching and reward shaping; updates PropertySet-driven postprocessing and action typings; small ML/utils and changelog edits; accompanying test updates.

Changes

Cohort / File(s) Summary
RL environment core
src/mqt/predictor/rl/predictorenv.py
Adds ctor args reward_scale, no_effect_penalty; new methods calculate_reward, _ensure_device_averages_cached; changes action application flow, introduces _apply_and_update, shapes delta rewards, tracks prev_reward/kind, and accepts PropertySet-style postprocessing.
Approximate reward module
src/mqt/predictor/rl/approx_reward.py
New module: derive device basis gates, transpile-and-count basis gates, compute approximate expected fidelity and estimated success probability (ESP) using per-gate error rates, durations, optional tbar, and parallelism/liveliness features.
Reward math tweak
src/mqt/predictor/reward.py
Use device.dt or 1.0 fallback when computing idle-time decay for single-qubit delay in ESP calculation.
Action typing change
src/mqt/predictor/rl/actions.py
DeviceDependentAction.do_while callback type changed to accept PropertySet instead of dict[str, Circuit]; TYPE_CHECKING imports adjusted.
Postprocessing typing
src/mqt/predictor/rl/predictorenv.py (signatures)
_handle_qiskit_layout_postprocessing signature now accepts a PropertySet-style dict[str, Any] (PropertySet semantics) rather than a PassManager object; type hints updated.
ML & utils minor
src/mqt/predictor/ml/predictor.py, src/mqt/predictor/utils.py
Whitespace trimming in ML predictor; replaced Windows RuntimeWarning with logger.info for unsupported signal timeout while retaining non-Windows signal-based timeout behavior.
Tests
tests/compilation/test_predictor_rl.py, tests/device_selection/test_predictor_ml.py, tests/hellinger_distance/test_estimated_hellinger_distance.py
Added test ensuring approximate-reward path uses cached per-gate maps; switched tests from BenchmarkLevel.ALG to BenchmarkLevel.INDEP; increased compile timeout and adjusted compile_training_circuits/generate_training_data calls to keyword form; added mapping checks and minor assertion edits.
Infrastructure & changelog
.pre-commit-config.yaml, CHANGELOG.md
Changed ty-check hook language from unsupported to system; CHANGELOG updated to mention intermediate rewards, removed items for custom trained model names and x86 macOS support, and added PR link reference.

Sequence Diagram(s)

sequenceDiagram
    participant Agent as RL Agent
    participant Env as PredictorEnv
    participant Reward as calculate_reward()
    participant Approx as approx_reward module
    participant Cache as Device Averages Cache
    participant Device as Target Device

    Agent->>Env: step(action)
    Env->>Env: _apply_and_update(action) -> altered_qc
    Env->>Reward: calculate_reward(pre_qc, mode="auto")
    Reward-->>Env: reward_pre, kind_pre
    Env->>Reward: calculate_reward(altered_qc, mode="auto")
    alt exact path (native & mapped)
        Reward->>Device: query exact metrics / simulation
        Device-->>Reward: exact evaluation
    else approx path
        Reward->>Cache: _ensure_device_averages_cached()
        Cache->>Device: read per-gate errors & durations
        Device-->>Cache: per-gate maps
        Reward->>Approx: approx_estimated_success_probability(...)
        Approx-->>Reward: approx reward
    end
    Reward-->>Env: reward_post, kind_post
    Env->>Env: compute delta, apply reward_scale & no_effect_penalty
    Env-->>Agent: observation, shaped_reward, done, info
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐇 I hop through gates with tiny paws,
I count errors and learn the laws,
I cache the noise beneath the moon,
Give tiny carrots, shape reward soon,
Hop, estimate, and hum a tune.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description covers the motivation (shaped reward signal), the implementation approach (exact vs approximate regimes), and provides clear context. However, the checklist items are all unchecked, suggesting incomplete documentation, tests, and changelog updates. Verify that all checklist items are actually completed despite being unchecked, or check them accordingly. Ensure tests and documentation updates are present.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 91.67% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title accurately reflects the main objective: adding stepwise intermediate rewards for RL-based compilation, which is the central theme of all substantial changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad5ae0b and 0c42b41.

📒 Files selected for processing (13)
  • CHANGELOG.md (3 hunks)
  • docs/setup.md (1 hunks)
  • noxfile.py (3 hunks)
  • pyproject.toml (1 hunks)
  • src/mqt/predictor/ml/predictor.py (5 hunks)
  • src/mqt/predictor/reward.py (1 hunks)
  • src/mqt/predictor/rl/cost_model.py (1 hunks)
  • src/mqt/predictor/rl/parsing.py (1 hunks)
  • src/mqt/predictor/rl/predictorenv.py (10 hunks)
  • src/mqt/predictor/utils.py (1 hunks)
  • tests/compilation/test_predictor_rl.py (5 hunks)
  • tests/device_selection/test_predictor_ml.py (3 hunks)
  • tests/hellinger_distance/test_estimated_hellinger_distance.py (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-10-14T14:37:38.047Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.

Applied to files:

  • CHANGELOG.md
📚 Learning: 2025-11-05T09:23:46.540Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1287
File: src/qdmi/dd/Device.cpp:492-521
Timestamp: 2025-11-05T09:23:46.540Z
Learning: In the munich-quantum-toolkit/core repository, the `qasm3::Importer::imports()` function is backwards compatible with OpenQASM 2 programs. Therefore, it can be used to parse both QASM2 and QASM3 program formats without requiring separate importers for each version.

Applied to files:

  • src/mqt/predictor/rl/parsing.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

  • src/mqt/predictor/rl/parsing.py
  • src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-10-10T08:09:54.528Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: bindings/CMakeLists.txt:0-0
Timestamp: 2025-10-10T08:09:54.528Z
Learning: In the Munich Quantum Toolkit (MQT) Core project, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which automatically prefixes all CMake `DESTINATION` paths with `mqt/core/` during wheel installation. Therefore, CMake install destinations are relative to the `mqt/core` package namespace, not `site-packages`.

Applied to files:

  • pyproject.toml
📚 Learning: 2025-10-10T08:10:16.394Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: test/python/na/test_na_fomac.py:35-0
Timestamp: 2025-10-10T08:10:16.394Z
Learning: In the munich-quantum-toolkit/core repository, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which means CMake `install()` commands with `DESTINATION <path>` install files relative to `mqt/core/` in the wheel, making them accessible via `files("mqt.core").joinpath("<path>")`.

Applied to files:

  • pyproject.toml
📚 Learning: 2025-12-02T10:08:36.022Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: .github/workflows/cd.yml:40-48
Timestamp: 2025-12-02T10:08:36.022Z
Learning: In munich-quantum-toolkit workflows, the reusable-python-packaging-sdist.yml workflow uploads the sdist artifact with the name `cibw-sdist` (or `dev-cibw-sdist` for pull requests), which means it is covered by the `cibw-*` glob pattern when downloading artifacts for deployment.

Applied to files:

  • pyproject.toml
📚 Learning: 2025-11-27T16:58:08.564Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/ddsim PR: 720
File: noxfile.py:57-60
Timestamp: 2025-11-27T16:58:08.564Z
Learning: In Nox sessions, when using `session.install()` to install a tool and then `session.run(..., external=True)` to execute it, this pattern is valid because the session's virtual environment is activated and its bin directory is on PATH. The `external=True` flag allows calling executables on PATH, which includes the activated venv's bin directory, so it will find tools installed via `session.install()`.

Applied to files:

  • noxfile.py
🧬 Code graph analysis (3)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
src/mqt/predictor/ml/predictor.py (2)
  • compile_training_circuits (206-240)
  • generate_training_data (242-300)
src/mqt/predictor/reward.py (2)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
  • device (48-50)
tests/compilation/test_reward.py (1)
  • device (38-40)
tests/compilation/test_predictor_rl.py (2)
src/mqt/predictor/rl/cost_model.py (2)
  • canonical_cost (161-173)
  • get_cost_table (140-158)
src/mqt/predictor/rl/predictorenv.py (1)
  • calculate_reward (272-355)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)


241-241: Logging statement uses f-string

(G004)


269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)


666-668: try-except-continue detected, consider logging the exception

(S112)


666-666: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (19)
noxfile.py (2)

21-21: LGTM!

The pathlib import is necessary for the filesystem operations in the new _cleanup function.


112-113: LGTM!

The conditional cleanup call is appropriately placed and only runs in CI environments to save disk space. The pattern is consistent with other CI checks in the file.

CHANGELOG.md (1)

14-15: LGTM! Changelog entries follow the project template.

The changelog entries, PR links, and contributor reference are correctly formatted and consistent with the project's changelog template. Based on learnings, the format matches the expected template.

Also applies to: 49-50, 68-68

pyproject.toml (1)

97-98: LGTM! Packaging configuration correctly added.

The check-sdist configuration correctly specifies that _version.py should only be present in source distributions (sdist), which is the expected behavior for version files generated by hatch-vcs during the build process.

src/mqt/predictor/utils.py (1)

54-54: LGTM! Appropriate logging level for platform limitation.

The change from warnings.warn with RuntimeWarning to logger.info is appropriate. Since running without timeout on Windows is a known platform limitation rather than an error condition, an informational log message is more suitable than a warning.

tests/device_selection/test_predictor_ml.py (1)

46-46: LGTM! Test benchmark level updated consistently.

The change from BenchmarkLevel.ALG to BenchmarkLevel.INDEP is consistent with the PR objective and changelog entry about updating the test circuit level for the RL predictor. All test cases in this file have been updated consistently.

Also applies to: 66-66, 96-96

src/mqt/predictor/ml/predictor.py (4)

64-64: LGTM! Platform-specific workaround for multiprocessing issue.

The NO_PARALLEL constant correctly identifies the Windows + Python 3.13+ combination where joblib's "loky" backend is broken. This is a well-documented issue with the missing _posixsubprocess module.


232-235: LGTM! Clear documentation and appropriate fallback.

The comment clearly explains the Windows/Python 3.13 multiprocessing limitation, and the fallback to single-threaded execution (num_jobs=1) is the appropriate workaround for this platform issue.


275-276: LGTM! Consistent application of multiprocessing workaround.

The same NO_PARALLEL conditional is correctly applied to the generate_training_data method, ensuring consistent behavior across all parallel execution paths.


415-418: LGTM! Multiprocessing workaround applied to ML training.

The NO_PARALLEL conditional is correctly applied to the GridSearchCV training, ensuring the RandomForest model training also respects the Windows/Python 3.13 limitation.

docs/setup.md (1)

112-112: LGTM! Documentation updated to reflect new benchmark level.

The documentation example has been updated from BenchmarkLevel.ALG to BenchmarkLevel.INDEP, which is consistent with the changes in test files and the changelog entry about updating the test circuit level for the RL predictor.

tests/compilation/test_predictor_rl.py (3)

46-54: LGTM!

The switch from BenchmarkLevel.ALG to BenchmarkLevel.INDEP aligns with the PR's focus on device-independent circuits for RL training. The test logic remains correct.


108-115: LGTM!

Good addition of mapping verification. The test now validates both native gates and proper coupling map adherence, which is essential for the new dual-path reward system.


156-167: LGTM!

Comprehensive test coverage for cost model fallback behavior. The identity check (is TORINO_CANONICAL_COSTS) correctly verifies the exact table is returned rather than a copy.

src/mqt/predictor/rl/cost_model.py (2)

1-35: LGTM!

Well-documented module header and type definitions. The CanonicalCostTable type alias is appropriately defined as a Mapping for read-only access.


211-237: LGTM!

The approximate fidelity calculation is correct. The defensive max(n_1q, 0) guards and final clamping to [0, 1] ensure numerical robustness.

src/mqt/predictor/rl/predictorenv.py (3)

179-188: LGTM!

Good addition of instance variables for the dual-path reward system. The device averages cache (_p1_avg, _p2_avg, etc.) and reward tracking variables (prev_reward, prev_reward_kind) are appropriately initialized.


731-740: LGTM!

The method correctly checks both native gate basis and coupling map compliance. The logic is clear and matches the determine_valid_actions_for_state pattern.


458-464: Undefined method fom_aware_compile and attribute max_iter.

The code references self.fom_aware_compile and self.max_iter, but their definitions cannot be verified without access to the codebase. If these are not defined in the class or inherited from a parent class, this will raise an AttributeError at runtime when a stochastic action is applied.

Comment thread noxfile.py Outdated
Comment thread noxfile.py Outdated
Comment thread noxfile.py Outdated
Comment thread src/mqt/predictor/reward.py Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread tests/compilation/test_predictor_rl.py Outdated
Comment thread tests/hellinger_distance/test_estimated_hellinger_distance.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
src/mqt/predictor/rl/predictorenv.py (2)

269-269: Remove unused noqa directive.

Static analysis indicates SLF001 is not enabled, making this directive unnecessary.

-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout

256-264: Clarify symmetric reward scaling intent.

Both positive and negative delta branches multiply by reward_scale, while no_effect_penalty is only used for zero delta. The naming suggests asymmetric behavior but implementation is symmetric. Consider renaming for clarity:

-        self.reward_scale = 1
-        self.no_effect_penalty = 0.001
+        self.delta_scale = 1
+        self.no_progress_penalty = 0.001

And update usage accordingly. This makes the symmetric scaling intent explicit.

src/mqt/predictor/rl/cost_model.py (1)

300-301: Remove redundant initialization of t_hat.

Line 300 initializes t_hat = 0.0 but it's immediately overwritten on line 301. This is dead code.

-    t_hat = 0.0
     t_hat = (n_1q * tau1_avg + n_2q * tau2_avg) / k_eff
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c42b41 and 8259b4d.

📒 Files selected for processing (5)
  • src/mqt/predictor/reward.py (1 hunks)
  • src/mqt/predictor/rl/cost_model.py (1 hunks)
  • src/mqt/predictor/rl/predictorenv.py (10 hunks)
  • tests/compilation/test_predictor_rl.py (5 hunks)
  • tests/hellinger_distance/test_estimated_hellinger_distance.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-11-01T15:57:31.153Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.

Applied to files:

  • src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-10-09T13:20:11.483Z
Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.

Applied to files:

  • src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (4)
tests/hellinger_distance/test_estimated_hellinger_distance.py (2)
src/mqt/predictor/ml/predictor.py (2)
  • compile_training_circuits (206-240)
  • generate_training_data (242-300)
tests/device_selection/test_predictor_ml.py (2)
  • path_uncompiled_circuits (27-29)
  • path_compiled_circuits (33-35)
tests/compilation/test_predictor_rl.py (2)
src/mqt/predictor/rl/cost_model.py (2)
  • canonical_cost (171-183)
  • get_cost_table (150-168)
src/mqt/predictor/rl/predictorenv.py (1)
  • calculate_reward (272-356)
src/mqt/predictor/rl/predictorenv.py (5)
src/mqt/predictor/rl/actions.py (4)
  • CompilationOrigin (92-98)
  • DeviceDependentAction (136-146)
  • PassType (101-110)
  • Action (114-127)
src/mqt/predictor/rl/cost_model.py (2)
  • approx_estimated_success_probability (250-308)
  • approx_expected_fidelity (221-247)
src/mqt/predictor/rl/helper.py (1)
  • create_feature_dict (72-94)
src/mqt/predictor/rl/parsing.py (1)
  • postprocess_vf2postlayout (229-250)
src/mqt/predictor/utils.py (1)
  • calc_supermarq_features (93-145)
src/mqt/predictor/reward.py (2)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)
  • device (48-50)
tests/compilation/test_reward.py (1)
  • device (38-40)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)


241-241: Logging statement uses f-string

(G004)


269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (13)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

219-226: LGTM! API calls correctly updated to use keyword arguments.

The compile_training_circuits and generate_training_data calls now properly use keyword arguments, aligning with the updated function signatures in src/mqt/predictor/ml/predictor.py. The removal of num_workers is consistent with the new NO_PARALLEL constant handling parallelism internally.

src/mqt/predictor/rl/cost_model.py (4)

41-147: Well-structured canonical cost tables with clear documentation.

The per-device cost tables are well-organized with clear comments explaining the decomposition rationale. The registry pattern with DEVICE_CANONICAL_COSTS provides clean extensibility.


150-168: Safe fallback behavior with appropriate warnings.

The unknown-device fallback to Torino with both a warning and log message ensures visibility while maintaining functionality. The stacklevel=3 is appropriate for the call chain.


186-218: Robust gate cost estimation with conservative fallbacks.

The iteration pattern correctly uses the modern CircuitInstruction access. The conservative fallback for unknown gates (3×1q or 1×2q+4×1q) is reasonable for approximation purposes.


221-247: Clean implementation of approximate fidelity and ESP calculations.

Both functions properly clamp outputs to [0, 1] and handle edge cases. The ESP model's idle-time penalty with liveness modulation is a reasonable approximation approach.

Also applies to: 250-308

src/mqt/predictor/reward.py (1)

206-212: Proper handling of discrete time unit with safe fallback.

The device.dt or 1.0 fallback correctly handles backends without discrete-time calibration, preventing TypeError when device.dt is None. The dt-scaled duration is appropriately used in the exponential decay calculation.

tests/compilation/test_predictor_rl.py (3)

156-167: Good coverage for unknown-device fallback and unknown-gate defaults.

The test properly verifies that unknown devices trigger a UserWarning and fall back to TORINO_CANONICAL_COSTS, and that unknown gates return (0, 0). This aligns with the behavior defined in get_cost_table and canonical_cost.


170-219: Comprehensive test coverage for dual-path reward calculation.

The test covers all three modes (exact, approx, auto) for ESP and verifies that critical_depth always returns exact results. The addition of the auto-exact path test (lines 197-200) addresses the previous review suggestion. The monkeypatch approach for forcing the approx path is appropriate.


108-115: Good addition of mapping validation.

The CheckMap assertion ensures the compiled circuit respects the device's coupling map, complementing the existing native-gates check.

src/mqt/predictor/rl/predictorenv.py (4)

272-356: Well-designed dual-path reward calculation with proper mode handling.

The calculate_reward method cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto mode correctly uses _is_native_and_mapped to determine the appropriate path. The device_id parameter is now correctly passed to both approx_expected_fidelity and approx_estimated_success_probability.


608-730: Thorough device calibration caching with proper error handling.

The _ensure_device_averages_cached method properly aggregates 1q/2q error rates, durations, and coherence times. Key strengths:

  • Handles dt scaling for durations correctly
  • Tries flipped orientation for uni-directional couplings (lines 690-691)
  • Uses specific exceptions (KeyError, AttributeError) per past review feedback
  • Provides informative error messages when calibration data is missing

732-741: Clean helper for native/mapped circuit detection.

The _is_native_and_mapped method properly combines both checks using GatesInBasis and CheckMap passes, returning a boolean suitable for auto-mode reward selection.


190-270: Solid step-wise reward shaping implementation.

The refactored step() method implements the shaped reward signal as described in the PR objectives:

  • Pre-action reward evaluation establishes baseline
  • Post-action delta computation drives intermediate rewards
  • The prev_kind == "approx" and new_kind == "exact" check (line 253-254) correctly handles regime transitions
  • Terminal actions use exact metrics for final reward

One note: The logging statements at lines 210 and 241 use f-strings (flagged by Ruff G004), but this is a stylistic preference and doesn't affect functionality.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
src/mqt/predictor/rl/predictorenv.py (2)

269-269: Remove unused noqa directive.

Static analysis indicates this directive is unnecessary since SLF001 is not enabled in the linter configuration.

-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout

458-458: Type annotation mismatch persists.

The variable is annotated as PropertySet | None but initialized to {}. Since PropertySet is a Qiskit dict-like class, consider initializing to None and updating the downstream logic, or change the annotation to dict[str, Any] | None to match the runtime type.

-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None

This would require updating line 477/484 to handle the None initialization:

pm_property_set = dict(pm.property_set) if hasattr(pm, "property_set") else None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b5bf6f and 14aa3ac.

📒 Files selected for processing (2)
  • src/mqt/predictor/rl/predictorenv.py (10 hunks)
  • tests/compilation/test_predictor_rl.py (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (6)
src/mqt/predictor/reward.py (4)
  • crit_depth (40-43)
  • esp_data_available (229-324)
  • estimated_success_probability (102-226)
  • expected_fidelity (46-74)
src/mqt/predictor/rl/actions.py (5)
  • CompilationOrigin (92-98)
  • DeviceDependentAction (136-146)
  • PassType (101-110)
  • get_actions_by_pass_type (462-467)
  • Action (114-127)
src/mqt/predictor/rl/cost_model.py (2)
  • approx_estimated_success_probability (250-308)
  • approx_expected_fidelity (221-247)
src/mqt/predictor/rl/helper.py (2)
  • create_feature_dict (72-94)
  • get_path_training_circuits (107-109)
src/mqt/predictor/rl/parsing.py (1)
  • postprocess_vf2postlayout (229-250)
src/mqt/predictor/utils.py (1)
  • calc_supermarq_features (93-145)
🪛 Ruff (0.14.7)
tests/compilation/test_predictor_rl.py

227-227: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)


241-241: Logging statement uses f-string

(G004)


269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
  • GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
  • GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
🔇 Additional comments (9)
src/mqt/predictor/rl/predictorenv.py (5)

179-188: LGTM!

The renamed variables (reward_scale, no_effect_penalty) are clearer than the previous names. The device-average caching variables are well-structured for the approximate reward path.


272-356: LGTM!

The dual-path reward calculation is well-structured with clear mode selection logic. The device_id is now correctly passed to the approximate reward functions, and the fallback behavior for unsupported reward functions is appropriately logged.


495-526: LGTM!

The refactored method properly accepts the property set as a parameter, improving testability and decoupling from the PassManager. The early return for falsy pm_property_set handles edge cases appropriately.


732-741: LGTM!

Clean helper method that encapsulates the native-and-mapped check logic. The use of Qiskit's analysis passes (GatesInBasis, CheckMap) is appropriate.


253-254: Important regime-switch handling.

Setting delta_reward = 0.0 when transitioning from approximate to exact mode prevents numerical artifacts from comparing metrics computed with different methods. This is a sound design choice.

tests/compilation/test_predictor_rl.py (4)

35-44: LGTM!

The new imports support testing the cost model fallback behavior and the dual-path reward computation. The TYPE_CHECKING pattern for MonkeyPatch is correctly used.


105-115: LGTM!

Adding explicit validation via GatesInBasis and CheckMap strengthens the test by verifying the compiled circuit meets device constraints. The descriptive assertion messages aid debugging.


156-167: LGTM!

Good coverage of cost model edge cases: unknown device fallback and unknown gate default. The test properly verifies both the warning emission and the returned values.


170-235: LGTM!

Comprehensive test coverage for the dual-path reward computation:

  • ESP exact/approx/auto modes
  • Auto mode selecting exact for native+mapped circuits (addresses previous review feedback)
  • RuntimeError on broken Target API
  • Critical depth always returning exact

The monkeypatch usage for forcing the approximate path is appropriate.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread tests/compilation/test_predictor_rl.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (2)
src/mqt/predictor/rl/predictorenv.py (2)

269-269: Unused noqa directive.

Static analysis confirms SLF001 is not enabled. Remove the directive.

-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout

458-458: Type annotation mismatch.

The variable is annotated as PropertySet | None but initialized to an empty dict. Either initialize as None or change the annotation to dict[str, Any]:

-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None

Then adjust the stochastic branch return type expectation accordingly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14aa3ac and 04fde70.

📒 Files selected for processing (1)
  • src/mqt/predictor/rl/predictorenv.py (10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (4)
src/mqt/predictor/reward.py (1)
  • crit_depth (40-43)
src/mqt/predictor/rl/helper.py (1)
  • get_path_training_circuits (107-109)
src/mqt/predictor/rl/parsing.py (2)
  • final_layout_bqskit_to_qiskit (186-226)
  • postprocess_vf2postlayout (229-250)
src/mqt/predictor/utils.py (1)
  • calc_supermarq_features (93-145)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)


241-241: Logging statement uses f-string

(G004)


269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
  • GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
  • GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
🔇 Additional comments (3)
src/mqt/predictor/rl/predictorenv.py (3)

45-75: LGTM on import organization.

The imports are well-organized, with appropriate separation of type-checking imports and runtime imports. The new imports for approx_expected_fidelity, approx_estimated_success_probability, and calc_supermarq_features align with the dual-path reward computation feature.


272-356: Well-structured dual-path reward computation.

The calculate_reward method cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto-mode logic for selecting the appropriate path based on circuit state is sound. Previous feedback about missing device_id has been addressed.


495-526: LGTM on layout postprocessing refactor.

The method now accepts pm_property_set directly as a dict, which is cleaner than passing a PassManager. The early return on line 501-502 properly handles the None/empty case.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
src/mqt/predictor/rl/predictorenv.py (5)

17-18: Empty conditional block should be removed.

This empty block with just pass serves no purpose. If it was a placeholder for future type-only imports on Python 3.11+, add a comment explaining the intent; otherwise remove the block entirely.

-if sys.version_info >= (3, 11) and TYPE_CHECKING:  # pragma: no cover
-    pass

274-274: Remove unused noqa directive.

Static analysis flags this noqa: SLF001 as unused. The directive can be removed.

-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout

463-463: Type annotation mismatch persists.

The variable is annotated as PropertySet | None but initialized as {} (empty dict). Later it's assigned dict(pm.property_set). For type consistency, consider:

-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None

Then ensure the downstream code handles None appropriately (which it already does at line 506).


711-713: Consider logging a warning for partial calibration data.

The check raises RuntimeError when both p1 and p2 are empty, which is correct. However, if only one list is populated (e.g., a device with only single-qubit gates), a warning would be helpful for debugging.

         if not p1 and not p2:
             msg = "No valid 1q/2q calibration data found in Target, cannot compute approximate reward."
             raise RuntimeError(msg)
+        if not p1:
+            logger.warning("No single-qubit calibration data found; _p1_avg will be 0.0")
+        elif not p2:
+            logger.warning("No two-qubit calibration data found; _p2_avg will be 0.0")

737-746: Consider caching analysis pass instances for performance.

_is_native_and_mapped creates new GatesInBasis and CheckMap instances on every call. During RL training, this method is invoked frequently via calculate_reward. Caching these passes would improve performance.

Initialize cached passes in __init__:

# In __init__:
self._gates_in_basis_check: GatesInBasis | None = None
self._check_map: CheckMap | None = None

Then lazily initialize in _is_native_and_mapped:

 def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool:
-    check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names)
+    if self._gates_in_basis_check is None:
+        self._gates_in_basis_check = GatesInBasis(basis_gates=self.device.operation_names)
+    check_nat_gates = self._gates_in_basis_check
     check_nat_gates(qc)
     only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"]

-    check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+    if self._check_map is None:
+        self._check_map = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+    check_mapping = self._check_map
     check_mapping(qc)
     mapped = check_mapping.property_set["is_swap_mapped"]

     return bool(only_nat_gates and mapped)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04fde70 and 6a104b8.

📒 Files selected for processing (1)
  • src/mqt/predictor/rl/predictorenv.py (11 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (5)
src/mqt/predictor/reward.py (3)
  • crit_depth (40-43)
  • estimated_success_probability (102-226)
  • expected_fidelity (46-74)
src/mqt/predictor/rl/cost_model.py (2)
  • approx_estimated_success_probability (250-308)
  • approx_expected_fidelity (221-247)
src/mqt/predictor/rl/helper.py (2)
  • create_feature_dict (72-94)
  • get_path_training_circuits (107-109)
src/mqt/predictor/rl/parsing.py (3)
  • final_layout_bqskit_to_qiskit (186-226)
  • final_layout_pytket_to_qiskit (167-183)
  • postprocess_vf2postlayout (229-250)
src/mqt/predictor/utils.py (1)
  • calc_supermarq_features (93-145)
🪛 Ruff (0.14.7)
src/mqt/predictor/rl/predictorenv.py

214-214: Logging statement uses f-string

(G004)


245-245: Logging statement uses f-string

(G004)


274-274: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
  • GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
  • GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
🔇 Additional comments (6)
src/mqt/predictor/rl/predictorenv.py (6)

88-89: LGTM: Penalty sign corrected and parameters made configurable.

The no_effect_penalty is now correctly initialized as a negative value (-0.001), and both reward_scale and no_effect_penalty are exposed as constructor parameters for tuning during RL experimentation. This addresses previous review feedback.


257-259: Good handling of regime transition.

The comment clearly explains why delta_reward = 0.0 when transitioning from approximate to exact metrics—the two aren't directly comparable. This is the correct approach to avoid misleading reward signals.


277-319: Well-structured dual-path reward computation.

The calculate_reward method cleanly implements the exact vs. approximate reward paths with a clear mode selection mechanism. The fallback for unsupported reward functions with a warning is good defensive coding.


500-531: LGTM: Layout postprocessing signature updated correctly.

The method signature now accepts dict[str, Any] | None which aligns with how it's called (passing dict(pm.property_set)). The early return for falsy pm_property_set is a good guard.


229-231: Good defensive normalization for high-level gates.

Decomposing unitary and clifford gates before reward computation ensures the reward functions can properly analyze the circuit. This prevents errors from gates that don't have calibration data.


130-133: Verify whether this action registration order change breaks pre-trained model compatibility.

Registering OPT actions before LAYOUT actions changes the indices in action_set. While web search suggests PassType operates as a flexible tagging system (with the RL agent dynamically determining execution order), the action indices themselves may still be significant if:

  • Pre-trained models encode action indices as state features
  • Saved experience or replay buffers reference specific action indices
  • Model weights learned associations with particular action index ranges

Confirm whether existing trained agents or datasets depend on action registration indices, or if this reordering is safe.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pyproject.toml (1)

268-269: Remove duplicate [tool.check-sdist] configuration.

This section duplicates lines 97-98. TOML does not allow duplicate table headers, and parsers will reject this file, causing build failures.

Apply this diff to remove the duplicate:

-[tool.check-sdist]
-sdist-only = ["src/mqt/predictor/_version.py"]
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a104b8 and ae43600.

📒 Files selected for processing (2)
  • CHANGELOG.md (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-10-10T08:09:54.528Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: bindings/CMakeLists.txt:0-0
Timestamp: 2025-10-10T08:09:54.528Z
Learning: In the Munich Quantum Toolkit (MQT) Core project, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which automatically prefixes all CMake `DESTINATION` paths with `mqt/core/` during wheel installation. Therefore, CMake install destinations are relative to the `mqt/core` package namespace, not `site-packages`.

Applied to files:

  • pyproject.toml
📚 Learning: 2025-10-10T08:10:16.394Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: test/python/na/test_na_fomac.py:35-0
Timestamp: 2025-10-10T08:10:16.394Z
Learning: In the munich-quantum-toolkit/core repository, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which means CMake `install()` commands with `DESTINATION <path>` install files relative to `mqt/core/` in the wheel, making them accessible via `files("mqt.core").joinpath("<path>")`.

Applied to files:

  • pyproject.toml
📚 Learning: 2025-12-02T10:08:36.022Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: .github/workflows/cd.yml:40-48
Timestamp: 2025-12-02T10:08:36.022Z
Learning: In munich-quantum-toolkit workflows, the reusable-python-packaging-sdist.yml workflow uploads the sdist artifact with the name `cibw-sdist` (or `dev-cibw-sdist` for pull requests), which means it is covered by the `cibw-*` glob pattern when downloading artifacts for deployment.

Applied to files:

  • pyproject.toml
📚 Learning: 2025-10-14T14:37:38.047Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.

Applied to files:

  • CHANGELOG.md
📚 Learning: 2025-12-05T17:45:37.602Z
Learnt from: denialhaag
Repo: munich-quantum-toolkit/core PR: 1360
File: .github/workflows/reusable-mlir-tests.yml:40-43
Timestamp: 2025-12-05T17:45:37.602Z
Learning: In the munich-quantum-toolkit/core repository, patch releases of LLVM dependencies don't require documentation updates, changelog entries, or additional tests beyond what's validated by passing CI checks.

Applied to files:

  • CHANGELOG.md
📚 Learning: 2025-12-01T11:00:40.342Z
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: CHANGELOG.md:18-18
Timestamp: 2025-12-01T11:00:40.342Z
Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the CHANGELOG.md intentionally references the parent MQT Core repository's release notes (https://github.com/munich-quantum-toolkit/core/releases) because the plugin repository is based on work performed in the parent repository.

Applied to files:

  • CHANGELOG.md
🔇 Additional comments (3)
pyproject.toml (2)

97-98: LGTM!

The check-sdist configuration correctly ensures the hatch-vcs generated version file is included in source distributions.


130-130: LGTM!

The warning filter appropriately suppresses expected warnings when tests exercise devices without defined canonical cost tables, consistent with the fallback behavior introduced in this PR.

CHANGELOG.md (1)

50-51: Changelog entries and PR links look good.

Both new changelog entries are properly added to the Unreleased section, and the corresponding PR link references are correctly defined in the PR links section. The entries follow the existing format and conventions used elsewhere in the file. Once the formatting issue on line 15 is fixed, this will be ready.

Comment thread CHANGELOG.md Outdated
@Shaobo-Zhou
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 17, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/mqt/predictor/rl/predictorenv.py (1)

462-498: Consider clarifying the type annotation for pm_property_set.

Line 463 declares pm_property_set: PropertySet | None but initializes it to {} (an empty dict). The downstream code (lines 482, 489) converts pm.property_set to dict[str, Any], and line 503 checks for truthiness rather than None. For consistency:

-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None

Then ensure downstream code handles None appropriately (which it appears to do at line 506).

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae43600 and b2197b3.

📒 Files selected for processing (3)
  • CHANGELOG.md (2 hunks)
  • pyproject.toml (1 hunks)
  • src/mqt/predictor/rl/predictorenv.py (11 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-10-14T14:37:38.047Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.

Applied to files:

  • CHANGELOG.md
📚 Learning: 2025-12-05T17:45:37.602Z
Learnt from: denialhaag
Repo: munich-quantum-toolkit/core PR: 1360
File: .github/workflows/reusable-mlir-tests.yml:40-43
Timestamp: 2025-12-05T17:45:37.602Z
Learning: In the munich-quantum-toolkit/core repository, patch releases of LLVM dependencies don't require documentation updates, changelog entries, or additional tests beyond what's validated by passing CI checks.

Applied to files:

  • CHANGELOG.md
📚 Learning: 2025-12-01T11:00:40.342Z
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: CHANGELOG.md:18-18
Timestamp: 2025-12-01T11:00:40.342Z
Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the CHANGELOG.md intentionally references the parent MQT Core repository's release notes (https://github.com/munich-quantum-toolkit/core/releases) because the plugin repository is based on work performed in the parent repository.

Applied to files:

  • CHANGELOG.md
📚 Learning: 2025-11-27T21:26:39.677Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
📚 Learning: 2025-11-04T14:26:25.420Z
Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

  • src/mqt/predictor/rl/predictorenv.py
🧬 Code graph analysis (1)
src/mqt/predictor/rl/predictorenv.py (6)
src/mqt/predictor/reward.py (3)
  • crit_depth (40-43)
  • estimated_success_probability (102-226)
  • expected_fidelity (46-74)
src/mqt/predictor/rl/actions.py (3)
  • CompilationOrigin (92-98)
  • PassType (101-110)
  • Action (114-127)
src/mqt/predictor/rl/cost_model.py (2)
  • approx_estimated_success_probability (250-308)
  • approx_expected_fidelity (221-247)
src/mqt/predictor/rl/helper.py (1)
  • create_feature_dict (72-94)
src/mqt/predictor/rl/parsing.py (1)
  • final_layout_bqskit_to_qiskit (186-226)
src/mqt/predictor/utils.py (1)
  • calc_supermarq_features (93-145)
🪛 Ruff (0.14.8)
src/mqt/predictor/rl/predictorenv.py

214-214: Logging statement uses f-string

(G004)


245-245: Logging statement uses f-string

(G004)


274-274: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
  • GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
  • GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
🔇 Additional comments (8)
pyproject.toml (1)

127-128: LGTM! Test warning filter appropriately scoped.

The new warning filter suppresses expected UserWarning messages from the canonical cost model when testing with devices that lack predefined cost tables. This is a reasonable approach to reduce test noise while still catching unexpected warnings.

CHANGELOG.md (2)

14-15: LGTM! Changelog entries follow project conventions.

Both entries are properly formatted with the correct template structure. The previous formatting issue has been resolved.


50-51: LGTM! PR link references are correct.

The PR links are properly formatted and correspond to the new changelog entries.

src/mqt/predictor/rl/predictorenv.py (5)

25-26: LGTM! Import additions support new dual-path reward functionality.

The new imports (PropertySet, CheckMap, GatesInBasis, cost_model functions, calc_supermarq_features) are necessary for the dual-path reward computation and device-averages caching introduced in this PR.

Also applies to: 45-48, 58-69, 75-75


88-89: LGTM! Reward shaping parameters are now configurable.

The new reward_scale and no_effect_penalty parameters allow tuning reward behavior during RL training. The negative default for no_effect_penalty correctly penalizes actions that produce no improvement. The instance variables properly support the dual-path reward computation and device-averages caching.

Also applies to: 97-98, 183-192


194-276: LGTM! Dual-path reward computation is well-structured.

The rewritten step() method correctly implements shaped intermediate rewards with proper handling of:

  • Exact vs. approximate reward paths
  • Regime transitions (zeroing delta when switching from approx to exact)
  • Symmetric scaling for positive/negative deltas
  • No-effect penalty for actions that don't change the circuit

The logic is clear and the approach aligns with the PR objectives.

Minor: Remove unused noqa directive and use lazy logging.

Static analysis flags three minor issues:

  1. Line 274: The # noqa: SLF001 directive is unused and should be removed.
  2. Lines 214, 245: f-string logging should use lazy % formatting for performance (e.g., logger.info("Applying %s", self.action_set[action].name)).

Apply this diff:

-        logger.info(f"Applying {self.action_set[action].name}")
+        logger.info("Applying %s", self.action_set[action].name)
         
         # ... (code omitted)
         
-            logger.info(f"Final reward ({final_kind}): {final_val}")
+            logger.info("Final reward (%s): %s", final_kind, final_val)
             
         # ... (code omitted)
         
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout

277-361: LGTM! Dual-path reward calculation is well-implemented.

The calculate_reward() method correctly implements the dual-path (exact vs. approximate) reward computation:

  • The mode parameter allows flexible control over which path to use
  • The "auto" mode intelligently selects based on whether the circuit is native and mapped
  • Device-specific cost tables are correctly selected via device_id parameter
  • Unsupported reward functions gracefully fall back to exact computation with a warning

The implementation aligns well with the PR objectives for shaped intermediate rewards.


500-531: LGTM! Layout postprocessing refactor aligns with property set handling.

The signature change to accept dict[str, Any] | None instead of a PassManager is a cleaner design that decouples layout extraction from the PassManager object. The logic correctly handles missing property sets and properly constructs the TranspileLayout.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Collaborator

@flowerthrower flowerthrower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Shaobo-Zhou once again thank you very much for your continued efforts - LGTM! There is really only a small nitpick from my end left.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8a80d2 and 7810d4d.

📒 Files selected for processing (1)
  • src/mqt/predictor/rl/cost_model.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-09T13:20:11.483Z
Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.

Applied to files:

  • src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-11-01T15:57:31.153Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.

Applied to files:

  • src/mqt/predictor/rl/cost_model.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
  • GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
  • GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
🔇 Additional comments (5)
src/mqt/predictor/rl/cost_model.py (5)

1-36: LGTM: Clean module structure and imports.

The module-level organization is solid: clear documentation, appropriate use of TYPE_CHECKING for forward references, and a logger for diagnostics. The CanonicalCostTable type alias using Mapping correctly conveys immutability.


42-148: LGTM: Device-specific cost tables are well-documented.

The canonical cost tables correctly reflect different native gate sets across devices (CZ for Torino, iSWAP for Ankaa3, CZ for Emerald). The decomposition comments (e.g., CX → H•CZ•H ⇒ 6×1q + 1×2q) provide clear rationale. Gate coverage differences are intentional and align with each device's architecture.


151-169: LGTM: Safe fallback with appropriate warnings.

The unknown-device fallback to Torino is well-communicated via both warnings.warn and logger. The dual notification ensures visibility in different deployment contexts.


172-185: LGTM: Simple and correct delegation.

The function correctly wraps get_cost_table and provides the documented (0, 0) fallback for unknown gates.


223-249: LGTM: Correct fidelity approximation with defensive bounds.

The multiplicative fidelity model F = (1−p₁)^n₁ · (1−p₂)^n₂ is standard for independent gate errors. Defensive clamping and max() calls ensure numerical robustness.

Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
@burgholzer burgholzer added the feature New feature or request label Dec 25, 2025
Copy link
Copy Markdown
Member

@burgholzer burgholzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shaobo-Zhou just did a very quick review here in order to not hold this up for too long.
I think this will probably need one more round of iteration until it is ready to be merged. At the moment it feels like it's cutting a few corners too many.

Edit: There were also quite some CodeRabbit comments left, which should ideally be resolved.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread tests/compilation/test_predictor_rl.py Outdated
Comment thread docs/setup.md Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread CHANGELOG.md Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Around line 55-58: The CHANGELOG.md contains a link definition for [`#449`] that
is not referenced; either remove the unused link definition "[`#449`]:
https://github.com/munich-quantum-toolkit/predictor/pull/449" or add a
corresponding entry in the Unreleased changelog that references [`#449`] so
markdownlint no longer flags it as unused—update the Unreleased section to
include a line like "PR [`#449`] - <short description>" or delete the [`#449`] link
definition entirely.

In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 171-172: The current code uses QuantumCircuit.size() and writes
cost_table[...] = (qc_trans.size(), 0), which incorrectly treats all transpiled
basis gates as 1-qubit; instead iterate qc_trans.data (or
qc_trans.count_ops()/instructions) to compute separate 1-qubit and 2-qubit
counts and store them as the tuple (n_1q, n_2q) in cost_table. Locate the
transpile call and the assignment to cost_table (qc_trans and cost_table
entries) and replace the single size() value with explicit counting by
instruction arity (e.g., count len(instr.qargs) == 1 vs == 2) so downstream
fidelity/ESP uses both 1q and 2q contributions.
- Around line 150-167: The add_gate_to_cost_table function is double-prefixing
gates like "crx"/"cu3"/"ccx" causing invalid method names; fix by detecting when
the supplied gate string already encodes control prefixes and avoid prepending
"c"*controls in that case (e.g., if gate.startswith("c") then use gate as-is or
strip/interpret its inherent control count rather than adding extra "c"s), and
for the multi-control group (entries like "ccx","c3x","c4x") either replace them
with their base gate names plus the correct controls (e.g., "x" with
controls=2/3/4) or list each gate with the accurate controls count so
add_gate_to_cost_table uses the correct method name (refer to
add_gate_to_cost_table, the gate list entries and the logic building gate_name).

In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 312-321: calculate_reward currently falls through to the fallback
for the "estimated_hellinger_distance" mode because there is no explicit branch;
add a branch inside the same block (where crit_depth(qc) is handled) that checks
if self.reward_function == "estimated_hellinger_distance" and returns the
computed Hellinger reward and an appropriate tag (e.g.,
(estimated_hellinger_distance(qc), "exact") or call the existing helper used
elsewhere if named differently), so the estimator is used instead of returning
0.0 and logging a warning.
- Around line 484-489: The routing branch currently assumes pm_property_set
contains "final_layout" and directly assigns self.layout.final_layout =
pm_property_set["final_layout"], which can raise KeyError because
pm_property_set may be an empty dict; update the guard to verify the key exists
(e.g., use pm_property_set.get("final_layout") or an explicit 'if "final_layout"
in pm_property_set') before assigning, and only set self.layout.final_layout
when a non-None value is returned; target the branch that checks action_index in
self.actions_routing_indices and uses self.layout and pm_property_set.
- Around line 222-230: The current termination check "if not altered_qc" treats
empty QuantumCircuit objects as falsy and prematurely ends episodes; change the
check to explicitly test for None (i.e., "if altered_qc is None") so only
intentional termination from apply_action triggers episode end, leaving empty
but valid circuits to continue; update the branch that returns
create_feature_dict(self.state), 0.0, True, False, {} to run only when
altered_qc is None and keep using apply_action and create_feature_dict as
before.

Comment thread CHANGELOG.md
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mqt/predictor/reward.py (1)

207-212: ⚠️ Potential issue | 🟠 Major

Address incorrect exponential decay calculation when device.dt is unavailable.

The fallback device.dt or 1.0 is semantically incorrect. In Qiskit, instruction.duration is measured in time steps (units of dt), and device.dt converts those steps to seconds. Using 1.0 assumes each time step equals 1 second, but typical device.dt values are ~2.22e-10 seconds (nanoseconds). This causes the decay calculation to be off by ~10 orders of magnitude when dt is unavailable.

Qiskit's Target.dt is legitimately None for some backends (not a rare edge case). Either:

  1. Skip the exponential decay term for backends without calibrated dt
  2. Log a warning and use a more conservative estimate
  3. Propagate None and handle it at a higher level
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Line 119: The logging call in logger.warning currently uses an f-string which
forces formatting even when the warning level may be disabled; update the call
in the cost generation code (reference: the logger.warning line in
src/mqt/predictor/rl/cost_model.py that mentions "No basis for device") to use
lazy %-style or parameterized logging (e.g., logger.warning("No basis for device
'%s', using minimal universal basis for cost generation.", device_id)) so the
string is only formatted when the message will actually be emitted.
- Around line 167-168: Replace the bare "except Exception:" in
src/mqt/predictor/rl/cost_model.py with a specific exception catch for missing
gate methods (e.g., "except AttributeError:"), or a small explicit tuple like
"except (AttributeError, NotImplementedError):" so you only silence expected
missing-method errors while allowing other bugs (TypeError, ValueError, etc.) to
surface; update the except block that currently returns on error to use the
narrowed exception types.
- Around line 105-107: The if-check that tests "if transpile is None or
QuantumCircuit is None" is unreachable because those symbols are imported at
module level; remove that conditional and the accompanying ImportError so the
function no longer checks for None, or if you intended lazy importing, instead
move the Qiskit imports (transpile, QuantumCircuit) into the function and wrap
them in a try/except ImportError that raises the same message—update the code
around the function that references transpile/QuantumCircuit accordingly
(referenced symbols: transpile, QuantumCircuit in cost_model.py).
- Around line 170-171: The cost_table currently stores a flat total with a zero
second element; update generate_cost_table to compute and store the tuple (n_1q,
n_2q) by using the transpiled circuit qc_trans and the existing
estimate_basis_gate_counts utility and TWO_Q_GATES set: call
estimate_basis_gate_counts(qc_trans, basis_gates) (or the appropriate signature)
to get per-basis counts, sum counts for gates in TWO_Q_GATES to produce n_2q and
sum the remaining counts for n_1q, then assign cost_table[gate if controls == 0
else ("c" * controls) + gate] = (n_1q, n_2q) instead of (qc_trans.size(), 0).

In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 621-622: The getattr call checking "_dev_avgs_cached" is redundant
because __init__ already sets self._dev_avgs_cached = False; replace
getattr(self, "_dev_avgs_cached", False) with a direct attribute access
(self._dev_avgs_cached) in the code path shown, so the condition becomes if
self._dev_avgs_cached: return; keep the existing __init__ initialization as the
source of truth and remove the defensive default.
- Line 250: Replace the eager f-string logging call logger.info(f"Final reward
({self.prev_reward_kind}): {self.prev_reward}") with lazy logging formatting to
avoid unnecessary string interpolation; use logger.info("Final reward (%s): %s",
self.prev_reward_kind, self.prev_reward) so that self.prev_reward_kind and
self.prev_reward are passed as parameters and only formatted if the log level is
enabled.
- Line 221: Replace the f-string logging call with lazy logging formatting to
avoid eager interpolation: in the code that currently calls
logger.info(f"Applying {self.action_set[action].name}") (within PredictorEnv /
predictorenv.py, using the logger, self.action_set and action variables), call
logger.info with a format string and pass self.action_set[action].name as an
argument (e.g., "Applying %s", self.action_set[action].name) so interpolation is
deferred until the log is emitted.

Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mqt/predictor/utils.py (1)

49-55: ⚠️ Potential issue | 🟡 Minor

Update docstring: RuntimeWarning is no longer raised.

The Windows branch now logs and proceeds without raising; the Raises section is out of sync and should be corrected.

✏️ Suggested docstring update
-    Raises:
-        RuntimeWarning: If the timeout is not supported on the current platform (e.g., Windows).
-        TimeoutExceptionError: If the function call exceeds the timeout limit.
+    Raises:
+        TimeoutExceptionError: If the function call exceeds the timeout limit (non-Windows).
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 33-58: Define the missing TORINO_CANONICAL_COSTS constant and
implement a canonical_cost function and unknown-device fallback used by
get_cost_table: add a TORINO_CANONICAL_COSTS mapping containing the canonical
cost table for the Torino device, implement canonical_cost(device_name: str) ->
CanonicalCostTable to return a known mapping for recognized names and otherwise
log a warning and return TORINO_CANONICAL_COSTS as the default; modify
get_cost_table to call canonical_cost when the device is not in
DEVICE_COST_CACHE/KNOWN_DEVICE_BASES instead of generating a new random table,
and ensure TORINO_CANONICAL_COSTS, canonical_cost, and get_cost_table are
exported/defined in the module so tests can import them.

Comment thread src/mqt/predictor/rl/cost_model.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/approx_reward.py`:
- Around line 62-63: The code repeatedly calls estimate_basis_gate_counts (which
internally transpiles) for the same circuit state; modify calculate_reward
(and/or predictorenv.py's step) to memoize the basis-gate counts keyed by a
canonical representation of the quantum circuit (e.g., serialized QASM or a hash
of qc.data + device target), reuse the cached result for subsequent calls, and
fall back to calling get_basis_gates_from_target and estimate_basis_gate_counts
only on cache misses; ensure the cache key and lookup are used wherever basis =
get_basis_gates_from_target(device) and counts = estimate_basis_gate_counts(qc,
basis_gates=basis) are invoked so duplicate transpilation is avoided.
- Around line 30-40: The loop in estimate_basis_gate_counts uses deprecated
tuple unpacking of qc_t.data; replace it by iterating over CircuitInstruction
objects and access attributes directly (operation/qubits/clbits). Concretely, in
estimate_basis_gate_counts change the loop to iterate like "for circ_instr in
qc_t.data:", extract the gate with "instr = circ_instr.operation" (and use
circ_instr.qubits / circ_instr.clbits if needed), then use instr.name and the
existing BLACKLIST/counts logic.

In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 259-261: The code only suppresses delta_reward when transitioning
from "approx"→"exact"; change the logic in the reward computation (referencing
prev_reward_kind, new_kind, and delta_reward in predictorenv.py) to suppress the
delta for any regime change (i.e., if prev_reward_kind != new_kind then set
delta_reward = 0.0) and ensure prev_reward_kind is still updated after this
check so later steps see the new regime; adjust the conditional that currently
checks (self.prev_reward_kind == "approx" and new_kind == "exact") to a generic
regime-change check and keep existing comments about metrics comparability.
- Around line 623-624: Replace the local duplicate gate_blacklist in
_ensure_device_averages_cached with the shared BLACKLIST constant from
approx_reward to avoid divergence; import BLACKLIST from the approx_reward
module and use it when building basis_ops (i.e., change basis_ops = [name for
name in op_names if name not in gate_blacklist] to use BLACKLIST), removing the
local {"measure","reset","delay","barrier"} definition.
- Around line 263-268: The two branches computing reward_val in predictorenv.py
are identical; replace the if/elif block that compares delta_reward with a
single assignment using the existing symbols: set reward_val = self.reward_scale
* delta_reward (preserving delta_reward == 0.0 result) inside the same function
where delta_reward and reward_val are used so you remove the redundant
conditional logic.
- Around line 218-229: Calculate and assign prev_reward and prev_reward_kind
only when the chosen action is non-terminal: move the call to calculate_reward
from before apply_action into the branch that handles non-terminal actions
(i.e., the branch that proceeds after altered_qc is not None and not the
terminate action), so terminal/terminate actions skip the expensive pre-action
evaluation; ensure apply_action, calculate_reward, and the used_actions logic
remain unchanged. Also initialize prev_reward and prev_reward_kind in reset()
(or add a guard before use) so the first step has defined values when no prior
reward exists.

Comment thread src/mqt/predictor/rl/approx_reward.py
Comment thread src/mqt/predictor/rl/approx_reward.py
Comment thread src/mqt/predictor/rl/predictorenv.py
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 268-352: Update the calculate_reward signature to narrow the mode
parameter from plain str to a Literal type (Literal["auto", "exact", "approx"])
to catch invalid modes at type-check time; add the appropriate import for
Literal (from typing or typing_extensions depending on project compatibility)
and update any affected type hints/exports so static checkers (mypy/pyright)
pick it up—no logic changes needed inside calculate_reward or its use of
self._is_native_and_mapped, expected_fidelity,
approx_estimated_success_probability, etc.

Comment thread src/mqt/predictor/rl/predictorenv.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)

104-104: 🧹 Nitpick | 🔵 Trivial

Use lazy logging formatting for consistency.

String concatenation in the log call is inconsistent with the %s-style formatting used elsewhere in this file (e.g., lines 217, 271). Ruff rule G003 discourages eager string building in logging calls.

♻️ Suggested fix
-        logger.info("Init env: " + reward_function)
+        logger.info("Init env: %s", reward_function)
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 244-245: The call to calculate_reward at the start of each step is
redundant because self.prev_reward and self.prev_reward_kind already hold the
post-action reward from the previous step; remove the redundant invocation in
the step logic (the line calling calculate_reward that sets
self.prev_reward/self.prev_reward_kind) and instead initialize/seed
self.prev_reward and self.prev_reward_kind during reset() by calling
calculate_reward() once there, ensuring the end-of-step code that updates
prev_reward/prev_reward_kind (lines around where post-action reward is set)
continues to overwrite for the next step; adjust any references in step() and
reset() accordingly to reuse the carried-over prev_reward/prev_reward_kind
without recomputing on each new step.
- Around line 726-735: _determine_valid_actions_for_state and
_is_native_and_mapped duplicate expensive GatesInBasis/CheckMap work; cache the
pass instances and share computed results to avoid repeated checks. Create
cached pass instances on self (e.g., self._gates_in_basis and self._check_map)
initialized once and reused in both determine_valid_actions_for_state and
_is_native_and_mapped, and have determine_valid_actions_for_state store the last
computed flags (e.g., self._last_only_nat_gates and self._last_mapped) after
running the checks so _is_native_and_mapped can return those cached booleans if
available instead of re-running the passes; fall back to running the cached pass
instances if no cached results exist and update the cache.
- Around line 218-242: The estimated_hellinger_distance branch duplicates the
apply→decompose→update-state→check-valid-actions logic found later; extract that
shared sequence into a helper (e.g. _apply_and_update or
_apply_action_and_update_state) which calls apply_action, performs the gate-type
decompose loop, sets self.state and self.state._layout, increments
self.num_steps, updates self.valid_actions via determine_valid_actions_for_state
and raises RuntimeError if empty; then replace the duplicated block in the
reward_function == "estimated_hellinger_distance" branch to call the helper and
only keep the branch-specific reward logic (calculate_reward(mode="exact") vs
no_effect_penalty) before returning create_feature_dict(self.state), reward,
done, False, {}.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/compilation/test_predictor_rl.py`:
- Around line 149-204: Tests test_approx_reward_ef and test_approx_reward_esp
duplicate setup and assertions; replace them with a single parametrized test
using pytest.mark.parametrize over figure_of_merit values ("expected_fidelity"
and "estimated_success_probability") that creates Predictor(figure_of_merit=...)
and monkeypatches predictor.env._is_native_and_mapped the same way, runs val,
kind = predictor.env.calculate_reward(...), asserts common conditions (kind ==
"approx", 0<=val<=1, predictor.env._dev_avgs_cached, isinstance checks and
len(predictor.env._err_by_gate)>0, len(predictor.env._dur_by_gate)>0), and then
include the ESP-specific assertions (assert len(..._dur_by_gate)>0 and assert
predictor.env._tbar is None or predictor.env._tbar > 0.0) only when the
parameter == "estimated_success_probability"; keep references to Predictor,
predictor.env.calculate_reward, _is_native_and_mapped, _err_by_gate,
_dur_by_gate, and _tbar to locate the code.

Comment thread tests/compilation/test_predictor_rl.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)

562-576: 🧹 Nitpick | 🔵 Trivial

determine_valid_actions_for_state creates fresh pass instances while _native_and_mapped reuses cached ones.

Both methods perform the same GatesInBasis + CheckMap checks. determine_valid_actions_for_state still instantiates new objects each call (lines 564, 574), while _native_and_mapped uses the cached instances from __init__. Consider reusing the cached instances here too, or calling _native_and_mapped internally.

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 705-717: The cache in _native_and_mapped uses id(qc) which can be
recycled and cause stale hits; either remove the cache entirely and always run
_gates_in_basis_check and _check_map, or replace the id-based key with a safe
version counter: add a _state_version integer on the class (initialize in
__init__), increment _state_version whenever self.state is reassigned, store and
compare _last_state_version (instead of _last_qc_id) alongside
_last_native_mapped in _native_and_mapped, and update that version when you set
self.state so the cached (only_native, mapped) is only reused for the current
state.

In `@tests/compilation/test_predictor_rl.py`:
- Around line 175-178: The condition is checking the imported type alias
figure_of_merit instead of the test parameter fom, so the ESP-specific
assertions (predictor.env._dur_by_gate and predictor.env._tbar) never run;
update the conditional to use the test parameter (if fom ==
"estimated_success_probability") so the assertions execute for the ESP reward
path and validate duration/tbar caching for predictor.env._dur_by_gate and
predictor.env._tbar.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread tests/compilation/test_predictor_rl.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 636-639: The suppress(KeyError, AttributeError, TypeError) around
the basis_ops loop is too broad and can hide legitimate TypeError bugs when
calling target.operation_from_name or reading op.num_qubits; change the
exception list to (KeyError, AttributeError) (i.e., remove TypeError) and
optionally add a debug/log line inside the except path that records the gate
name and the caught exception so skipped gates are visible; locate the loop
using basis_ops, target.operation_from_name, arity_by_name and op.num_qubits to
make the change.
- Around line 238-243: The code path for reward_function ==
"estimated_hellinger_distance" only gives the episode terminal step a real
reward and assigns no_effect_penalty for every non-terminal step, which leaves
the agent without shaping signal; update the branch in predictorenv.py (the
block checking self.reward_function == "estimated_hellinger_distance") to either
implement a non-terminal shaping heuristic or, if this is intentional, add a
concise explanatory comment above the block referencing that choice and the
involved symbols (self.reward_function, "estimated_hellinger_distance",
self.no_effect_penalty, self.calculate_reward, and self.action_terminate_index)
so readers know why calculate_reward is only called at done and why non-terminal
steps receive no signal.

In `@tests/compilation/test_predictor_rl.py`:
- Around line 165-170: The monkeypatch targets a non-existent method
`_is_native_and_mapped` and has no effect; update the test to force the
approximate path by calling calculate_reward with mode="approx" instead of
relying on device mapping heuristics. Locate the call to
predictor.env.calculate_reward(qc=qc, mode="auto") in the test and change the
mode argument to "approx" (remove the monkeypatch line that sets
`_is_native_and_mapped`), ensuring the test exercises the approximate reward
branch in calculate_reward rather than depending on GatesInBasis/CheckMap
behavior.

Comment on lines +238 to +243
done = action == self.action_terminate_index

if self.reward_function == "estimated_hellinger_distance":
reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty
self.state._layout = self.layout # noqa: SLF001
return create_feature_dict(self.state), reward_val, done, False, {}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Hellinger path provides no shaping signal — non-terminal steps always get a flat penalty.

This is a design choice, but it means the RL agent receives zero useful gradient signal during an episode for estimated_hellinger_distance — reward information only comes at termination. If this is intentional (e.g., because there's no approximate Hellinger model), a brief comment explaining the rationale would help future readers.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 242-242: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In `@src/mqt/predictor/rl/predictorenv.py` around lines 238 - 243, The code path
for reward_function == "estimated_hellinger_distance" only gives the episode
terminal step a real reward and assigns no_effect_penalty for every non-terminal
step, which leaves the agent without shaping signal; update the branch in
predictorenv.py (the block checking self.reward_function ==
"estimated_hellinger_distance") to either implement a non-terminal shaping
heuristic or, if this is intentional, add a concise explanatory comment above
the block referencing that choice and the involved symbols
(self.reward_function, "estimated_hellinger_distance", self.no_effect_penalty,
self.calculate_reward, and self.action_terminate_index) so readers know why
calculate_reward is only called at done and why non-terminal steps receive no
signal.

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flowerthrower You're absolutely right — this does look like a design oversight rather than an intentional choice.

The current code assigns no_effect_penalty to every non-terminal step when using estimated_hellinger_distance, regardless of whether the circuit actually improved or degraded. This is inconsistent with how the environment handles expected_fidelity and estimated_success_probability below (lines 244+), where:

  1. The previous reward is cached
  2. A new reward is computed after each action
  3. The step reward is the scaled delta (reward_scale * (new_val - prev_reward))
  4. Only when delta == 0.0 does it fall back to no_effect_penalty

The absence of an approximate Hellinger model simply means you can't use the transpile-based proxy — but you can still compute the exact Hellinger distance at each step via self.calculate_reward(mode="exact") and shape rewards based on the delta, just like EF/ESP do.

Suggested fix:
Remove the special-case early return for Hellinger and let it flow through the same delta-based shaping logic below, but always call calculate_reward(mode="exact") (since mode="auto" would also choose "exact" anyway, as there's no approximate path). That would give the agent meaningful gradient signal during the episode.


✏️ Learnings added
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/predictor PR: 526
File: src/mqt/predictor/rl/predictorenv.py:238-243
Timestamp: 2026-02-09T13:15:50.387Z
Learning: When using reward_function == "estimated_hellinger_distance" in PredictorEnv (src/mqt/predictor/rl/predictorenv.py), the current implementation only computes the exact reward at episode termination and assigns no_effect_penalty to all non-terminal steps, which prevents intermediate shaping signal. This is likely unintentional; the Hellinger distance can be computed exactly at each step and used for delta-based reward shaping, similar to expected_fidelity and estimated_success_probability.
</add_learning>

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread tests/compilation/test_predictor_rl.py Outdated
Copy link
Copy Markdown
Collaborator

@flowerthrower flowerthrower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there, great work so far. Only a few comments left from my end.

Comment thread src/mqt/predictor/rl/approx_reward.py Outdated
Comment on lines +238 to +243
done = action == self.action_terminate_index

if self.reward_function == "estimated_hellinger_distance":
reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty
self.state._layout = self.layout # noqa: SLF001
return create_feature_dict(self.state), reward_val, done, False, {}

This comment was marked as resolved.

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
new_val, new_kind = self.calculate_reward(mode="auto")
delta_reward = new_val - self.prev_reward

if self.prev_reward_kind != new_kind:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we apply a no-effect penalty here? Doesn't the change of the reward_kind imply that the action DID have an effect?

Copy link
Copy Markdown
Collaborator

@flowerthrower flowerthrower Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was marked unresolved but is still the case? @Shaobo-Zhou

Comment thread src/mqt/predictor/rl/predictorenv.py
@flowerthrower
Copy link
Copy Markdown
Collaborator

flowerthrower commented Feb 24, 2026

@coderabbitai provide a detailed summary of the structural and methodical changes in this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 24, 2026

Tip

For best results, initiate chat on the files or code changes.

@flowerthrower: Here's a detailed breakdown of the structural and methodical changes in PR #526:

🏗️ Structural Changes

New Module

src/mqt/predictor/rl/approx_reward.py (116 lines added)

  • Introduces heuristic reward estimation for non-native/unmapped circuits
  • Public functions added:
    • get_basis_gates_from_target(): Extracts basis gates from a Qiskit Target
    • estimate_basis_gate_counts(): Transpiles circuits to basis gates and counts occurrences
    • approx_expected_fidelity(): Computes product-based fidelity estimate using per-gate error rates
    • approx_estimated_success_probability(): Adds idle/decoherence penalties based on circuit duration and coherence time

Core Environment Refactoring (src/mqt/predictor/rl/predictorenv.py)

Constructor changes:

  • Added parameters: reward_scale: float = 1.0, no_effect_penalty: float = -0.001

New public methods:

  • calculate_reward(qc, mode="auto") -> tuple[float, str]: Central reward computation with three modes:
    • "exact": Uses calibration-aware metrics (expected_fidelity/ESP)
    • "approx": Uses conservative canonical cost model from new approx_reward module
    • "auto": Automatically selects exact vs. approx based on circuit native-ness and mapping status
  • _ensure_device_averages_cached(): Caches per-gate error rates, durations, and coherence scale (_tbar) for fast approximate evaluation

Signature changes:

  • _handle_qiskit_layout_postprocessing(): Now accepts PropertySet dict instead of PassManager for layout propagation

Internal flow updates:

  • New _apply_and_update() helper consolidates action application, decomposition, state update, and validation
  • Reward computation now tracks prev_reward and prev_reward_kind to enable delta-based reward shaping
  • Handles regime transitions (exact ↔ approx) gracefully during episode

Type System Changes (src/mqt/predictor/rl/actions.py)

  • DeviceDependentAction.do_while: Callback type changed from dict[str, Circuit] to PropertySet

Supporting Changes

  • src/mqt/predictor/reward.py: Added fallback dt = device.dt or 1.0 for devices without timing info
  • src/mqt/predictor/utils.py: Replaced Windows RuntimeWarning with logger.info for timeout unsupported message
  • src/mqt/predictor/ml/predictor.py: Whitespace cleanup (removed empty line)

🧠 Methodical Changes

From Terminal to Stepwise Rewards

Previous behavior:

  • As noted in the retrieved learning, reward functions like estimated_hellinger_distance only computed exact rewards at episode termination
  • Intermediate steps received no_effect_penalty, providing no shaping signal

New behavior:

  • Delta-based reward shaping: reward = (current_metric - prev_metric) * reward_scale + no_effect_penalty
  • Terminal step uses exact final evaluation
  • Enables gradient information throughout the episode

Two-Regime Reward System

Exact Regime (native + mapped circuits):

Step reward = Δ(exact metric) × reward_scale + no_effect_penalty
  • Applies to circuits using only device-native gates respecting coupling map
  • Verified via internal _is_native_and_mapped() check (uses GatesInBasis and CheckMap passes)
  • Metrics: calibration-aware expected_fidelity or estimated_success_probability

Approximate Regime (non-native/unmapped circuits):

Step reward = Δ(approx metric) × reward_scale + no_effect_penalty
  • Conservative estimates for circuits with non-native gates or topology violations
  • Approximations:
    • Expected fidelity: ∏ᵍ (1 - pₑᵣᵣ)^count(g) over basis gates
    • ESP: Gate fidelity term × exp(-circuit_duration / tbar) coherence decay

Auto mode: Dynamically selects regime based on circuit state

Optimization: Calibration Caching

  • Device error rates, gate durations, and coherence time (_tbar) are computed once and cached
  • Prevents redundant Target API queries during episode
  • Raises informative errors if calibration data is missing

🧪 Test Updates

New test (tests/compilation/test_predictor_rl.py):

  • test_approx_reward_paths_use_cached_per_gate_maps(): Validates approximate reward path, cache population, and output bounds [0,1]

Test configuration changes:

  • Replaced BenchmarkLevel.ALGBenchmarkLevel.INDEP across multiple test files
  • Added mapping validation using CheckMap pass in test_qcompile_with_newly_trained_models
  • Increased timeout from 600 → 6000 in Hellinger distance tests

📝 Documentation

  • CHANGELOG.md: Added "✨ Improve RL reward design by adding intermediate rewards" under Unreleased/Changed
  • Removed support notes for custom model names and x86 macOS

This PR fundamentally shifts the RL training paradigm from sparse terminal rewards to dense intermediate feedback, enabling the agent to learn from every compilation step. The two-regime design maintains exactness when possible while providing conservative estimates when necessary, balancing computational efficiency with training signal quality.


🧠 Learnings used
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/predictor PR: 526
File: src/mqt/predictor/rl/predictorenv.py:238-243
Timestamp: 2026-02-09T13:15:55.306Z
Learning: When using reward_function == "estimated_hellinger_distance" in PredictorEnv (src/mqt/predictor/rl/predictorenv.py), the current implementation only computes the exact reward at episode termination and assigns no_effect_penalty to all non-terminal steps, which prevents intermediate shaping signal. This is likely unintentional; the Hellinger distance can be computed exactly at each step and used for delta-based reward shaping, similar to expected_fidelity and estimated_success_probability.
</add_learning>

@flowerthrower flowerthrower changed the title Add stepwise intermediate reward for RL ✨ Add stepwise intermediate reward for RL Feb 24, 2026
@flowerthrower flowerthrower changed the base branch from main to qce-experiments March 11, 2026 08:15
@mergify
Copy link
Copy Markdown

mergify Bot commented Mar 11, 2026

🧪 CI Insights

Here's what we observed from your CI run for 6e5e9b3.

🟢 All jobs passed!

But CI Insights is watching 👀

Copy link
Copy Markdown
Collaborator

@flowerthrower flowerthrower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Shaobo, I didn't quite finish the review yet but wanted to share some initial thoughts already. No hurry, but it would be nice to either address these issues or discuss them next week. Anyway I am happy to see this progressing already!👍

Comment thread docs/setup.md Outdated
Comment thread pyproject.toml Outdated
Comment thread src/mqt/predictor/reward.py Outdated
Comment thread src/mqt/predictor/reward.py Outdated
Comment thread src/mqt/predictor/rl/approx_reward.py Outdated
Comment thread src/mqt/predictor/rl/approx_reward.py
Comment thread src/mqt/predictor/rl/approx_reward.py
Comment thread src/mqt/predictor/rl/predictor.py
Comment thread tests/compilation/test_predictor_rl.py
Comment thread tests/compilation/test_predictor_rl.py
Comment thread src/mqt/predictor/rl/predictorenv.py
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
new_val, new_kind = self.calculate_reward(mode="auto")
delta_reward = new_val - self.prev_reward

if self.prev_reward_kind != new_kind:
Copy link
Copy Markdown
Collaborator

@flowerthrower flowerthrower Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was marked unresolved but is still the case? @Shaobo-Zhou

Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
@flowerthrower flowerthrower merged commit f628b73 into munich-quantum-toolkit:qce-experiments Mar 18, 2026
14 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Apr 13, 2026
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants