Hello,
I have an MDP model where I need the scheduler to generate the optimal policy while maximizing rewards. I have a PRISM file with modules for transitions and for rewards (module "all_rewards").
When I run this .pm file and property in PRISM and in Storm CLI, both run smoothly and produce the same policy. However, when I run it in StormPy, I get a totally wrong set of policies.... It seems like the model is not reading my rewards correctly or is not actually maximizing them. I am not sure which setting I might have gotten wrong. Below is my current StormPy script:
import stormpy
Input
prism_file = "5_updated.pm"
formula_str = 'R{"all_rewards"}max=? [ F "termination" ]'
Parse model and property
prism_program = stormpy.parse_prism_program(prism_file)
properties = stormpy.parse_properties(formula_str, prism_program)
Builder options
options = stormpy.BuilderOptions()
options.set_build_state_valuations(True)
options.set_build_choice_labels(True)
options.set_build_with_choice_origins(True)
Build and solve
model = stormpy.build_sparse_model_with_options(prism_program, options)
result = stormpy.model_checking(model, properties[0], extract_scheduler=True)
scheduler = result.scheduler
Export policy
for state in model.states:
choice = scheduler.get_choice(state)
action_index = choice.get_deterministic_choice()
action = state.actions[action_index]
f"state {state.id}: {state.valuations} => {action.labels}\n"
Could you please let me know if there is any setting I might be missing that ensures the rewards are correctly read and applied during scheduler synthesis, and if I am following the correct steps to extract the scheduler for this type of reward-maximization property?
Thank you for your time and guidance.
Best regards,
Wendy
Hello,
I have an MDP model where I need the scheduler to generate the optimal policy while maximizing rewards. I have a PRISM file with modules for transitions and for rewards (module "all_rewards").
When I run this .pm file and property in PRISM and in Storm CLI, both run smoothly and produce the same policy. However, when I run it in StormPy, I get a totally wrong set of policies.... It seems like the model is not reading my rewards correctly or is not actually maximizing them. I am not sure which setting I might have gotten wrong. Below is my current StormPy script:
Could you please let me know if there is any setting I might be missing that ensures the rewards are correctly read and applied during scheduler synthesis, and if I am following the correct steps to extract the scheduler for this type of reward-maximization property?
Thank you for your time and guidance.
Best regards,
Wendy