The RL agent currently only learns from terminal rewards. Intermediate rewards lead to more _efficient_ policies.

Mostly implemented in #526