It uses two networks, each with indipendent noise, with parameter
Instead of
Use current network to evaluate action and target network to evaluate the value. The target network is updates only every N steps
It uses two networks, each with indipendent noise, with parameter
Instead of
Use current network to evaluate action and target network to evaluate the value. The target network is updates only every N steps