Hi,
First of all, thanks for this inspiring work!
In `softqlearning/softqlearning/algorithms/sql.py` (line 164 at 59c0bbb),

```python
next_value = tf.reduce_logsumexp(q_value_targets, axis=1)
```

it seems to me that actions are sampled from a uniform distribution when estimating V_{soft}.
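If I read the surrounding code correctly, that line is the Monte Carlo estimate of V_{soft}(s) = log ∫ exp(Q(s, a)) da with a uniform proposal over the action box. A minimal sketch of what I believe the full estimator looks like, with the normalization terms written out (the function name and the [-1, 1]^d action-bound assumption are mine, not from the repo):

```python
import numpy as np
import tensorflow as tf

def soft_value_uniform(q_value_targets, n_particles, action_dim):
    # V_soft(s) = log \int exp(Q(s, a)) da   (alpha = 1)
    #          ~= log [ (2^action_dim / n_particles) * sum_i exp(Q(s, a_i)) ]
    # with a_i ~ Uniform([-1, 1]^action_dim).
    next_value = tf.reduce_logsumexp(q_value_targets, axis=1)  # log sum_i exp(Q)
    next_value -= tf.log(tf.cast(n_particles, tf.float32))     # - log N
    next_value += action_dim * np.log(2)                       # + log vol([-1, 1]^d)
    return next_value
```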
In Sec. 3.2 of your original paper, it is stated that:
> For q_a we have more options. A convenient choice is a uniform distribution. However, this choice can scale poorly to high dimensions. A better choice is to use the current policy, which produces an unbiased estimate of the soft value, as can be confirmed by substitution.
Have you experimented with sampling from the current policy to estimate V? Or, how well does the uniform distribution do in practice, especially in higher-dimensional cases?
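For concreteness, here is a rough sketch of what I imagine the policy-based variant from Sec. 3.2 would look like, i.e. the same logsumexp with importance weights 1/π(a_i | s). The `log_pis` input (log π(a_i | s) for the sampled actions) is my assumption; as far as I can tell the current sampler doesn't expose those log-probabilities.

```python
import tensorflow as tf

def soft_value_from_policy(q_value_targets, log_pis, n_particles):
    # V_soft(s) ~= log (1/N) sum_i exp(Q(s, a_i) - log pi(a_i | s)),
    # with a_i ~ pi(. | s); weighting by 1/pi corrects for the proposal,
    # per Sec. 3.2 of the paper.
    next_value = tf.reduce_logsumexp(q_value_targets - log_pis, axis=1)
    next_value -= tf.log(tf.cast(n_particles, tf.float32))  # - log N
    return next_value
```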
thanks,