Hi,
First of all, thanks for this inspiring work!
In `softqlearning/softqlearning/algorithms/sql.py` (line 164 at 59c0bbb),

```python
next_value = tf.reduce_logsumexp(q_value_targets, axis=1)
```

it seems to me that actions are sampled from a uniform distribution when estimating V_{soft}.
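If I read the surrounding code correctly, that line is the Monte Carlo estimate of V_{soft}(s) = log ∫ exp(Q(s, a)) da with a uniform proposal over the action box. A minimal sketch of what I believe the full estimator looks like, with the normalization terms written out (the function name and the [-1, 1]^d action-bound assumption are mine, not from the repo):

```python
import numpy as np
import tensorflow as tf

def soft_value_uniform(q_value_targets, n_particles, action_dim):
    # V_soft(s) = log \int exp(Q(s, a)) da   (alpha = 1)
    #          ~= log [ (2^action_dim / n_particles) * sum_i exp(Q(s, a_i)) ]
    # with a_i ~ Uniform([-1, 1]^action_dim).
    next_value = tf.reduce_logsumexp(q_value_targets, axis=1)  # log sum_i exp(Q)
    next_value -= tf.log(tf.cast(n_particles, tf.float32))     # - log N
    next_value += action_dim * np.log(2)                       # + log vol([-1, 1]^d)
    return next_value
```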
In Sec. 3.2 of your original paper, it is stated that:
> For q_a we have more options. A convenient choice is a uniform distribution. However, this choice can scale poorly to high dimensions. A better choice is to use the current policy, which produces an unbiased estimate of the soft value, as can be confirmed by substitution.
Have you experimented with sampling from the current policy to estimate V? Or, how well does the uniform distribution do in practice, especially in higher-dimensional cases?
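For concreteness, here is a rough sketch of what I imagine the policy-based variant from Sec. 3.2 would look like, i.e. the same logsumexp with importance weights 1/π(a_i | s). The `log_pis` input (log π(a_i | s) for the sampled actions) is my assumption; as far as I can tell the current sampler doesn't expose those log-probabilities.

```python
import tensorflow as tf

def soft_value_from_policy(q_value_targets, log_pis, n_particles):
    # V_soft(s) ~= log (1/N) sum_i exp(Q(s, a_i) - log pi(a_i | s)),
    # with a_i ~ pi(. | s); weighting by 1/pi corrects for the proposal,
    # per Sec. 3.2 of the paper.
    next_value = tf.reduce_logsumexp(q_value_targets - log_pis, axis=1)
    next_value -= tf.log(tf.cast(n_particles, tf.float32))  # - log N
    return next_value
```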
thanks,