action distribution for estimating V #8

@immars

Description

Hi,
First of all, thanks for this inspiring work!

In

next_value = tf.reduce_logsumexp(q_value_targets, axis=1)

it seems to me that the action is sampled from a uniform distribution when estimating V_{soft}.

In Sec. 3.2. of your original paper, it is stated that

For q_a we have more options. A convenient choice is a uniform distribution. However, this choice can scale poorly to high dimensions. A better choice is to use the current policy, which produces an unbiased estimate of the soft value as can be confirmed by substitution.

Have you experimented with sampling from the current policy to estimate V? Or, how well does the uniform distribution work in practice, especially in higher-dimensional cases?
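For concreteness, here is a minimal NumPy sketch (not the repo's code) of the two estimators being compared. Both compute the importance-sampled soft value V(s) = alpha * log E_{a~q}[exp(Q(s,a)/alpha) / q(a)]; with a uniform proposal this reduces to the `reduce_logsumexp` over sampled actions, up to an additive constant (log of the action-space volume minus log of the sample count) that the snippet above drops. The 1-D quadratic Q and the Gaussian stand-in for "the current policy" are illustrative assumptions:

```python
import numpy as np

def log_mean_exp(x):
    """Numerically stable log(mean(exp(x)))."""
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))

def soft_value(q_fn, actions, log_q, alpha=1.0):
    """Importance-sampled soft value estimate:
    V(s) = alpha * log E_{a~q}[exp(Q(s,a)/alpha) / q(a)],
    where `actions` are samples from the proposal q and
    `log_q` holds their log-densities under q."""
    return alpha * log_mean_exp(q_fn(actions) / alpha - log_q)

# Toy 1-D example: Q(a) = -a^2/2, so with alpha = 1 the true
# soft value is log(integral of exp(-a^2/2)) = 0.5 * log(2*pi).
q_fn = lambda a: -0.5 * a**2
rng = np.random.default_rng(0)

# (1) Uniform proposal over [-5, 5]; log-density is -log(10).
a_u = rng.uniform(-5.0, 5.0, size=10_000)
v_uniform = soft_value(q_fn, a_u, np.full_like(a_u, -np.log(10.0)))

# (2) Proposal matched to the policy; a standard Gaussian here,
# which makes the importance weights constant (zero variance).
a_p = rng.normal(0.0, 1.0, size=10_000)
log_q_p = -0.5 * a_p**2 - 0.5 * np.log(2 * np.pi)
v_policy = soft_value(q_fn, a_p, log_q_p)

print(v_uniform, v_policy)  # both near 0.5*log(2*pi)
```

In this toy case both estimators converge to the same value, but the policy-matched proposal has far lower variance, which is the scaling argument the paper makes for high-dimensional action spaces.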

thanks,
