The compute_loss function is wrong for the Simplest Policy Gradient #414

alantpetrescu · 2024-06-05T07:54:08Z

I have been reading the three parts of the "Introduction to RL" section, and I noticed in Part 3 that the compute_loss function for the Simplest Policy Gradient returns the mean of the products of the log-probabilities of the actions taken by the agent and the weights of those actions, where each weight is the finite-horizon undiscounted return of the episode in which the action was taken.

In the estimate of the basic policy gradient given above that code, the sum of products is divided by the number of trajectories, but in the implementation, taking the mean divides the sum of products by the total number of actions taken across all trajectories in one epoch. Maybe I am misunderstanding this, but I wanted to get a clear picture of the implementation.
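
To make the comparison concrete, the two normalizations can be written side by side. This is a sketch in notation assumed here rather than taken verbatim from the docs: $\mathcal{D}$ is the set of trajectories collected in one epoch, $T_\tau$ is the number of actions in trajectory $\tau$, and the labels "doc" and "code" are only for illustration. The loss in code also carries a minus sign, which is omitted below.

```latex
\hat{g}_{\text{doc}}
  = \frac{1}{|\mathcal{D}|}
    \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T_\tau - 1}
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)

\hat{g}_{\text{code}}
  = \frac{1}{\sum_{\tau \in \mathcal{D}} T_\tau}
    \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T_\tau - 1}
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)
  = \frac{|\mathcal{D}|}{\sum_{\tau \in \mathcal{D}} T_\tau}\, \hat{g}_{\text{doc}}
```

Under these assumptions the two estimates point in the same direction and differ only by the positive factor $|\mathcal{D}| / \sum_{\tau} T_\tau$, the reciprocal of the average trajectory length in the batch.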

hirodeng · 2024-06-28T02:20:45Z

I noticed the same problem.

earnesdm · 2024-08-25T20:52:09Z

@alantpetrescu I think you are correct that the equation written differs from what is implemented in code, but only by a constant multiple. Since we multiply the gradient estimate by the learning rate when performing gradient ascent, the constant multiple doesn't really matter.
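
A small numerical check of the scaling argument, as a sketch rather than the actual Spinning Up code: the batch values and the helper names `loss_per_trajectory` and `loss_per_step` are hypothetical and chosen only for illustration. It uses plain Python with no autodiff; because differentiation is linear, the same factor that relates the two loss values also relates their gradients.

```python
# Hypothetical batch: two trajectories of different lengths.
# Each entry is (log-probability of the action taken, weight R(tau)).
batch = [
    [(-0.5, 3.0), (-1.0, 3.0), (-0.25, 3.0)],  # trajectory 1: 3 steps, return 3.0
    [(-2.0, 1.0)],                             # trajectory 2: 1 step, return 1.0
]


def loss_per_trajectory(batch):
    """Negative sum of logp * weight, divided by the number of trajectories."""
    total = sum(logp * w for traj in batch for logp, w in traj)
    return -total / len(batch)


def loss_per_step(batch):
    """Negative mean of logp * weight over every step in the batch."""
    terms = [logp * w for traj in batch for logp, w in traj]
    return -sum(terms) / len(terms)


num_trajectories = len(batch)
num_steps = sum(len(traj) for traj in batch)

# The two losses differ by the average trajectory length in the batch.
scale = num_steps / num_trajectories

print(loss_per_trajectory(batch))      # normalized by number of trajectories
print(scale * loss_per_step(batch))    # per-step mean, rescaled
```

Within a single batch the factor is a fixed positive number, so it can be absorbed into the learning rate as described above. If episode lengths vary, the average trajectory length can change from one batch to the next, so the factor is not necessarily identical across epochs; it only rescales the step size and does not change the update direction.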
