Trust Region Policy Optimization (TRPO)

In TRPO, a “surrogate” objective function is maximized subject to a constraint on the size of the policy update:

$latex \max_{\theta} \hat{\mathbb{E}}_t \left[\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t \right]$

subject to the constraint

$latex \hat{\mathbb{E}}_t\left[ \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_{\theta}(\cdot|s_t)\right]\right] \le \delta.$

Here the ratio compares the new policy $latex \pi_{\theta}$ to the old policy $latex \pi_{\theta_{old}}$ on the sampled actions, $latex \hat{A}_t$ is an estimator of the advantage at timestep $latex t$, and $latex \delta$ bounds the average KL divergence between the old and new policies. This constrained problem can be approximately solved efficiently with the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.
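To make the two quantities concrete, here is a minimal numpy sketch that evaluates the surrogate objective and the mean KL constraint for toy categorical policies. All names (`logits_old`, `delta`, the sampled batch, etc.) are illustrative assumptions, not part of any particular TRPO implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_actions = 5, 3  # illustrative batch of T timesteps, 3 discrete actions

# Hypothetical old and new policy logits at the sampled states
logits_old = rng.normal(size=(T, n_actions))
logits_new = logits_old + 0.1 * rng.normal(size=(T, n_actions))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

pi_old = softmax(logits_old)  # pi_old(.|s_t), shape (T, n_actions)
pi_new = softmax(logits_new)  # pi_theta(.|s_t)

actions = rng.integers(n_actions, size=T)  # actions sampled under pi_old
advantages = rng.normal(size=T)            # advantage estimates A_hat_t

# Surrogate objective: mean over t of ratio_t * A_hat_t
ratio = pi_new[np.arange(T), actions] / pi_old[np.arange(T), actions]
surrogate = np.mean(ratio * advantages)

# Constraint: mean over t of KL(pi_old(.|s_t) || pi_theta(.|s_t)) <= delta
kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=-1))

delta = 0.01  # illustrative trust-region size
print(f"surrogate={surrogate:.4f}  mean_kl={kl:.5f}  feasible={kl <= delta}")
```

In a real implementation these quantities are differentiated with respect to $latex \theta$ and the constrained step is computed from their gradients; this sketch only shows what is being maximized and what is being bounded.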