Optimistic Policy Optimization via Multiple Importance Sampling

Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli

Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Acceptance rate: 773/3424 (22.6%)

Abstract
Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi-Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure by leveraging Multiple Importance Sampling (MIS) to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
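To illustrate the off-policy estimation idea from the abstract, the sketch below shows a Multiple Importance Sampling estimator with the balance heuristic in a deliberately simplified setting: the "policy" is a 1-D Gaussian over a single action, and the "return" is a fixed function of that action. Samples collected under several behavioural parameters are reweighted to estimate the expected return of a target parameter. All names (`mis_estimate`, the quadratic reward) are illustrative choices, not the paper's implementation.

```python
import math
import random


def gauss_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def mis_estimate(target_mu, behav_mus, n_per, sigma=1.0, seed=0):
    """Balance-heuristic MIS estimate of J(target_mu) = E_{a ~ N(target_mu, sigma^2)}[r(a)],
    using samples drawn under each behavioural mean in behav_mus.

    The balance heuristic weights each sample by
        w(a) = p_target(a) / sum_k (n_k / N) * p_k(a),
    i.e. the target density over the mixture of all behavioural densities.
    """
    reward = lambda a: -(a - 1.0) ** 2  # toy return; true J(mu) = -((mu - 1)^2 + sigma^2)
    rng = random.Random(seed)
    n_total = n_per * len(behav_mus)
    total = 0.0
    for mu_k in behav_mus:
        for _ in range(n_per):
            a = rng.gauss(mu_k, sigma)
            # Mixture denominator: all behavioural policies share the samples.
            mix = sum((n_per / n_total) * gauss_pdf(a, mu_j, sigma) for mu_j in behav_mus)
            w = gauss_pdf(a, target_mu, sigma) / mix
            total += w * reward(a)
    return total / n_total
```

Because every sample is weighted against the mixture of all behavioural densities rather than its own, the balance heuristic keeps the importance weights bounded whenever at least one behavioural policy covers the target well, which is what makes the off-policy return estimates (and hence the optimistic index) well-behaved. For example, `mis_estimate(1.0, [0.5, 1.5], 5000)` should land near the true value $-1$.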

[Paper] [Poster] [Slides] [Code] [BibTeX]

@inproceedings{papini2019optimistic,
author = "Papini, Matteo and Metelli, Alberto Maria and Lupo, Lorenzo and Restelli, Marcello",
editor = "Chaudhuri, Kamalika and Salakhutdinov, Ruslan",
title = "Optimistic Policy Optimization via Multiple Importance Sampling",
booktitle = "Proceedings of the 36th International Conference on Machine Learning, {ICML} 2019, 9-15 June 2019, Long Beach, California, {USA}",
series = "Proceedings of Machine Learning Research",
volume = "97",
pages = "4989--4999",
publisher = "{PMLR}",
year = "2019",
url = "http://proceedings.mlr.press/v97/papini19a.html"
}