Optimistic Policy Optimization via Multiple Importance Sampling

Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli

Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Acceptance rate: 773/3424 (22.6%)

Abstract
Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi-Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure by leveraging Multiple Importance Sampling (MIS) to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
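To illustrate the off-policy estimation idea from the abstract, the sketch below shows a Multiple Importance Sampling estimator with the balance heuristic in a deliberately simplified setting: the "policy" is a 1-D Gaussian over a single action, and the "return" is a fixed function of that action. Samples collected under several behavioural parameters are reweighted to estimate the expected return of a target parameter. All names (`mis_estimate`, the quadratic reward) are illustrative choices, not the paper's implementation.

```python
import math
import random


def gauss_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def mis_estimate(target_mu, behav_mus, n_per, sigma=1.0, seed=0):
    """Balance-heuristic MIS estimate of J(target_mu) = E_{a ~ N(target_mu, sigma^2)}[r(a)],
    using samples drawn under each behavioural mean in behav_mus.

    The balance heuristic weights each sample by
        w(a) = p_target(a) / sum_k (n_k / N) * p_k(a),
    i.e. the target density over the mixture of all behavioural densities.
    """
    reward = lambda a: -(a - 1.0) ** 2  # toy return; true J(mu) = -((mu - 1)^2 + sigma^2)
    rng = random.Random(seed)
    n_total = n_per * len(behav_mus)
    total = 0.0
    for mu_k in behav_mus:
        for _ in range(n_per):
            a = rng.gauss(mu_k, sigma)
            # Mixture denominator: all behavioural policies share the samples.
            mix = sum((n_per / n_total) * gauss_pdf(a, mu_j, sigma) for mu_j in behav_mus)
            w = gauss_pdf(a, target_mu, sigma) / mix
            total += w * reward(a)
    return total / n_total
```

Because every sample is weighted against the mixture of all behavioural densities rather than its own, the balance heuristic keeps the importance weights bounded whenever at least one behavioural policy covers the target well, which is what makes the off-policy return estimates (and hence the optimistic index) well-behaved. For example, `mis_estimate(1.0, [0.5, 1.5], 5000)` should land near the true value $-1$.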

[Paper] [Poster] [Slides] [Code] [BibTeX]

@inproceedings{papini2019optimistic,
author = "Papini, Matteo and Metelli, Alberto Maria and Lupo, Lorenzo and Restelli, Marcello",
editor = "Chaudhuri, Kamalika and Salakhutdinov, Ruslan",
title = "Optimistic Policy Optimization via Multiple Importance Sampling",
booktitle = "Proceedings of the 36th International Conference on Machine Learning, {ICML} 2019, 9-15 June 2019, Long Beach, California, {USA}",
series = "Proceedings of Machine Learning Research",
volume = "97",
pages = "4989--4999",
publisher = "{PMLR}",
year = "2019",
url = "http://proceedings.mlr.press/v97/papini19a.html"
}