Online Learning in Non-Cooperative Configurable Markov Decision Process

Giorgia Ramponi, Alberto Maria Metelli, Alessandro Concetti, and Marcello Restelli

AAAI-21 Workshop on Reinforcement Learning in Games, 2021.

In Configurable Markov Decision Processes, two entities interact: a Reinforcement Learning agent and a configurator, which can modify some parameters of the environment to improve the agent's performance. What if the configurator does not have the same intentions as the agent? In this paper, we introduce the Non-Cooperative Configurable Markov Decision Process, a framework that allows two (possibly different) reward functions, one for the configurator and one for the agent. In this setting, we consider an online learning problem in which the configurator has to find the best among a finite set of possible configurations. We propose a learning algorithm that minimizes the configurator's expected regret by exploiting the structure of the problem. While a naïve application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically evaluate the performance of our algorithm in simulated domains.
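The paper's own algorithm is not reproduced here, but as a point of reference for the baseline it improves upon, the following is a minimal sketch of UCB1 applied to the configurator's problem of choosing among a finite set of configurations (the arms). The reward functions and their means below are purely illustrative assumptions, not taken from the paper.

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """Run UCB1 over a finite set of arms (here: environment configurations).

    reward_fns: list of callables, each taking an RNG and returning a
    stochastic reward in [0, 1]. Returns per-arm pull counts after
    `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k     # number of times each configuration was tried
    means = [0.0] * k    # empirical mean reward of each configuration
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise its estimate
        else:
            # UCB1 index: empirical mean plus an exploration bonus that
            # shrinks as an arm accumulates pulls
            arm = max(range(k),
                      key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return counts

# Two hypothetical configurations with Bernoulli rewards of mean 0.8 and 0.3.
pulls = ucb1([lambda rng: float(rng.random() < 0.8),
              lambda rng: float(rng.random() < 0.3)], horizon=2000)
```

As the abstract notes, such a vanilla bandit treatment ignores the structure of the underlying Configurable MDP, which is why its regret keeps growing over time, whereas the proposed structure-aware algorithm achieves bounded regret.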

[Link] [BibTeX]

@article{ramponi2021online,
    author = "Ramponi, Giorgia and Metelli, Alberto Maria and Concetti, Alessandro and Restelli, Marcello",
    title = "Online Learning in Non-Cooperative Configurable Markov Decision Process",
    journal = "AAAI-21 Workshop on Reinforcement Learning in Games",
    year = "2021",
    url = "_paper_7.pdf"
}