# Compatible reward inverse reinforcement learning

Alberto Maria Metelli

Politecnico di Milano, 2017.

Abstract
Sequential decision-making problems arise in many areas of Artificial Intelligence. Reinforcement Learning provides a number of algorithms that learn an optimal behavior by interacting with the environment. The main assumption is that the learning agent receives a reward as soon as an action is performed. However, there are several application domains in which a reward function is not available and is difficult to estimate, whereas samples of expert agents playing an optimal policy are simple to generate. Inverse Reinforcement Learning (IRL) is an effective approach to recover a reward function that explains the behavior of an expert by observing a set of demonstrations. Most classic IRL methods require, in addition to the expert's demonstrations, sampling the environment in order to compute the optimal policy for each candidate reward function. Furthermore, in most cases it is necessary to specify a priori a set of engineered features that the algorithms combine to single out the reward function. This thesis presents a novel model-free IRL approach that, unlike most existing IRL algorithms, does not require specifying a function space in which to search for the expert's reward function. Leveraging the fact that the policy gradient must be zero for any optimal policy, the algorithm generates a set of basis functions that span the subspace of reward functions that make the policy gradient vanish. Within this subspace, using a second-order criterion, we search for the reward function that most penalizes deviations from the expert's policy. After introducing our approach for finite domains, we extend it to continuous ones. The proposed approach is compared with state-of-the-art IRL methods in the (finite) Taxi domain and in the (continuous) Linear Quadratic Gaussian Regulator and Car on the Hill environments. The empirical results show that the reward function recovered by our algorithm allows learning policies that, in terms of learning speed, outperform both behavioral cloning and the policies obtained with the true reward function.
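The vanishing-gradient condition can be illustrated with a small numerical sketch. It is not the thesis implementation. It assumes a finite domain in which the expert's state-action occupancy and the score matrix (the gradient of the log-policy for each state-action pair) are known, and it uses the standard policy gradient form, a sum over state-action pairs of occupancy times score times action value. The names `gradient_null_space`, `score`, and `occupancy` are hypothetical, and the data are random placeholders. The sketch returns action-value vectors that zero the gradient; turning them into reward functions and applying the second-order criterion are not shown.

```python
import numpy as np


def gradient_null_space(score, occupancy, tol=1e-10):
    """Orthonormal basis of vectors q with score.T @ diag(occupancy) @ q = 0.

    score:     (n_pairs, n_params) gradient of log-policy per state-action pair
    occupancy: (n_pairs,) state-action occupancy under the expert's policy
    Both inputs are illustrative assumptions, not quantities from the thesis.
    """
    # Scale column j of score.T by occupancy[j]; result has shape (n_params, n_pairs).
    weighted = score.T * occupancy
    _, sing_vals, vt = np.linalg.svd(weighted, full_matrices=True)
    # Numerical rank: count singular values above a relative tolerance.
    rank = int(np.sum(sing_vals > tol * max(sing_vals.max(), 1.0)))
    # The remaining right singular vectors span the null space.
    return vt[rank:].T


# Random placeholder data: 6 state-action pairs, 2 policy parameters.
rng = np.random.default_rng(0)
n_pairs, n_params = 6, 2
score = rng.normal(size=(n_pairs, n_params))
occupancy = rng.random(n_pairs)
occupancy /= occupancy.sum()

basis = gradient_null_space(score, occupancy)
# Policy gradient induced by each basis vector, one column per vector.
grad = score.T @ (occupancy[:, None] * basis)
print(basis.shape, np.abs(grad).max())
```

Every column of `basis` is an action-value vector under which the policy gradient is numerically zero, so any linear combination of the columns also satisfies the first-order condition.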
