METHODOLOGY AND EXPERIMENTS
The Math:
The problem of recovering the original reward function from expert demonstrations is known as inverse reinforcement learning (IRL). Following Finn et al. (2016), we formulate this problem as a special case of the Generative Adversarial Network (GAN) framework (Goodfellow et al., 2014), in which a generator network is trained to maximize the probability of fooling a discriminator network that attempts to classify its inputs as coming from the real data or from the generator. In this parameterization, we have:
Dθ(τ) = exp{Rθ(τ)} / (exp{Rθ(τ)} + π(τ)),
where Rθ(τ) is the learned reward of trajectory τ and π(τ) is the probability of τ under the current policy.
This allows for an intuitive interpretation: we learn a reward function that, when a policy is trained on it, produces trajectories indistinguishable from the experts’. Fu et al. (2018) then show theoretical equivalence and empirical improvement from training the discriminator and policy on (s, a, s′) tuples rather than on entire trajectories τ. They also present results showing that reparameterizing the reward as Rω(s, a, s′) := rφ(s) + γVψ(s′) − Vψ(s) empirically allows recovery of an action-independent reward, enabling transfer under changes in dynamics, as mentioned in the “Motivations” section.
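To make the reparameterization concrete, the sketch below shows how the shaped reward and the corresponding AIRL-style discriminator logit can be computed. It is written in PyTorch; the network sizes and the names (MLP, shaped_reward, discriminator_logit) are illustrative assumptions rather than the exact implementation used here.

import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected network, reused for both rφ(s) and Vψ(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def shaped_reward(r_phi, v_psi, s, s_next, gamma=0.99):
    """Rω(s, a, s') = rφ(s) + γ Vψ(s') − Vψ(s).
    The result depends only on states, which is what permits transfer of
    the recovered reward under changed dynamics."""
    return r_phi(s) + gamma * v_psi(s_next) - v_psi(s)

def discriminator_logit(r_phi, v_psi, log_pi, s, s_next):
    """AIRL-style discriminator D = exp(R) / (exp(R) + π(a|s)),
    expressed as a logit: log D − log(1 − D) = R − log π(a|s)."""
    return shaped_reward(r_phi, v_psi, s, s_next) - log_pi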
We combine this work with that of Hausman et al. (2017) to enable the decomposition of expert demonstrations into subtasks. This is accomplished by allowing the policy to accept an “intention” i as an input. We further extend their work by introducing a controller network cη(i|s), which chooses an intention given the current state. The generator loss function then becomes

where pI is a neural network that attempts to infer the intention i from an (s, a) pair and is trained with a cross-entropy loss. λI and λC are hyperparameters: the λI term encourages the policy to produce distinguishable actions for each intention, and the λC term encourages the controller network to use each intention an equal amount.
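One plausible way to assemble these terms into a single scalar objective is sketched below. The exact functional forms, in particular the entropy bonus used for the λC term, are assumptions made for illustration, and in practice the adversarial term is typically optimized with a policy-gradient estimator rather than by backpropagating through the environment.

import torch
import torch.nn.functional as F

def generator_loss(disc_logits, intention_logits, intentions,
                   controller_probs, lambda_i=1.0, lambda_c=0.1):
    """Illustrative generator objective (to be minimized).

    disc_logits:      discriminator logits on generated (s, a) samples
    intention_logits: outputs of the inference network pI(i | s, a)
    intentions:       the intention indices actually fed to the policy
    controller_probs: cη(i | s) for the sampled states, shape [batch, n_intentions]
    """
    # Adversarial term: try to make generated samples look expert-like.
    adv = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))

    # λI term: intentions should be recoverable from (s, a),
    # i.e. each intention should produce distinguishable behaviour.
    intent = F.cross_entropy(intention_logits, intentions)

    # λC term: encourage the controller to use every intention roughly
    # equally by maximizing the entropy of its average output.
    marginal = controller_probs.mean(dim=0)
    usage = (marginal * (marginal + 1e-8).log()).sum()  # negative entropy

    return adv + lambda_i * intent + lambda_c * usage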
The Task:
The physical task was to have a Turtlebot, equipped with a velocity controller and augmented with two parallel bars, transport a box to a target location.
In order to train our network, we developed an OpenAI Gym environment to simulate the physics of the scenario. The action space consisted of the Turtlebot’s linear and angular velocities, while the observation space consisted of the coordinates of the Turtlebot, the box, and the target.
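A skeleton of this environment is shown below; the class name, velocity bounds, and state layout are illustrative placeholders rather than the exact values used in the simulator.

import numpy as np
import gym
from gym import spaces

class TurtlebotBoxPushEnv(gym.Env):
    """Simplified 2-D simulation of the Turtlebot box-pushing task (skeleton)."""

    def __init__(self):
        # Action: the Turtlebot's linear and angular velocity commands.
        self.action_space = spaces.Box(
            low=np.array([-0.5, -1.0], dtype=np.float32),
            high=np.array([0.5, 1.0], dtype=np.float32))
        # Observation: (x, y) coordinates of the Turtlebot, the box, and the target.
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32)

    def reset(self):
        self.state = np.zeros(6, dtype=np.float32)  # placeholder initial configuration
        return self.state

    def step(self, action):
        # The real simulator integrates the velocity command and the contact
        # dynamics between the bars and the box; this skeleton only fixes the API.
        obs = self.state
        reward = 0.0  # the learned discriminator supplies the training reward
        return obs, reward, False, {}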
We collected expert demonstrations with four distinct subtasks in mind:
- Moving to a location such that the flat face of the box will be parallel to the rulers
- Turning from this point to face the box
- Moving forward until the Turtlebot touches the box
- Transporting the box to the target location
These subtasks are illustrated in the diagram to the right, which is taken from our simulator:

