
MOTIVATION AND RELATED THEORY

Abstract

Many robotics tasks in industrial or transportation settings are modular and share similar subgoals. For example, traveling from point A to point B might involve turning left, performing a U-turn, and going straight, while traveling from point C to point D might involve going straight, turning right, going straight, and turning left. In these situations, rather than learning a separate optimal policy for each supertask, it is more efficient to learn policies for the component subtasks and then combine them as needed. For a particular supertask, we can therefore extract useful training information from all expert demonstrations that contain similar subtasks, even if those demonstrations were collected for a different supertask. In addition, we can create new supertasks, with no new training required, simply by combining existing subtasks in new ways. This is accomplished with a discriminator network that attempts both to minimize the difference between the expert policy and the learned policy given a certain intention (where the term intention is interchangeable with subtask) and to make the intentions as distinguishable from one another as possible. This result is useful because the IntentionGAN system has never before been implemented in a physical environment.


MOTIVATION

Why not classical reinforcement learning?

In reinforcement learning, we consider an agent acting in an environment typically modeled as a Markov decision process (MDP) defined by the tuple (S, A, T, R, γ, ρ), where S and A are the state and action spaces, respectively, and γ is a discount factor. The transition probabilities (or dynamics) T = p(s'|s, a), reward function R(s, a), and initial state distribution ρ are unknown to the agent and must be queried by interacting with the environment. The agent then acts according to a learned policy π_θ(a|s), which tries to maximize the expected return E_{τ∼π_θ}[Σ_{t=1}^{T} γ^t R(s_t, a_t)], where τ := (s_0, a_0, s_1, a_1, ..., s_T, a_T). A number of algorithms, including REINFORCE (Williams, 1992) and PPO (Schulman et al., 2017), have been developed over the years to learn this policy effectively.

However, this approach has several problems. Unless the reward is carefully specified, the policy can be prone to "reward hacking," where it maximizes reward without accomplishing the intended task. The classic example is a vacuuming robot that receives reward for collecting dirt and "hacks" this reward by repeatedly picking up dirt and dumping it back out. This can be remedied by specifying a binary reward of 1 for task completion and 0 otherwise; however, such an extremely sparse reward means the agent does not know it is making progress unless it actually completes the task, which may be very unlikely if an extended series of precise actions is needed. Additionally, the binary reward itself may be difficult to compute for complex tasks, such as tying a knot or folding clothes.
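To make the expected-return objective and the policy-gradient idea concrete, here is a minimal sketch of REINFORCE on a tabular softmax policy. The toy two-state MDP, learning rate, horizon, and episode count below are illustrative assumptions, not part of our actual setup.

import numpy as np

# Minimal REINFORCE sketch on a made-up two-state, two-action MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, horizon = 2, 2, 0.99, 20

# T[s, a] is the next-state distribution p(s'|s, a); R[s, a] is the reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])

theta = np.zeros((n_states, n_actions))  # policy parameters

def policy(s):
    # Softmax policy pi_theta(a|s).
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for episode in range(2000):
    s, traj = 0, []
    for t in range(horizon):
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=T[s, a])

    # Discounted return-to-go G_t = sum_{k >= t} gamma^(k - t) * r_k.
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # REINFORCE update: theta += lr * G_t * grad log pi(a_t|s_t).
    lr = 0.01
    for (s_t, a_t, _), G_t in zip(traj, returns):
        grad_log = -policy(s_t)
        grad_log[a_t] += 1.0  # gradient of log-softmax w.r.t. the logits
        theta[s_t] += lr * G_t * grad_log

print("Learned action probabilities per state:",
      [policy(s).round(3) for s in range(n_states)])

Note that even this simple sketch needs a hand-specified reward table R; the difficulty of writing down such a reward for real robot tasks is exactly the motivation for the imitation-learning approach below.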

Imitation Learning

For these reasons, reinforcement learning has had limited success on real-world tasks. To remedy this, we instead specify the task via expert demonstrations and attempt to imitate the expert's behavior. However, we differ from standard behavior cloning techniques, such as DAgger (Ross et al., 2011), in two key ways. First, we attempt to recover the reward function for the original task. This provides an advantage over standard behavior cloning in that we can recover the intended expert behavior even if the dynamics T change or our agent is unable to take the same actions as the expert (for example, if the expert's linkages have different lengths than the agent's); in both of these cases, simply training π_θ(a|s) to match (s, a) ∼ D_expert is infeasible. Second, we attempt to decompose the expert trajectory into n modular subtasks, which are collectively capable of matching the expert's behavior yet are distinguishable from each other. This then allows us to solve tasks composed of any combination of these subtasks.
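As a rough sketch of the two networks this setup implies, the snippet below (PyTorch assumed; the state/action dimensions, hidden sizes, and number of intentions are placeholder values, not those of our robot) defines a GAIL-style discriminator that separates expert from policy transitions, plus an intention classifier that keeps the n subtasks distinguishable. It illustrates the idea rather than our actual implementation.

import torch
import torch.nn as nn

# Placeholder dimensions -- purely illustrative.
STATE_DIM, ACTION_DIM, N_INTENTIONS, HIDDEN = 8, 2, 4, 64

class Discriminator(nn.Module):
    # D(s, a): logit that a state-action pair came from the expert.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class IntentionClassifier(nn.Module):
    # q(i | s, a): logits over which intention (subtask) generated the pair.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, N_INTENTIONS))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

disc, clf = Discriminator(), IntentionClassifier()
bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()

def discriminator_loss(expert_s, expert_a, policy_s, policy_a):
    # Expert transitions labeled 1, policy transitions labeled 0.
    expert_logits = disc(expert_s, expert_a)
    policy_logits = disc(policy_s, policy_a)
    return (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))

def intention_loss(policy_s, policy_a, intention_labels):
    # Pairs generated under intention i should be identifiable as i,
    # which is what keeps the learned subtasks distinguishable.
    return ce(clf(policy_s, policy_a), intention_labels)

In training, the policy would be updated to fool the discriminator (matching the expert) while keeping the intention classifier's predictions accurate (keeping subtasks distinct); the policy-update step itself is omitted here.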

This video clearly demonstrates reward hacking: though the game intends for the player to follow the path of yellow dots, the player gains more points by crashing into the green trees, ultimately getting stuck there.
