Reinforcement learning

Reinforcement learning (German: bestärkendes Lernen or verstärkendes Lernen) is the umbrella term for a number of machine learning methods in which an agent determines the utility of action sequences in a world. For this purpose, reinforcement learning uses the theory of Markov decision problems (engl. Markov Decision Processes, MDPs). Concretely, the idea is to distribute the rewards given to an agent over its preceding actions, so that the agent learns the utility of each action and can exploit it.

Introduction

Consider a dynamic system - consisting of an agent and its environment (the world) - evolving in discrete time steps t = 0, 1, 2, …. At each time t, the world is in a state s_t and the agent chooses an action a_t. The system then transitions to the state s_{t+1} and the agent receives a reward r_{t+1}.
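This agent-environment loop can be sketched in a few lines of Python. The two-state toy world, its transition rule, and the random action choice below are illustrative assumptions, not part of the text above.

```python
import random

def step(state, action):
    """Hypothetical transition function: returns (next_state, reward)."""
    next_state = (state + action) % 2      # toy dynamics on states {0, 1}
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

state = 0
rewards = []
for t in range(5):                          # discrete time steps t = 0..4
    action = random.choice([0, 1])          # the agent chooses an action
    state, reward = step(state, action)     # the world changes state...
    rewards.append(reward)                  # ...and emits a reward
```

The same loop structure underlies every reinforcement-learning setting; only the world's dynamics and the agent's action-selection rule change.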

Expected profit

The objective is to maximize the expected return (engl. expected return)

    R_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1},

that is, the expected total reward. Here 0 ≤ γ ≤ 1 is the discount factor (German: Diskontierungsfaktor). In episodic problems, where the world reaches a terminal state after a finite number of steps (such as a game of chess), the discount factor γ = 1 can be used; in this case every reward counts equally. For continuing problems one must choose γ < 1 so that the infinite series converges. For γ = 0, only the current reward counts; all future rewards are ignored. As γ approaches 1, the agent becomes farsighted.
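The effect of the discount factor can be checked numerically. This is a minimal sketch for a finite (hypothetical) reward sequence:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * rewards[k] over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0]   # hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}

# gamma = 1: every reward counts equally (episodic case)
print(discounted_return(rewards, 1.0))   # 3.0

# gamma = 0: only the immediate reward counts (myopic agent)
print(discounted_return(rewards, 0.0))   # 1.0

# 0 < gamma < 1: later rewards are weighted down
print(discounted_return(rewards, 0.5))   # 1.0 + 0.0 + 0.25*2.0 = 1.5
```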

Strategies

In reinforcement learning, the agent follows a strategy (engl. policy). Typically, the strategy is regarded as a function π: S → A that assigns an action to each state. However, non-deterministic strategies (or mixed strategies) are also possible, in which an action is selected with a certain probability. In general, a strategy is therefore defined as a conditional probability distribution π(a | s).
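Both kinds of strategy can be written down directly. The state and action names below are made-up examples; a deterministic policy is a plain state-to-action mapping, while a stochastic policy stores a distribution π(a | s) per state:

```python
import random

# Deterministic policy: a function (here a dict) from states to actions.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: a conditional distribution pi(a | s) per state.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def act(policy, state):
    """Sample an action a ~ pi(. | state) from a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

action = act(stochastic_policy, "s0")
```

A deterministic policy is the special case where π(a | s) puts probability 1 on a single action.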

Markov decision process

Reinforcement learning is often construed as a Markov decision process (engl. Markov Decision Process). Characteristic is the assumption that the Markov property is satisfied: the next state depends only on the current state and the chosen action, not on the earlier history:

    P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)

Key terms of a Markov decision process are the action model (or transition probability) P(s_{t+1} | s_t, a_t) and the expected reward in the next time step (engl. expected reward). The action model is the conditional probability of transitioning into the state s_{t+1}, given that the agent was in state s_t and selected the action a_t. In the deterministic case, the action model is simply a function that maps a state-action pair to a new state. The expected reward is defined as

    r(s, a) = E[r_{t+1} | s_t = s, a_t = a].
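A tiny concrete MDP makes these two ingredients tangible. The states, actions, probabilities, and rewards here are invented for illustration only:

```python
# Action model: P[s][a] maps each possible next state to its probability.
P = {
    "s0": {"go": {"s0": 0.3, "s1": 0.7}},
    "s1": {"go": {"s1": 1.0}},
}

# Hypothetical reward received on entering each next state.
R = {"s0": 0.0, "s1": 1.0}

def expected_reward(s, a):
    """E[r_{t+1} | s_t = s, a_t = a]: average reward under the action model."""
    return sum(p * R[s_next] for s_next, p in P[s][a].items())

print(expected_reward("s0", "go"))   # 0.3*0.0 + 0.7*1.0 = 0.7
```

In the deterministic case each inner distribution in `P` would put probability 1 on a single next state, recovering the function view described above.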

Approximation

In infinite state spaces, the value function has to be approximated, e.g. with neural networks or Gaussian processes.
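The simplest form of such an approximation is a linear model over hand-chosen features, V(s) ≈ w · φ(s). The feature map and weights below are illustrative assumptions; in practice the text's suggestions (neural networks, Gaussian processes) replace this linear sketch:

```python
def phi(state):
    """Hypothetical feature vector for a continuous state s."""
    return [1.0, state, state * state]

def v_approx(weights, state):
    """Linear value estimate V(s) = sum_i w_i * phi_i(s)."""
    return sum(w * f for w, f in zip(weights, phi(state)))

# Example weights, as they might result from a gradient-based learner.
w = [0.5, -0.1, 0.01]
value = v_approx(w, 3.0)   # 0.5 - 0.1*3 + 0.01*9 ≈ 0.29
```

The point of the approximation is that three weights summarize the value of uncountably many states, at the price that V can only be represented up to the expressiveness of the features.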

Simultaneous learning of multiple agents

If more than one agent is to learn, the convergence of the learning processes can (so far) no longer be guaranteed except in trivial cases, even for cooperative agents. Nevertheless, with the help of heuristics, behavior that is useful in practice can often be learned, since the worst case rarely occurs.
