Policies and Value Functions
A policy \(\pi(a | s)\) is the probability of choosing the action \(a\) when in state \(s\). We can thus formulate our learning task as finding the policy that maximizes the expected reward, and reinforcement learning as the adaptation of an agent's policy as a result of its experience. For a given policy, we can define the value function of a state \(s\) as the expected return when starting in that state and using the policy \(\pi\) to choose all our future actions. We can write this as
\[
v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \,\middle|\, S_t = s \right] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]. \tag{51}
\]
Alternatively, we can define the action-value function of \(\pi\) as
\[
q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \,\middle|\, S_t = s, A_t = a \right].
\]
This is the expected return when starting in state \(s\) and choosing action \(a\), but then using the policy \(\pi\) for all future actions. Note that one of the key ideas of reinforcement learning is to use such value functions, instead of the policy itself, to organize our learning process.
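To make the expectation concrete, the following minimal sketch estimates \(v_\pi(s)\) by sampling: it averages the discounted return over many rollouts that start in \(s\) and follow \(\pi\). The environment interface (`env.reset_to`, `env.step`) and the policy representation are hypothetical placeholders, not part of the toy model used in this chapter.

```python
import numpy as np

def estimate_state_value(env, policy, state, gamma=0.9, n_episodes=500, horizon=100):
    """Monte-Carlo estimate of v_pi(state): average of the discounted return
    over many rollouts that start in `state` and follow `policy`.

    Hypothetical interfaces: env.reset_to(s) puts the environment in state s
    and returns it, env.step(a) returns (next_state, reward, done), and
    policy[s] is a dict mapping each action to its probability pi(a|s)."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset_to(state)
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            actions, probs = zip(*policy[s].items())
            a = np.random.choice(actions, p=probs)  # sample a ~ pi(.|s)
            s, r, done = env.step(a)
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)
```

Estimating \(q_\pi(s, a)\) proceeds in the same way, except that the first action is fixed to \(a\) before following \(\pi\).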
The value function of Eq. (51) satisfies a self-consistency equation,
\[
v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma\, v_\pi(s') \right].
\]
This equation, known as the Bellman equation, relates the value of state \(s\) to the expected reward and the (discounted) value of the next state after having chosen an action under the policy \(\pi(a|s)\).
As an example, we can write the Bellman equation for the strategy of always leaving the lamps on in our toy model. Then, we find the system of linear equations
from which we can easily solve for \(v_{\rm on}({\rm h})\) and \(v_{\rm on}({\rm l})\).
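Numerically, evaluating a fixed policy amounts to solving a small linear system. The sketch below does this for a generic two-state example in matrix form; the transition probabilities and rewards are hypothetical placeholders, not the actual parameters of the toy model.

```python
import numpy as np

# Hypothetical placeholder dynamics for a two-state example under a fixed
# policy ("always on"): P[s, s'] is the transition probability and R[s] the
# expected reward from state s.  The actual numbers would follow from the
# toy model's definition and are not reproduced here.
states = ["h", "l"]               # high and low charge
P = np.array([[0.7, 0.3],         # from h: stay in h / drop to l
              [0.0, 1.0]])        # from l: remain in l
R = np.array([1.0, -1.0])         # expected reward in h and in l
gamma = 0.9                       # discount factor

# Bellman equation for a fixed policy in vector form: v = R + gamma * P v,
# i.e. (I - gamma * P) v = R, a linear system we can solve directly.
v_on = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
print(dict(zip(states, v_on)))    # {'h': v_on(h), 'l': v_on(l)}
```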
Instead of calculating the value function for all possible policies, we can directly try to find the optimal policy \(\pi_*\), for which \(v_{\pi_*}(s) \geq v_{\pi'}(s)\) for all policies \(\pi'\) and all \(s\in\mathcal{S}\). For this policy, we find the Bellman optimality equations
\[
v_*(s) = \max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma\, v_*(s') \right]. \tag{53}
\]
Importantly, the Bellman optimality equations do not depend on the actual policy anymore. As such, Eq. (53) defines a non-linear system of equations, which for a sufficiently simple MDP can be solved explicitly. For our toy example, the two equations for the value functions are
and
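Instead of solving these coupled non-linear equations by hand, one can also iterate the right-hand side of Eq. (53) until it converges, a procedure known as value iteration. The sketch below does this for a generic two-state, two-action example; the dynamics and rewards are again hypothetical placeholders, not the toy model's actual values.

```python
import numpy as np

# Hypothetical dynamics: for each action a, P[a][s, s'] is the transition
# probability and R[a][s] the expected reward in state s.
actions = ["on", "off"]
P = {"on":  np.array([[0.7, 0.3], [0.0, 1.0]]),
     "off": np.array([[1.0, 0.0], [0.6, 0.4]])}
R = {"on":  np.array([1.0, -1.0]),
     "off": np.array([0.0,  0.5])}
gamma = 0.9

# Value iteration: repeatedly apply the right-hand side of the Bellman
# optimality equation, v(s) <- max_a [R_a(s) + gamma * sum_s' P_a(s,s') v(s')],
# which converges to v_* for gamma < 1.
v = np.zeros(2)
for _ in range(1000):
    v_new = np.max([R[a] + gamma * P[a] @ v for a in actions], axis=0)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)   # numerical approximations to v_*(h) and v_*(l)
```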
Note that equations equivalent to Eqs. (53) hold for the state-action value function,
\[
q_*(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]. \tag{54}
\]
Once we know \(v_*\), the optimal policy \(\pi_* (a| s)\) is the greedy policy that chooses the action \(a\) that maximizes the right-hand side of Eq. (53). If, instead, we know \(q_*(s,a)\), then we can directly choose the action which maximizes \(q_*(s,a)\), namely \(\pi_*(a | s) = {\rm argmax}_{a'} q_*(s, a')\), without looking one step ahead.
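The following sketch contrasts the two cases: extracting the greedy action from \(v_*\) requires a one-step look-ahead through the model dynamics, whereas with \(q_*\) a plain argmax over the table suffices. All numbers are hypothetical placeholders.

```python
import numpy as np

# Hypothetical two-state / two-action example.
P = {"on":  np.array([[0.7, 0.3], [0.0, 1.0]]),   # P_a(s, s')
     "off": np.array([[1.0, 0.0], [0.6, 0.4]])}
R = {"on":  np.array([1.0, -1.0]),                # expected reward R_a(s)
     "off": np.array([0.0,  0.5])}
gamma = 0.9
v_star = np.array([6.0, 3.0])                     # pretend we solved Eq. (53)
q_star = np.array([[6.0, 5.1],                    # pretend we solved Eq. (54):
                   [2.4, 3.0]])                   # rows are states, columns actions

# From v_*: a one-step look-ahead through the dynamics is needed to find
# the action that maximizes the right-hand side of Eq. (53).
def greedy_from_v(s):
    return max(P, key=lambda a: R[a][s] + gamma * P[a][s] @ v_star)

# From q_*: the optimal action is a plain argmax over the tabulated values.
def greedy_from_q(s):
    return ["on", "off"][int(np.argmax(q_star[s]))]

print(greedy_from_v(0), greedy_from_q(0))
```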
While Eqs. (53) and (54) can be solved explicitly for a sufficiently simple system, such an approach, which amounts to an exhaustive search, is often not feasible. In the following, we distinguish two levels of complexity: First, if an explicit solution is too hard, but we can still keep track of the value function for every state (or every state-action pair), we can use a tabular approach. A main difficulty in this case is the evaluation of a policy, also called prediction, which is needed to improve on the policy. While various methods for policy evaluation and policy improvement exist, we will in the following discuss an approach called temporal-difference learning. Second, in many cases the space of possible states is much too large to allow for complete knowledge of all value functions. In this case, we additionally need to approximate the value functions. For this purpose, we can use the methods encountered in the previous chapters, such as (deep) neural networks.
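The distinction between the two levels can be summarized in a short sketch: in the tabular case the value function is literally a table with one entry per state (or state-action pair), while in the approximate case it is a parametrized function of the state. The feature map and dimensions below are hypothetical illustrations.

```python
import numpy as np

# Tabular approach: store one value per state (and per action for q).
# Feasible only when the state space is small enough to enumerate.
n_states, n_actions = 2, 2
v_table = np.zeros(n_states)                # v(s) for every state
q_table = np.zeros((n_states, n_actions))   # q(s, a) for every pair

# Function approximation: for large state spaces we instead parametrize
# the value function, here linearly in some feature vector phi(s); a
# (deep) neural network would simply replace this linear map.
def phi(state):
    return np.asarray(state, dtype=float)   # hypothetical feature map

w = np.zeros(3)                              # learnable parameters

def v_approx(state):
    return w @ phi(state)                    # v_w(s) = w . phi(s)
```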