Function Approximation

When the state-action space becomes very large, we face two problems: First, we cannot use tabular methods anymore, since we cannot store all values. Second, and more importantly, even if we could store all the values, it becomes increasingly unlikely that the above algorithms visit every state-action pair; in other words, most states will never be encountered during training. Ideally, we should thus identify states that are ‘similar’, assign them ‘similar’ values, and choose ‘similar’ actions when in these states. This grouping of similar states is exactly the kind of generalization we tried to achieve in the previous sections. Not surprisingly, reinforcement learning is most successful when combined with neural networks.

In particular, we can parametrize a value function \(\hat{v}_\pi(s; \theta)\) and try to find parameters \(\theta\) such that \(\hat{v}_\pi(s; \theta) \approx v_\pi(s)\). This approximation can be done using the supervised-learning methods encountered in the previous sections, where the target, or label, is given by the new estimate. Specifically, we can use the mean squared value error to formulate a gradient-descent update analogous to Eq. (57). Starting from a state \(S\), choosing an action \(A\) according to the policy \(\pi(a|S)\), and observing the reward \(R\) and the next state \(S'\), we update the parameters as

\[\theta^{[k+1]} = \theta^{[k]} + \alpha [R +\gamma \hat{v}_\pi(S'; \theta^{[k]}) - \hat{v}_\pi(S; \theta^{[k]}) ] \nabla \hat{v}_\pi (S;\theta^{[k]})\]

with \(0< \alpha < 1\) again the learning rate. Note that, even though the new estimate also depends on \(\theta^{[k]}\), we only take the derivative with respect to the old estimate. This method is thus referred to as a semi-gradient method. In a similar fashion, we can reformulate the Sarsa algorithm introduced for generalized policy iteration.
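
To make the update above concrete, the following is a minimal sketch of semi-gradient TD(0) value estimation with a linear parametrization \(\hat{v}(s;\theta) = \theta^\top x(s)\). The environment interface (`env.reset()`, `env.step(a)` in the Gym style), the sampling policy `policy(s)`, and the feature map `features(s)` are hypothetical placeholders, not part of the text above; for a linear model the gradient \(\nabla \hat{v}(S;\theta)\) is simply the feature vector \(x(S)\).

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.05, gamma=0.99, n_episodes=500):
    """Semi-gradient TD(0) estimation of v_pi with a linear model
    v_hat(s; theta) = theta @ features(s).

    `env`, `policy`, and `features` are assumed placeholders:
    env.reset()/env.step(a) follow the Gym convention, policy(s) samples
    an action from pi(a|s), and features(s) returns an n_features vector.
    """
    theta = np.zeros(n_features)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            x = features(state)
            x_next = features(next_state)

            # The TD target uses the current parameters, but the gradient is
            # taken only with respect to the old estimate v_hat(S; theta);
            # for a linear model that gradient is just the feature vector x.
            target = reward + (0.0 if done else gamma * theta @ x_next)
            td_error = target - theta @ x
            theta += alpha * td_error * x

            state = next_state
    return theta
```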
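
Along the same lines, a sketch of semi-gradient Sarsa replaces the state-value estimate by a parametrized action-value \(\hat{q}(s,a;\theta)\) and selects actions from it; an \(\epsilon\)-greedy policy and the same assumed environment interface and feature map are used here purely for illustration.

```python
import numpy as np

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        alpha=0.05, gamma=0.99, epsilon=0.1, n_episodes=500):
    """Semi-gradient Sarsa with a linear action-value model
    q_hat(s, a; theta) = theta[a] @ features(s).
    The env/features interface is an assumption for this sketch."""
    theta = np.zeros((n_actions, n_features))

    def epsilon_greedy(s):
        # Explore with probability epsilon, otherwise act greedily on q_hat.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(theta @ features(s)))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            x = features(state)
            if done:
                target = reward
            else:
                next_action = epsilon_greedy(next_state)
                target = reward + gamma * theta[next_action] @ features(next_state)
            # Semi-gradient step: the target is treated as a constant, and the
            # gradient of q_hat(S, A; theta) with respect to theta[A] is x.
            theta[action] += alpha * (target - theta[action] @ x) * x
            if not done:
                state, action = next_state, next_action
    return theta
```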