6.10 History
Most of the fundamental concepts underlying modern Actor-Critic algorithms were outlined in 1983 in Sutton, Barto, and Anderson’s paper “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems” [11]. This paper introduced the idea of jointly learning two interacting modules: a policy unit or actor, referred to in that paper as an “Associative Search Element” or ASE, and a critic, named an “Adaptive Critic Element” or ACE. The algorithm was motivated by neuroscience, a field that has often inspired deep learning research. For example, CNNs were originally inspired by the human visual cortex [43, 71]. Sutton et al. argue that modeling a structure as complex as a neuron with a simple, logic-gate-like unit is inadequate. They proposed the ASE/ACE as an initial exploration of systems that consist of a network of complex learned subunits. Their hope was that shifting from simple to complex primitive units would help solve significantly more complex problems.
It is likely that the name ACE was inspired by Widrow et al. [146], who, ten years earlier in 1973, used the phrase “learning with a critic” to distinguish RL from supervised learning (SL), which can be thought of as “learning with a teacher” [11].8 However, applying a learned critic to improve the reinforcing signal given to another part of the system was novel. Sutton et al. explain the ACE’s purpose very clearly:
It [the ACE] adaptively develops an evaluation function that is more informative than the one directly available from the learning system’s environment. This reduces the uncertainty under which the ASE will learn [11].9
Credit assignment and sparse reward signals are two of the most important challenges in RL. The ACE tackles credit assignment by using the critic to transform a potentially sparse reward into a more informative, dense reinforcing signal. A significant amount of research since this paper has focused on developing better critics.
The advantage function was first mentioned in 1983 by Baird [9]. However, it was in 1999 that Sutton et al. [133] defined the advantage as A^π(s, a) = Q^π(s, a) − V^π(s).
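To make this definition concrete, the following is a minimal Python sketch (not taken from either paper, and independent of any particular library) that estimates the advantage of a single transition by bootstrapping Q^π(s, a) ≈ r + γV^π(s′) from a learned state-value estimate. The function name and arguments are purely illustrative.

```python
# Minimal sketch: advantage A^pi(s, a) = Q^pi(s, a) - V^pi(s),
# with Q^pi(s, a) approximated by a single-step bootstrap
# r + gamma * V^pi(s'). The value estimates would normally come
# from a learned critic; here they are plain numbers.

def estimate_advantage(reward, v_s, v_s_next, gamma=0.99, done=False):
    """Single-sample advantage estimate: (r + gamma * V(s')) - V(s)."""
    q_estimate = reward + (0.0 if done else gamma * v_s_next)  # bootstrap Q^pi(s, a)
    return q_estimate - v_s  # subtract the state-value baseline

# Example: reward 1.0, V(s) = 0.5, V(s') = 0.6
adv = estimate_advantage(1.0, v_s=0.5, v_s_next=0.6)  # 1.0 + 0.99 * 0.6 - 0.5 = 1.094
```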
There was a surge of interest in Actor-Critic algorithms after the publication of the Asynchronous Advantage Actor-Critic (A3C) algorithm [87] by Mnih et al. in 2016. This paper introduced a simple, scalable method for parallelizing the training of reinforcement learning algorithms using asynchronous gradient ascent. We will look at this in Chapter 8.