Firstly, by analysing its biological function, I show that there is a best possible form of cognition, which gives greatest fitness. The results of this optimal cognition can be calculated in any domain -- such as foraging, navigation, vision, or conditioning. Secondly, internal representations and symbols emerge as a feature of near-optimal cognition. This gives a biological basis for cognitivism. Finally, we do not expect brains to evolve to the precise optimum. However, there are indications that animal cognition usually comes close to optimum efficiency, while today's cognitive models (such as neural nets or reinforcement learning models) have a wide range of efficiencies.
1. Much work in cognitive science has been solution-driven, starting from some computational mechanism (such as symbol processing or neural nets) and asking what phenomena it can account for. This paper starts at the other end, looking not at mechanisms, but at the biological requirement which brains meet. Analysing this requirement, I show that for any species (in a given habitat with given sense organs and capabilities) there is a best possible brain, which would help its owner survive better than any other. The performance of the optimal brain is defined by an equation (the Requirement Equation, abbreviated as RE), whose consequences can be calculated in any domain of cognition. The RE analysis applies in the presence of sensory noise, limited sense data, uncertainty or deterministic laws of the domain.
2. Just because there is an optimum, we do not expect animal brains always to attain it. The optimum is habitat-dependent, and so is a moving target; cognitive evolution can only chase this target at limited speed. I discuss from computational, evolutionary and empirical viewpoints how close animal brains come to the optimum. Tentatively, I conclude that they come rather close. So optimal cognition (the requirement equation) gives a calculable yardstick to measure any proposed model of cognition, and a defined target for models to aim at. The requirement analysis is a powerful complement to the mechanism-based approaches to cognition.
3. This section presents the Requirement Equation which defines, under fairly general conditions, the best a brain can do to ensure its owner's survival. It gives an upper limit to how good a brain can be (for an animal with given habitat and sense organs) and states what the brain must do to be that good.
4. We use a probabilistic model of situated action to find the optimum form of cognition. An animal's cognitive system helps it deal with certain classes of external states of affairs, "S". We may "slice reality" in one of several different ways to define the states; we may choose a class of states S which defines the food situation, ignoring all other details (S = "hungry", "replete"; for this the state of the animal's body may be regarded as "external" to its cognitive system); or a class which defines the predator situation, the social situation, or combinations of these, depending on the problem. We approximate the animal's life as a succession of discrete, independent encounters. Intelligence helps to maximise the chances of surviving each encounter. During any encounter, just one of the possible states of affairs S holds, with probability P(S). Sums over states are denoted by sigma s; so sigma s P(S) = 1. A state may be described by many variables, and so the sets of possible states may be very large.
5. The duration of encounters is chosen according to the aspect of intelligence which we are analysing. For some domains (particularly for learning cause-effect relations) long encounters are appropriate; then S may describe regularities which hold over an extended interval, such as "red berries are poisonous"; alternatively S may describe the situation at one moment, such as "predator over there". For any particular analysis, a single choice of duration for all states needs to be made.
6. The animal does not know directly which state S holds; it has some sense data, D, caused by the state. D denotes all sense data, of any modality, over the duration of the encounter, and may be described by many variables. The causal relation between a state S and resulting sense data D is described by a conditional probability P(D|S). Since some sense data must be observed, sigma d P(D|S) = 1.
7. Based on its sense data, the animal chooses to take some action A. Typical sets of possible actions are "go out for food/stay in", "fight/flee", etc. Depending on the state S and the action A, there is an outcome O; typical outcomes are "found food", or "caught by predator". The causal link between actions, states and outcomes is described by a conditional probability P(O|S,A).
8. Each outcome has a value V(O) to the animal. Values may be measured in different currencies such as net energy intake, probability of finding a mate, etc. All currencies are convertible, to a contribution to the animal's Darwinian (or more generally, inclusive) fitness. Currencies are like those used in evolutionary game theory (Maynard Smith, 1982) and foraging theory (Stephens & Krebs, 1986).
9. The animal needs to choose an action A to maximise the average value V of the possible outcomes O. By maximising these values for many different situations throughout its life, the animal maximises its fitness. The way it chooses an action A, given sense data D, is summarised in a decision function F(A, D). (F cannot depend on the state S, as the animal does not know S.) The animal acts as if, given sense data D, it calculates F(A, D) for each possible action A, and chooses the action with the largest F. These decision functions can describe any relation between sense data and actions; they can describe any input-output relation, to give an external, functional specification of any possible brain. Figure 1 shows the relations between the concepts. Our problem is: what set of decision functions F give best fitness?
        S --P(D|S)--> D --F(A,D)--> A --P(O|A,S)--> O --V(O)--> V
Figure 1: Flow of causation for the probabilistic model of cognition. Time flows from left to right, and arrows flow left-to-right from causes to effects. The horizontal dashed line denotes the interface between the animal and its environment. The boxes denote -- S: state of affairs; F: decision rule; D: sense data; A: action; O: outcome; V: value. Probabilities P are explained in the text. State S causes sense data D; for each possible action A the animal calculates a decision function F(A,D) and chooses the action A with largest F; this gives outcomes O with probability P(O|A,S) and value V(O) to the animal.
10. This framework can be applied to any aspect of cognition, by appropriate choices of states, sense data, actions and outcomes; it can apply to problems of learning, perception or choosing actions. As a simple example, a small mammal has to decide whether to leave its burrow to search for food. There are two states of affairs S, "Predator about" and "No predator about". The sense data D are visual or other clues to the presence of the predator. Actions A are "Go out for food" and "Stay at home". Outcomes O are "Hungry", "Satisfied", or "Dead".
11. The framework involves approximations (e.g., dividing an animal's life into a number of discrete encounters) but most cognitive problems can be formulated so as to make the approximations good ones. In this respect it is similar to the analysis of many other scientific problems. The main class of cognition which it does not cover relates to exploratory behaviour, where information (of value for future decisions) can be gathered, but at a cost -- where tradeoffs between information and other currencies must be made. For these cases, extensions to the framework are required.
12. It can be proved within the framework that the best possible set of decision functions is:
F(A, D) = sigma s P(S) P(D|S) sigma o P(O|S, A) V(O) (1)
For given sense data D, the best action A is that which maximises this F(A,D). No other form of F, or different decision rule for A, can give better fitness. Equation (1) is therefore the requirement for animal cognition -- it defines the optimum form of output which any brain could give. As it derives entirely from the biological requirement for cognition (not from any assumptions about brain mechanisms) I shall call it the Requirement Equation.
13. We can multiply all the F by any positive common factor without altering the choice of action, so the fitness is unchanged. We can therefore rewrite (1) in two parts:
Q(S) = P(S) P(D|S) / [sigma s' P(S') P(D|S') ] (2)
F(A, D) = sigma s Q(S) sigma o P(O|S, A) V(O) (3)
(2) is Bayes' theorem, applied to find the most likely states S (with largest Q(S)) in the light of the sense data D. P is the prior probability of some state S, and Q is the posterior probability of state S, given the sense data D. Then (3) chooses the best possible action, in the light of the likely states S (those with large Q). For many problems, finding the likely states by (2) is the hard part, and then choosing an action by (3) is comparatively simple.
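The two-part form (2)-(3) can be sketched in a few lines of code, using the burrow example of paragraph 10. This is a minimal illustration only: all the numbers below (prior predator probability, clue reliabilities, outcome values) are assumed for the sake of the example, not drawn from data.

```python
# Burrow example: should the animal go out for food, given a predator clue?
# All probabilities and values below are illustrative assumptions.
states = ["predator", "no_predator"]
P_S = {"predator": 0.2, "no_predator": 0.8}                 # priors P(S)

# P(D|S): probability of each item of sense data in each state
P_D_S = {("clue", "predator"): 0.7, ("clue", "no_predator"): 0.1,
         ("no_clue", "predator"): 0.3, ("no_clue", "no_predator"): 0.9}

# sum over outcomes O of P(O|S,A)*V(O), collapsed into one expected value
# (here V(dead) = -100, V(satisfied) = +5, V(hungry) = -2)
EV = {("go_out", "predator"): 0.5 * (-100) + 0.5 * 5,       # may be caught
      ("go_out", "no_predator"): 5.0,                       # finds food safely
      ("stay", "predator"): -2.0,                           # hungry but safe
      ("stay", "no_predator"): -2.0}

def best_action(D):
    # (2): posterior Q(S) by Bayes' theorem
    norm = sum(P_S[S] * P_D_S[(D, S)] for S in states)
    Q = {S: P_S[S] * P_D_S[(D, S)] / norm for S in states}
    # (3): F(A,D) = sum_S Q(S) * sum_O P(O|S,A) V(O); choose the largest F
    F = {A: sum(Q[S] * EV[(A, S)] for S in states)
         for A in ("go_out", "stay")}
    return max(F, key=F.get), F
```

With a clue the posterior predator probability rises and "stay" wins; with no clue, "go_out" wins. The point of the sketch is only the shape of the computation: Bayes' theorem in (2), then expected value per action in (3).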
14. One striking consequence of equation (1) is that there is an optimum form of cognition -- that for a species in a given habitat with given sense organs and choices of action, its brain cannot carry on getting better and better without limit.
15. This goes against an intuition that brains are computers, and we can carry on building better and better computers. We can understand it by analogy with gambling. Brains make decisions under uncertainty; they help an animal to take gambles on survival. Each choice of action is a bet with unknown outcome, and the purpose of a brain is to help the animal make the best bet each time. In many games of chance (e.g., blackjack) there is a best possible bet for each circumstance. If you can find this, and place your money, no more elaborate computation serves any purpose. An animal's brain needs just to calculate the best bet.
16. There are cases where the environmental probabilities in the requirement equation depend on the behaviour of conspecifics -- which in turn are output of the equation. These arise in competitive behaviour, such as the competition for mates. In these cases the optimum solution may be an Evolutionary Stable Strategy (ESS) (Maynard Smith, 1982). The evolution of competitive behaviour can be analysed by the requirement equation.
17. Equation (1) looks, at first sight, to be just an application of Bayes' theorem; you might wonder whether, for some complex domain of cognition, we can get anything useful out of it. It turns out that the probability functions which enter into equation (1) often have a great deal of structure, which leads to detailed and striking predictions for the form of cognition.
18. As a simple example of the requirement equation, consider the behaviour of the desert ant Cataglyphis bicolor, studied by Wehner and Srinivasan (1981). This animal forages up to 100 metres away from its burrow, covering a path which may be much longer; but when it finds food, it runs straight back in the direction of its burrow and stops to search for the burrow when the anticipated distance has been travelled, to within a very few metres. Experiments by Wehner & Srinivasan show that this is done not by use of local landmarks near the burrow, but by dead reckoning; the ant can keep track of its overall displacement from its burrow over a long series of movements.
19. Consider a simplified case in which the ant makes only two linear displacements (x1,y1) and (x2,y2) before finding food, using Cartesian coordinates (x,y) where the x axis is the direction of the sun, and measuring distance in metres. To analyse this case, we use Gaussian peak functions g(x,w) = exp(-x*x/(2*w*w)), denoting a function with a peak at x = 0 and width w. The space of possible states S has six dimensions (x1,y1,x2,y2,x,y), where (x,y) is the total displacement of the ant from its burrow. The probability P(S) is then a probability density in the six dimensions, P(S(x1,y1,x2,y2,x,y)) = a*g(x1,30)*g(y1,30)*g(x2,30)*g(y2,30)*g(x1 + x2 - x, e)*g(y1 + y2 - y, e), where e is small and a is a normalisation constant. This denotes that the two displacements (x1,y1) and (x2,y2) have broad Gaussian distributions of width 30 metres, and that the total displacement (x,y) is related to them by the geometric constraints x = x1 + x2 and y = y1 + y2. Because the constraints are precise, e is small.
20. The ant's sense data D consist of some approximate measures (X1,X2) and (Y1,Y2) of the individual displacements it makes when foraging. If its sensory measurement errors are +-2 metres for each displacement, the probability density P(D|S) = b*g(X1 - x1,2)*g(Y1 - y1,2)*g(X2 - x2,2)*g(Y2 - y2,2).
21. When it finds food after the two displacements, the ant's choice of action A is a choice of a displacement (Xr,Yr) to run home. The outcome O is that it ends up at a final displacement (Xf,Yf) from its burrow. The probability density for these outcomes is P(O|S, A) = c*g(Xf - x - Xr,e)*g(Yf - y - Yr,e) with small e; this sharply peaked probability function reflects the precise geometric constraints Xf = x + Xr, Yf = y + Yr. Finally, the value (to the ant) of this outcome O is highest when it ends up near its burrow (say within 3 metres), so V(O) = k*g(Xf,3)*g(Yf,3). Here b, c, and k are normalisation constants.
22. We now have all the terms to feed into the requirement equation (1). The sums over states S and outcomes O are integrals over the variables (x1, y1, x2, y2, x, y) and (Xf,Yf) respectively, which can be done analytically as they are just integrals of Gaussians. The result is F(A,D) = f*g(X1 + X2 + Xr, w)*g(Y1 + Y2 + Yr, w), where f is a constant and the width w = 4.2 metres. The best choice of action (Xr, Yr) is then the one which maximises F with respect to Xr and Yr, i.e., Xr = -X1 - X2, Yr = -Y1 - Y2.
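The Gaussian integrals can also be checked numerically. The sketch below (Python with numpy, one coordinate only, with assumed sense data X1 and X2) evaluates the requirement equation (1) on a grid of states and candidate homing runs, and confirms that the best action is Xr close to -(X1 + X2).

```python
import numpy as np

def g(x, w):  # Gaussian peak function, in the paper's notation
    return np.exp(-x**2 / (2 * w**2))

# Assumed sense data: noisy measurements of the two displacements (x only)
X1, X2 = 40.0, -15.0

# Grid over the true displacements x1, x2 (the state space, one coordinate)
x = np.linspace(-120, 120, 481)
x1, x2 = np.meshgrid(x, x, indexing="ij")

# P(S) * P(D|S): broad priors (width 30 m) times sensor likelihoods (width 2 m)
post = g(x1, 30) * g(x2, 30) * g(X1 - x1, 2) * g(X2 - x2, 2)

# For each candidate homing run Xr, sum over states of the expected value
# V = g(final position, 3); the final position is x1 + x2 + Xr
Xr_grid = np.linspace(-60, 60, 241)
F = [np.sum(post * g(x1 + x2 + Xr, 3)) for Xr in Xr_grid]
best = Xr_grid[np.argmax(F)]   # close to -(X1 + X2) = -25
```

The brute-force sum over the state grid stands in for the analytic Gaussian integrals; the peak of F falls at the homing run the analysis predicts.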
23. This analysis makes no assumptions about the internal mechanisms of the ant's brain; it simply says what result those mechanisms must give if they are to contribute as effectively as possible to the ant's survival.
24. In this case the requirement equation leads to an intuitively plausible answer -- that the best form of cognition for the ant is to keep track of a sum of its two-dimensional displacements from its burrow as it moves, and when it finds food to choose just this vector (reversed) to run home. In other words, the ant needs to maintain an internal representation of its own position relative to its burrow. The requirement equation implies that no other form of cognition will give better average survival for the ant.
25. While the requirement equation (1) has a superficial similarity to the mathematics used in reinforcement learning models (RLM) (Machine Learning, 1992), this example shows how completely different the two approaches are in practice. In RLM, there is no detailed model of the prior probabilities as above, so the ant would initially have very poor performance, only learning approximately to re-find its burrow after very many (typically thousands of) trials -- the very slow improvement typically given by reinforcement learning. In the requirement analysis above, the detailed form of the probabilities P(S), P(D|S), etc., embody the constraints of two-dimensional geometry precisely, so that no learning at all is required; if the ant's brain somehow embodies those constraints, it goes straight for home each time.
26. Reinforcement learning is a general-purpose approach which usually does not embody domain constraints in a very precise form. By contrast, the performance implied by the Requirement Equation precisely embodies the constraints of the domain -- as structures in the probability functions -- to give much higher performance, in fact the best performance possible. Another difference is that while reinforcement learning theory proposes a learning mechanism, the requirement equation does not define a mechanism, but states what performance a mechanism is required to give, to be optimal for the animal. The experimental question is whether animals are like reinforcement learners -- with little domain structure encoded in their brains and using general, slow, learning mechanisms -- or are close to the Requirement Equation, with domain-specific intelligence and near-optimal performance. Wehner's observations of the desert ant imply that it is close to the Requirement Equation optimum, and is not like a Reinforcement Learner.
27. In reinforcement learning, each ant would need in effect to learn the laws of two-dimensional geometry for itself by trial and error; but an ant does not have the luxury of making so many mistakes. The requirement equation describes how the ant behaves if the laws of geometry have been built into its brain by natural selection of ants which re-found their burrows more reliably than others. Since the laws of geometry have been true for all time, there has been plenty of time for selection to operate, and it is not surprising that desert ants are close to the requirement equation, rather than being reinforcement learners.
28. As a second example of the requirement analysis, consider associative conditioning, as when a rat learns a tone-shock association. In this analysis, the states S refer to the underlying causal regularities of the rat's environment, and the sense data D refer to the outcomes (tones and shocks) which it experiences in a particular sequence of N trials (which constitute one encounter of the RE analysis).
29. There are two sets of states: S1 (in which there is no causal relation between tone and shock), and S2 (in which the tone causes a shock with probability p). In states S1, the probability of a tone on any trial is t, and the probability of a shock is s, independent of whether a tone has occurred on that trial. In states S2 the probability of a tone is t', but then the probability of a shock is either s' = P(shock|no tone) or p = P(shock|tone). Thus S1 is a continuum of states defined by s and t, while S2 is a continuum of states defined by p, s', and t'.
30. The prior probabilities P(S) in the requirement equation are again probability densities in s, t, etc. To reflect the rat's lack of prior knowledge about the frequencies of tones and shocks, we set P(S1(s,t)) = 1 - e, P(S2(s',t',p)) = e; thus all values of s and t are equally likely, and all values of p, s' and t' are equally likely. The constant e is small, reflecting the small prior probability (in the rat's natural environment) that there is any causal relation between disparate stimuli such as tones and shocks.
31. In the sequence of N trials, the rat experiences (a) trials with tone only, (b) with shock only, (c) with both tone and shock, and (d) with neither tone nor shock, so a + b + c + d = N. These are its sense data D. The probabilities P(D|S) are then given by elementary probability theory:
P(D|S1) = [t**(a+c)][(1-t)**(b+d)][s**(b+c)][(1-s)**(a+d)]
P(D|S2) = [t'**(a+c)][(1-t')**(b+d)][s'**b][(1-s')**d][p**c][(1-p)**a]
32. This time we use the requirement equation in the form of (2) and (3), and concentrate on (2) to analyse the posterior probabilities Q(S1) and Q(S2), not being so much concerned with the rat's choices of action or the values of those actions in (3). If Q(S2) is much larger than Q(S1), the rat "believes" in a tone-shock association and acts accordingly; if Q(S2) is smaller than Q(S1), it does not believe in an association. With no sense data (N=0), Q(S1) = 1-e and Q(S2) = e with small e, so it does not initially believe in a causal relation.
33. The sum over states in (2) is a sum of two terms -- an integral over s and t for Q(S1), and an integral over s', t' and p for Q(S2). These are integrals of polynomials and the result can be expressed as factorials. If a/c is similar to (a+d)/(b+c), the frequency of shocks appears uncorrelated with the occurrence of a tone, and as N increases Q(S1) and Q(S2) diminish at a similar rate, so the rat does not come to believe in a causal relation; otherwise, Q(S2)/Q(S1) grows rapidly with increasing N, and the rat comes to "believe" in the causal relation (and act accordingly) as soon as the evidence for the causal relation, in its sense data, is statistically significant.
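These integrals can be written out directly: with uniform priors, each integral of a polynomial such as t**(a+c) * (1-t)**(b+d) is a Beta function, expressible in factorials. The sketch below computes the posterior odds Q(S2)/Q(S1) this way, using only the Python standard library; the trial counts and the prior e are assumed for illustration.

```python
from math import lgamma, exp

def log_beta(m, n):
    # log of the Beta function B(m,n) = (m-1)!(n-1)!/(m+n-1)!
    return lgamma(m) + lgamma(n) - lgamma(m + n)

def belief_ratio(a, b, c, d, e=0.01):
    """Posterior odds Q(S2)/Q(S1) that tone causes shock, after a trials
    with tone only, b with shock only, c with both, d with neither."""
    # S1: integrate t**(a+c)(1-t)**(b+d) * s**(b+c)(1-s)**(a+d) over t, s
    log_m1 = log_beta(a + c + 1, b + d + 1) + log_beta(b + c + 1, a + d + 1)
    # S2: integrate over t', s' and p with the exponents of P(D|S2)
    log_m2 = (log_beta(a + c + 1, b + d + 1)
              + log_beta(b + 1, d + 1) + log_beta(c + 1, a + 1))
    return (e / (1 - e)) * exp(log_m2 - log_m1)

# Twelve paired tone+shock trials and twelve blank trials: strong evidence
strong = belief_ratio(a=0, b=0, c=12, d=12)
# Tones and shocks uncorrelated: the odds stay small, near the prior
weak = belief_ratio(a=6, b=6, c=6, d=6)
```

With strongly correlated trials the odds grow to favour S2 by orders of magnitude after a dozen pairings, while uncorrelated trials leave the odds small, matching the conditioning results described below.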
34. This Bayesian analysis agrees with several known features of associative conditioning (Dickinson, 1980) -- that animals can learn an association in a fairly small number of trials (e.g., a dozen); that a tone-no shock association can be learnt just as well as a tone-shock association; that uncorrelated presentations of tone and shock generally lead to no learning; that strong correlations lead to faster learning; and that some associations (e.g., tone-shock or food-nausea) are learnt faster than others (e.g., food-shock) depending on the prior probability e of the association in the animal's environment.
35. A similar analysis can be used for more complex experimental protocols (with several phases, different conditioned and unconditioned stimuli) and agrees with many other well known conditioning results -- such as habituation, blocking, overshadowing, and learned irrelevance.
36. The two examples in the previous section represent opposite ends of a spectrum -- one governed by well-defined laws of geometry, with no learning required, and the other in a probabilistic domain and requiring learning. In both cases the predictions of the requirement equation agree well with observations, implying that for these cases, animals come close to the requirement equation optimum.
37. More complex examples at either end of the spectrum, or hybrids in the middle (with both deterministic and probabilistic aspects, and requiring learning) can be analysed in a similar manner. Here I summarise the results of several such analyses which I have made.
38. The analysis of the desert ant navigating with one landmark (its own burrow) can easily be extended to navigation with several landmarks. In this case the optimal performance of the requirement equation can be attained by the animal constructing an internal cognitive map of its habitat, which represents the positions of all known landmarks as well as its own position, and obeys the constraints of two-dimensional geometry. In this analysis the space of possible states S has at least 2L dimensions if there are L landmarks, and Q(S) is a peaked density in this space. P(S) is rather featureless in the 2L dimensions, but P(D|S) has peaks defining the positions of the landmarks. Evidence that many species use cognitive maps (e.g., O'Keefe & Nadel, 1978; Worden, 1992) confirms that they are all near the requirement equation optimum in this respect.
39. Another important class of examples involves the choice and control of bodily movements, using sense data about the movements of the body and the positions and movements of other local objects (e.g., branches to swing from). All of these objects obey deterministic laws of motion, which, like the laws of geometry in the previous example, are reflected as sharp Gaussian peaks in the probability densities which occur in equation (1). Maximising the expected outcome F(A,D) in equation (1) involves maximising a product of these peak functions subject to the mutual constraints which they embody.
40. Just as the navigation problem can be solved by constructing a cognitive map which obeys the laws of geometry, so the local movement problem can be solved by constructing an internal simulation of the movements of objects, subject to the geometry and laws of motion of those objects -- rigidity, inertia, etc. This simulation finds the peaks of the probability densities in (1) for the optimal choice of action; or equivalently, it predicts the motions of objects for best choice of action to avoid them, eat them, and so on.
41. Generally, whatever the laws of motion of the objects involved, and whatever the sense data which are used to track them (e.g., whatever noise and distortion are in the sense data, and however incomplete they are), the optimal solution is to construct an internal representation of the objects' movements, subject to their laws of motion and to the constraints from the sense data. The requirement analysis shows that this is optimal for any problem of motion, and any sense data.
42. Deterministic problems need not involve any learning, if the deterministic laws and constraints have been built into the animal's brain by selection. At the other end of the spectrum, the associative conditioning analysis can be extended to encompass many other cases where learning is needed.
43. In these cases, a near-optimal learner is a Bayesian learner, and will learn any causal regularity of its habitat just as soon as the evidence for that regularity (in its sense data) is statistically significant -- no sooner and no faster (jumping to premature conclusions is as harmful as slow learning). This depends on having a moderately realistic set of prior probabilities for possible causal relations, built into the species' brain by selection. Generally it gives fast learning, from a few learning examples rather than the thousands required by general-purpose neural nets or other reinforcement learners. Optimal learning is fast because it is domain-specific, using well-honed prior probabilities.
44. An interesting hybrid case, which involves both internal representations and Bayesian learning, concerns primate social intelligence (Worden, 1996). The social domain has its own "laws of motion" (e.g., if a monkey is angry, he may threaten you), but these usually involve discrete variables -- such as sex, kin relations, or rank -- rather than the continuous variables of navigation or motion problems. So in this case, the optimal solution involves constructing a discrete (symbolic) internal representation of the social situation, to predict others' actions.
45. Primates may also need to learn new causal regularities of the social domain, and the Bayesian learning analysis applies to this learning. Certain forms of causal relation are a priori much more likely than others; for instance, a rule of the form (A): "if you bite X, then X may bite you back" is more likely to be true than a rule of the form (B): "if you bite X, then Y may bite you back". This is reflected in their prior probabilities in the requirement equation, so that social regularities like (A) are learnt faster than regularities like (B); they can be learnt from a few examples.
46. Again, observation shows that primates are good social predictors and fast social learners -- implying that their behaviour is close to the optimum of the requirement equation (Worden, 1996).
47. The requirement equation can also be used to analyse foraging problems, and gives the same Bayesian predictions as have been independently derived and confirmed in optimal foraging theory (Stephens & Krebs, 1986).
48. From these various examples, the following features emerge:
(1) The requirement equation has a rich structure: The space of states S is usually a complex multi-dimensional space with many discrete and continuous variables. Writing down realistic approximate forms for the probabilities in the equation gives a rich mathematical structure with many detailed consequences.
(2) It works in any domain of cognition: If we can relate some aspect of cognition to a problem the animal faces in its habitat, and to its choices of action, then we can write down the probabilities and analyse the required cognition by the equation. The examples given above illustrate its breadth of applicability.
(3) Useful simple approximations can be made: We cannot write down the precise ecological probabilities of the states etc., which enter into the equation. But we can usually approximate them by simple and realistic factorising forms.
(4) It is computationally tractable: While brute force summation of equation (1) may have prohibitive computational cost, in every domain I have examined one can devise simple heuristics to calculate the sums approximately, at acceptable computational cost.
(5) It gives reasonable results: In all cases, the results broadly agree with reasonable expectations, and often give useful new insights. They are stable against small changes in the probability functions.
(6) It uses representations: Internal representations (or models) of the external situation, which obey any universal constraints of the domain, emerge as a general feature of optimal cognition (Marr, 1982; Johnson-Laird, 1983; Shepard, 1984; Gallistel, 1990). The universal constraints may be laws of motion, or geometry, or illumination, or probability, depending on the domain.
49. As the examples of section III illustrate, the choice of time intervals, assumptions, variables and functional forms to parameterise the terms in the RE is not always straightforward; it requires some judgement to choose forms which are a good approximation to what goes on in nature, yet are in some sense mathematically tractable. Often the states and probabilities need to be defined in spaces of high dimensionality. In this sense the RE is like some very general formulations of the behaviour of physical systems (e.g., the principle of least action, or the Partition Function); physical (or in this case biological) judgement is required to turn them into good and tractable approximations to reality. But it can be done with practice.
50. Similarly, the optimal solution to one problem might compromise the solutions to others; in this case it might be necessary to regard the two as facets of one overall decision problem, to be analysed by the RE. Choices like this are similar to the choice of system boundary in analysing physical problems, and require good judgement to make workable and realistic approximations.
51. In taking a requirement-based approach, we do not suggest that animal brains literally calculate the double sum of equation (1); but if brains are to come close to the optimum, they must somehow give an equivalent result (a similar choice of action in most cases). The challenge is to discover what biological mechanisms give such a result, and how close or far from the optimum the result actually is.
52. Some have argued against the use of representations and symbol systems from a "situated action" perspective (Brooks, 1991; Clancey, 1993; Edelman, 1992); they claim that in the presence of uncertainty, change, and limited sense data, it is too difficult to form internal representations which are adequate to guide action, and it is better to somehow "react" to the external situation. This paper shows that in just those conditions of change and uncertainty, the best choice of action results from the use of internal representations. Selection pressure always leads towards optimal cognition, using representation (Vera & Simon, 1993; Hayes et al., 1994).
53. The simplest case in which the RE leads to an internal representation is the navigation of the desert ant, analysed in section III; a similar argument leads to representations as the optimal solution in more complex cases too. Representation is a general way to find the peaks of the probability functions in (1), and so to find the choice of best action.
54. The argument for this result starts precisely from a situated perspective of the animal in its changing uncertain habitat; perception, thought, and action are integrated in the requirement equation, leading to the cognitivist, representationalist conclusion. The burden of proof is on the anti-representationalists, to say why nature should shun the best solution.
55. Marr (1982) proposed that cognition can be studied on three levels -- computation, algorithm, and mechanism. At the computational level, he analysed what information is needed, and what constraints can be applied, to extract information from sense data -- for instance, to extract 3-D structure from a moving visual field. The requirement equation gives biological motivation for Marr's computational level; it shows how the universal constraints of any domain (such as the laws of illumination and optical flow) enter into the form of optimal cognition -- how they define the target, towards which cognition evolves. We might re-name Marr's computational level the requirement level, making it clear that this is not a level of brain architecture, but a consequence of the biological requirement outside the brain (Sejnowski & Churchland, 1991).
56. Just because an optimum form of cognition exists, we should not expect nature always to attain it (Gould & Lewontin, 1979; Dupre, 1987; Lewontin, 1987). We need to analyse, both theoretically and empirically, just how close animal cognition comes. There are three possible reasons why cognition might not attain the optimum of the requirement equation:
(1) The necessary computations are too complex or costly; with limited brainpower, animals cannot do them.
(2) Brain evolution gets stuck on some local maximum in the design space.
(3) Since the optimum is environment-dependent, it is a moving target, and brain evolution never catches up with it.
57. In the domains I have examined (see above), while a precise computation of the optimum result would be very hard, it is always possible to make good simplifying approximations, which lead to tractable computations (e.g., internal simulations); given the great computational power of the mammalian brain, I tentatively conclude that for most domains, computational cost and complexity are not a problem.
58. Might brain evolution get stuck at a local maximum? Gould (1980) uses the example of the Panda's thumb to illustrate how evolution adapts whatever is at hand for the job; which may be far from optimal. Many believe that similar local maxima occur in the evolution of brains; so that the brains we see today are sub-optimal botches.
59. We can prove a result which suggests not. Suppose a species has decision functions F which are sums of some basis functions G(q,D):
F(A, D) = Σ_q H(q, A) G(q, D)    (4)
and that the coefficients H(q, A) are under genetic control. If there is a set of values H0(q, A) which give the optimal decision functions of equation (1), then we can show that there are no other local maxima in the space of H. The peak is unique (it is actually a peak region because, for instance, all H can be scaled by the same factor to give the same choices of action).
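To make equation (4) concrete, here is a minimal sketch of evaluating the decision functions F(A, D) as weighted sums of basis functions and choosing the highest-valued action. The Gaussian basis functions, the two actions ('flee', 'forage'), and all coefficient values are hypothetical illustrations, not taken from the paper:

```python
import math

def decide(H, G, actions, D, qs):
    """Evaluate F(A, D) = sum over q of H(q, A) * G(q, D), as in
    equation (4), and return the action with the largest value."""
    def F(A):
        return sum(H[(q, A)] * G(q, D) for q in qs)
    return max(actions, key=F)

# Hypothetical toy setup: Gaussian basis functions over a
# one-dimensional sense datum D, centred at -1 and +1.
def make_basis(centres):
    return lambda q, D: math.exp(-0.5 * (D - centres[q]) ** 2)

qs = [0, 1]
G = make_basis({0: -1.0, 1: +1.0})
H = {(0, 'flee'): 1.0, (1, 'flee'): 0.1,
     (0, 'forage'): 0.1, (1, 'forage'): 1.0}
actions = ['flee', 'forage']

# A sense datum near the first basis centre favours 'flee'.
print(decide(H, G, actions, D=-0.9, qs=qs))  # -> flee
```

In this toy space, varying the coefficients H while holding G fixed is exactly the genetic search discussed in the following paragraphs.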
60. For every point in the multi-dimensional space of H(q, A), the shortest path (the straight line from H(q, A) to H0(q, A)) is a non-descending hill-climb to the peak. (This is proved by defining H(q, A) = H0(q, A) + l*J(q, A) and varying l from zero upwards. We can show that if for some l' the decision functions F(A, D) give a non-optimal choice of action A, then for all l > l', F gives the same choice or a worse one; so fitness decreases monotonically with increasing l.)
61. If brain evolution explores the space of weights H, it cannot get stuck at any false local peaks in fitness, because there are none. Evolution can lead to the true peak -- and may do so by the shortest, straight-line, path. Pleiotropy may divert or delay the climb to the peak, but not forever. So it is unlikely that brain design has got trapped on a local maximum for all species. Brain design can approach arbitrarily close to the optimum.
62. However, the rate of approach is limited. Optimum cognition, as defined by equation (1), depends on environmental probabilities. For a brain to give optimal cognition, information about those probabilities must be somehow genetically encoded in the brain. As the precise probabilities can only be defined with infinite information, a brain cannot be exactly optimal with limited genetic information.
63. Useful extra information about the probabilities can only accumulate (in the design of brains, through natural selection) at a limited rate. Worden (1995) derives a speed limit for evolution, which defines this maximum rate. Genetic information in the phenotype (GIP) is measured in bits. If the variation in the probability of survival to adulthood, due to variations between individuals in some facet of their phenotype, is +- X percent of the mean survival probability, and the increase in GIP per generation is dG/dn, then the speed limit states that:
dG/dn < X/20 (5)
So a variability of +- 10% of P in individuals' survival probability P can give an increase in useful genetic information of at most 1/2 bit per generation. Evolution cannot improve the design any faster than this. In practice, for reasons discussed in Worden (1995), we expect the limit to be at least four times lower. It is independent of population size.
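The arithmetic of the speed limit (5) can be checked directly; the 1000-bit target below is the illustrative figure used later in the paper, not a measured quantity:

```python
def max_bits_per_generation(X_percent):
    """Upper bound on useful genetic information gained per generation,
    from the speed limit dG/dn < X/20 (equation 5), X in percent."""
    return X_percent / 20.0

# +-10% survival variability: at most 0.5 bit per generation.
limit = max_bits_per_generation(10)

# In practice, expect the limit to be at least four times lower.
practical = limit / 4          # 0.125 bit per generation

# Generations to accumulate, say, 1000 bits of brain-design information:
print(1000 / practical)        # -> 8000.0
```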
64. The proof of equation (5), which is given in (Worden, 1995), formalises the intuition that survival of, say, one offspring out of eight can (on average) only convey up to log2(8) = 3 bits of useful information about the design of the survivor; so a species with 8 offspring per adult can improve its design by only 3 bits per generation, spread across all its traits.
65. Therefore useful information about the probabilities in equation (1) can only accumulate, in brain design through natural selection, at a very slow rate -- typically a small fraction of a bit per generation (NOTE 1). Over thousands of generations, this may lead to a very precise reflection in the brain of unchanging features of the environment, particularly of universal laws which never change (Shepard, 1984); but for changing features (e.g., caused by the evolution of other species; Van Valen, 1973) brain evolution can never catch up with the optimum. Another factor which changes -- and which alters the optimum towards which cognition evolves -- is the physical makeup of the species, including its sensory apparatus. Brain evolution is pursuing an optimum which depends on all these changing factors.
66. That is the main reason why we do not expect cognition to be optimal. The precise form of the tradeoff -- how far evolution lags behind the optimum -- depends on the rate of environmental change, the amount of information needed to encode environmental probabilities, and the efficiency with which they are encoded. For any aspect of cognition, one can estimate the rough magnitude of these effects.
67. We next discuss empirically how close cognition actually comes to the optimum. For this, we define a scale of cognitive efficiency E, between 0 and 1, for cognitive subsystems. E = 1 is the optimum performance of the requirement equation, and E = 0 corresponds to no subsystem at all. Points in between are defined by interpolation; if optimal performance gives lifetime survival probability P1, complete failure of the cognitive subsystem gives P0, and the animal's actual subsystem gives P, then it has efficiency E = log(P/P0)/log(P1/P0). (This definition is not the same as Fisher's statistical efficiency, used in some analyses of perceptual efficiency.)
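The interpolation defining E is a one-line computation. The survival probabilities below are hypothetical values chosen only to illustrate the scale:

```python
import math

def cognitive_efficiency(P, P0, P1):
    """Cognitive efficiency E = log(P/P0) / log(P1/P0), where P0 is
    lifetime survival probability with no subsystem at all, P1 with
    the optimal subsystem, and P with the actual subsystem.
    E = 0 when P = P0, and E = 1 when P = P1."""
    return math.log(P / P0) / math.log(P1 / P0)

# Hypothetical illustrative values: an animal whose subsystem gets it
# most of the way from 'no subsystem' (P0) to the optimum (P1).
print(cognitive_efficiency(P=0.35, P0=0.1, P1=0.4))  # about 0.90
```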
68. We focus on higher animal species (mammals and birds) which devote large resources (body weight, energy, genes) to cognition. What cognitive efficiencies do they have? While I have not made detailed comparisons, there is suggestive evidence that mammalian and avian cognitive efficiencies are high (roughly, above 0.8) for many cognitive subsystems. To summarise the evidence:
(A) Foraging: The Bayesian theory of foraging behaviour (Stephens & Krebs, 1986) can be derived from the requirement equation. There are extensive comparisons of this theory with data on birds and mammals, showing good agreement. In finding food, animals seem very often to make Bayesian optimal decisions (e.g., whether to stay or move on, when a food patch is becoming depleted) and so they come close to the optimum of the RE.
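The stay-or-move decision mentioned above can be illustrated with a marginal-value-style sketch in the spirit of Stephens & Krebs (1986). This is a toy rule with hypothetical rates, not the paper's derivation from the requirement equation:

```python
import math

def should_leave(initial_rate, depletion, time_in_patch, habitat_mean_rate):
    """Marginal-value-style patch-leaving rule: stay while the patch's
    expected instantaneous gain rate exceeds the habitat-wide average;
    leave once depletion drives it below that average."""
    current_rate = initial_rate * math.exp(-depletion * time_in_patch)
    return current_rate < habitat_mean_rate

# Hypothetical rates: a patch starting at twice the habitat average,
# depleting exponentially with time spent in it.
print(should_leave(2.0, 0.5, 1.0, 1.0))  # -> False (stay early on)
print(should_leave(2.0, 0.5, 2.0, 1.0))  # -> True  (move on when depleted)
```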
(B) Associative Conditioning: The RE analysis of simple conditioned behaviour was sketched in section III, and can be extended to analyse more complex experimental configurations. The resulting predictions are in broad agreement with many known features of conditioning, such as rapid learning, habituation, blocking, overshadowing, stimulus type dependence, and learned irrelevance (Dickinson, 1980). Gallistel (1990) has shown that other conditioning data such as duty cycle effects are consistent with this near-optimal, representational, model; and Anderson (1990) analyses conditioning data from an optimality standpoint similar to the RE, showing good agreement. For operant conditioning, optimality predictions agree broadly with data, except in highly artificial training regimes (Staddon, 1987) where it is difficult to analyse what an optimal choice would be. All these results imply that conditioning is a very efficient means for animals to learn the causal regularities of their habitat, as fast and reliably as they can be learnt; it is a near-optimal mechanism to help them survive, and so has high cognitive efficiency.
(C) Vision: For many visual tasks, human vision is known to come close to fundamental physical and Bayesian statistical limits (Barlow, 1978, 1980; Burgess, 1990) -- confirming that some parts, at least, of our visual system are near-optimal.
(D) Navigation: Many primate species are good navigators; experiments on navigation in other species (such as rats) are very hard to conduct, precisely because they are so good at it. In a range of species (O'Keefe & Nadel, 1978; Morris et al., 1982; Collett et al., 1986; Worden 1992), behavioural data seem to require a sophisticated cognitive mapping faculty (as predicted from the RE analysis), and suggest that animals often navigate nearly as well as their sense data allow, with high cognitive efficiency.
(E) Human Cognition: Anderson (1990) has compared many aspects of human cognition -- memory, categorisation, causal inference and problem solving -- with an optimal cognition principle, his Principle of Rationality, very similar to that discussed here; he finds good overall agreement with the data. This indicates that these aspects of human cognition have high cognitive efficiency.
69. This diverse evidence suggests that many cognitive efficiencies are high. I have only been able to sketch the evidence here, and have in any case only investigated these cases qualitatively myself. More careful and quantitative studies, to confirm or deny the conclusion that cognitive efficiencies are high, would be very worthwhile. Meanwhile, there is also an evolutionary plausibility argument to believe that efficiencies are high.
70. Suppose some important cognitive system had low efficiency, below 0.8, in most species. This typically gives a fitness deficit of more than 10%, and so a large selection pressure to improve it. If the genetic information required to do so is no more than, say, 1000 bits, the deficit might, by equation (5), be corrected in a few thousand generations. It is implausible that such a deficit, which might be corrected quickly on evolutionary timescales, has persisted for so long in most species.
71. Just as we can measure the efficiency of any animal cognitive system -- relative to the requirement equation optimum -- so we can measure the efficiency of a theoretical model, such as a neural net. To do so, we must first place the function performed by the model in a biological context. Then we can define its cognitive efficiency by lifetime survival probabilities, just as for animal cognition. I have not made detailed evaluations; but to illustrate some of the issues:
(A) Computational models of conditioning, such as the Rescorla-Wagner (1972) model or the Mackintosh (1975) model, have been developed to agree with many features of conditioning. For instance, they can learn an association in a small number of trials, as animals can. Therefore they may well have rather high cognitive efficiencies.
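As a concrete reference point, the Rescorla-Wagner (1972) update rule fits in a few lines, and reproduces blocking, one of the conditioning phenomena listed earlier. The parameter values here are illustrative, not fitted to data:

```python
def rescorla_wagner(trials, alpha=0.3, beta=1.0, lam=1.0):
    """Rescorla-Wagner (1972) rule: on each trial, every present cue's
    associative strength V moves toward the prediction error
    (lam - summed prediction of all present cues) at rate alpha * beta."""
    V = {}
    for cues, reinforced in trials:
        prediction = sum(V.get(c, 0.0) for c in cues)
        error = (lam if reinforced else 0.0) - prediction
        for c in cues:
            V[c] = V.get(c, 0.0) + alpha * beta * error
    return V

# Blocking: pretraining on cue A alone leaves little prediction error
# for cue B to absorb during later A+B compound trials.
V = rescorla_wagner([(('A',), True)] * 20 + [(('A', 'B'), True)] * 20)
print(V['A'] > 0.8, V['B'] < 0.2)  # -> True True
```

Note that the association to A is learnt in a handful of trials, consistent with the rapid learning the text attributes to such models.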
(B) By contrast, simple neural net models (such as 3-layer nets with back-propagation learning) often need many thousands of training examples to learn some pattern. This is much slower than typical animal learning, and in a biological context is probably much too slow to be useful; this results in very low cognitive efficiency. For other tasks (e.g., memory with associative retrieval) neural nets may achieve high efficiency (Rumelhart, 1991).
72. Calculating the efficiency of a cognitive model gives a criterion to evaluate it. If efficiencies are in the 0.6 - 0.9 range, then the model, with minor "tuning" by selection, may be the basis of animal cognition. If, on the other hand, its efficiency is low (0 - 0.3) the model as it stands is probably not adequate; if nevertheless the model was an "early" form of cognition, it seems likely that refinement of the model by evolution would have altered its character completely.
73. An alternative to evaluating and refining current cognitive models, using the efficiency criterion, is to construct a model expressly to have high efficiency, guided by the requirement equation -- and then to ask how parts of the model can be built as neural nets. I have taken this approach to primate social intelligence (Worden, 1996) and it appears feasible for many domains.
74. I have shown that there is a best possible form of cognition for any species, and given a mathematical statement, in the requirement equation, of the results of optimal cognition.
75. The equation has a rich mathematical structure, which can be analysed to give insights in any domain of cognition; it justifies key assumptions of cognitivism, such as the use of internal representations and symbol systems.
76. Its results can be approximately calculated, at feasible computational cost; there is no computational reason why animal brains cannot come close to the optimum.
77. We know that animal brains are not precisely optimal. Are they close to the optimum, or do they miss by a long way? Whatever answer emerges, it can hardly be boring:
(A) If animal cognition is very close to the optimum (e.g., E > 0.9) then the requirement equation provides a precisely defined target for theoretical cognitive models to aim at.
(B) If efficiencies are low (say, E < 0.7) why have large deficits in fitness persisted for so long, when (it seems) better fitness is possible? Why has no species ever made the breakthrough?
78. From observed animal performance in several domains, I tentatively conclude that animal cognition comes close to the optimum of the requirement equation. This has important implications for cognitive models (some of which have low cognitive efficiencies). The conclusion needs to be confirmed or refuted by more thorough studies.
79. Doing such studies will make links between ecology, ethology, cognitive science and neuroscience. By examining the probabilities of events in the habitat (ecology) we can predict what an animal with optimal cognition would do; we can compare this with observed animal behaviour (ethology) and with what animals with a particular cognitive mechanism would do (cognitive science and neuroscience). We need to bring all these viewpoints together to progress in our understanding of the brain.
80. For two key results in this paper I have asserted that "it can be proved" without (for space reasons) giving the proof. The two results are (a) that the Requirement Equation gives the best possible fitness, and (b) that under certain conditions, brain evolution towards this optimum cannot get stuck at any local maximum. I am aware that the lack of these proofs is a bit unsatisfactory; however, the proofs are not long, and are available from me on request.
I thank the referees for helpful comments on earlier drafts of the paper.
Anderson, J.R. (1990) The Adaptive character of thought, Lawrence Erlbaum Associates.
Barlow, H.B. (1978) The Efficiency of Detecting Changes in Random Dot Patterns, Vision Research, 18, 637-50.
Barlow, H.B. (1980) The Efficiency of Perceptual Decisions, Philosophical Transactions of the Royal Society (London), B290, 71-82.
Brooks, R.A. (1991) Intelligence without Representation, Artificial Intelligence, 47: 139-159.
Burgess, A.E. (1990) High level Visual Detection Efficiencies, in Vision: Coding and Efficiency, Ed. Blakemore, C., Cambridge University Press, 1990.
Clancey, W.J. (1993) Situated Action: a Neuropsychological Interpretation, Cognitive Science, 17:87-116.
Churchland, P.S. and T.J. Sejnowski (1988) Neural representations and neural computations, in L. Nadel ed. Neural Connection and Mental Computation, MIT Press, Cambridge MA.
Collett, T.S., B.A. Cartwright and B.A. Smith (1986) Landmark Learning and Visuo-spatial Memories in Gerbils. J. Comp Physiol. A 158:835-851.
Dickinson, A. (1980) Contemporary animal learning theory, Cambridge University Press.
Dupre, J. (1987) The Latest on the Best, MIT Press, Cambridge, MA.
Edelman, G.M. (1992) Bright Air, Brilliant Fire: On The Matter of the Mind, New York: Basic Books.
Fodor J. and Z. Pylyshyn (1988) Connectionism and Cognitive Architecture, Cognition 28, 3-71.
Gallistel, C.R. (1990) The organisation of learning, MIT Press, Cambridge, MA.
Gould, S.J. and Lewontin R.C. (1979) The Spandrels of San Marco and the Panglossian Paradigm: A Critique of the Adaptionist Program. Reprinted in Conceptual Issues in Evolutionary Biology: an Anthology, Ed. E. Sober, MIT Press 1984.
Gould, S.J. (1980) The Panda's Thumb, W.W. Norton, New York.
Hayes, P.J., K.M. Ford and N. Agnew (1994) On Babies and Bathwater -- a Cautionary Tale, AI Magazine, Fall 1994, 15-26.
Johnson-Laird, P.N. (1983) Mental Models, Cambridge University Press, Cambridge.
Lewontin, R.C. (1987) The Shape of Optimality, in The Latest on the Best, ed. J. Dupre, MIT Press, Cambridge, MA.
Machine Learning (1992) Special Issue on Reinforcement Learning, Vol 8 no. 3/4.
Mackintosh, N.J. (1975) A Theory of Attention: Variations in the Associability of a Stimulus with reinforcement. Psychological Review, 82, 276-98.
Marr, D. (1982) Vision, W.H. Freeman.
Maynard Smith, J. (1982) Evolution and the Theory of Games, Cambridge University Press.
Morris, R.G.M., P. Garrud, J.N.P. Rawlins and J. O'Keefe (1982) Place Navigation Impaired in Rats with Hippocampal Lesions. Nature 297: 681-683.
O'Keefe, J. and L. Nadel (1978) The Hippocampus as a Cognitive Map. Clarendon Press, Oxford.
Rescorla, R.A. and A.R. Wagner (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In A.H. Black and W.F. Prokasy (eds) Classical Conditioning II: Current Research and Theory, pp 64-99, Appleton-Century-Crofts, New York.
Rumelhart, D.E. (1991) The architecture of mind: a connectionist approach, in M.I. Posner, ed., Foundations of Cognitive Science, MIT Press, Cambridge MA.
Sejnowski, T. and P.S. Churchland (1991) Brain and Cognition, in M.I. Posner, ed., Foundations of Cognitive Science, MIT Press, Cambridge MA.
Shepard, R.N. (1984) Ecological Constraints on Internal Representation: Resonant Kinematics of Perceiving, Imagining, Thinking and Dreaming. Psychological Review 91: 417-447.
Staddon, J.E.R. (1987) Optimality Theory and Behaviour, in The Latest on the Best, ed. J. Dupre, MIT Press, Cambridge, MA.
Stephens, D.W. and J.R. Krebs (1986) Foraging theory, Princeton University Press.
Van Valen, L. (1973) A New Evolutionary Law. Evolutionary Theory, 1, 1-30.
Vera, A.H. and Simon, H.A. (1993) Situated Action: A Symbolic Interpretation, Cognitive Science, 17:49-59.
Wehner, R. and Srinivasan, M.V. (1981) Searching behaviour of desert ants, genus Cataglyphis (Formicidae, Hymenoptera). Journal of Comparative Physiology, 142, 315-338.
Worden, R.P. (1992) Navigation by fragment-fitting: a theory of hippocampal function, Hippocampus 2, 165-188.
Worden, R.P. (1995) A Speed Limit for Evolution (Journal of Theoretical Biology, 176, 137-152).
Worden, R.P. (1996) Primate Social Intelligence (to be published in Cognitive Science).
1. These considerations apply only to innate information in the brain; of course much information can accumulate within an animal's lifetime by learning. Equation (5) applies to the innate information G which is needed to make learning efficient or unnecessary, and the cost in fitness X which results from having to learn, or the slowness of learning. Brains can learn much faster if they are well-attuned to the type of regularity being learnt.