David M.W. Powers (1993) Calm, Chaos and Surprise!. Psycoloquy: 4(36) Categorization (7)

PSYCOLOQUY (ISSN 1055-0143) is sponsored by the American Psychological Association (APA).
Psycoloquy 4(36): Calm, Chaos and Surprise!

CALM, CHAOS AND SURPRISE!
Book Review of Murre on Categorization

David M.W. Powers
Informatique Department
Telecom Paris (ENST)
Paris, France

powers@acm.org

Abstract

Murre has put together an interesting and useful model based on a number of reasonable assumptions. The assumptions are presented in the first chapter and the model in the second. The remainder of the book presents some small experiments showing that the model is useful, and has some psychological validity. The hints in the appendices about implementing parallel nets should be of considerable value, but I have not taken the time to consider them here. Some aspects of Murre's model use techniques which are of theoretical importance in the field of parallelism per se, and both this observation and the discussion in the appendices are evidence of the interrelationship between the field of parallel distributed processing and mainstream parallel computing. Murre is to be praised for the breadth of his examination of this area.

Keywords

Neural networks, neurobiology, psychology, engineering, CALM, Categorizing And Learning Module, neurocomputers, catastrophic interference, genetic algorithms.

I. INTRODUCTION

1.1 Murre's "Learning and Categorization in Modular Neural Networks" is very nearly two books, not by virtue of its size, but in that the chapters which comprise the body of the text discuss a particular neural network model in terms of its psychological plausibility, whilst the appendices (amounting to well over a third of the volume) are devoted to questions of implementation, specific and general.

1.2 There is a tendency to regard appendices as superfluous, or at best as a source of further detail about points raised in the text. In this case, the five B appendices, which deal with general questions relating to implementation, really address an entirely different issue from the book proper, and the reader needs special encouragement to look at what is in them -- particularly if he has used one or another of the implementational techniques, environments or approaches discussed, or, on the other hand, if his ideas about models of neurophysiological processes have remained unrealized or unrealizable in any computational form. Perhaps it would have been better if Murre had included at least the main points of B1 in the text and introduced the content of the remaining appendices in overview form.

1.3 This review, however, will concentrate on the theoretical content of the book, although I will point out at the end some connections to the theory and practice of parallel computing which Murre does not elucidate himself.

II. CALM

2.1 CALM is an acronym for an anagrammatic variant of the title of the book: Categorizing And Learning Module. In both his introduction and his concluding evaluation chapter, Murre makes it clear that modularity, and indeed one particular approach to modularity (which contrasts sharply with that of Fodor 1983, 1988), represents his main thesis and the unique contribution of his work. The terms "categorizing" and "learning" are also not lightly chosen: the particular type of learning that CALM captures is based on the type of categorization performed by self-organizing neural nets, and the type of modularity is inspired by that of the cortical columns (or what might underlie them). Indeed, the stronger claim of CALM is that this sort of learning can achieve more than it is given credit for, and is more plausible than the traditional Neural Net (NN) approaches to learning.

2.2 The problems that existing NN models have, and which existing experiments have demonstrated, lead not only to the selection of this particular class of model, but to a number of enhancements -- most of which have a prior pedigree but have not hitherto been brought together in this combination.

2.3 It is worth considering this motivation for the model in some detail, as it lays the foundation for the experiments and results which follow -- indeed, the subsequent demonstrations are simple experiments designed to support the motivation presented earlier. In particular, various problems are used to justify the introduction of randomly (or chaotically) firing arousal nodes, although there is also a side excursion into a combination with genetic algorithms.

2.4 In general, I like the approach (in part because it accords well with my own in several respects: Powers & Turk 1989, Powers 1992). In addition, the problems identified are significant -- although many alternatives and variants are possible in dealing with them. As this is basically a friendly review, I will try to broaden the perspective taken by the book and to provide further support for the approach, treating the experiments as illustrations of a broader approach rather than criticizing them on their own terms. In the following, I will focus on the problem Murre is seeking to deal with, and his general approach, rather than on the specific instantiations or implementations (CALM, CALSOM, ELAN, etc.).

III. PROBLEMS

3.1 NEGLECT OF LEARNING.

3.1.1 Traditional AI, and in particular traditional approaches to natural language and vision, have tended to leave learning out of the equation. This is not so much for lack of consideration of learning as because of the intractability of the approach (and the spectre of formidable formal impossibility results: Chomsky 1963, Gold 1967, Minsky and Papert 1969). The definition of intelligence has also been a problem. It took AI quite a while to recognize that the real problems of AI lay not in "expert" knowledge, as in solving mathematical equations or playing chess or diagnosing diseases, but in the basics of cognition -- understanding sentences and scenes. Large subfields have built up in this area based on ad hoc approaches, or on arbitrarily chosen psychological or linguistic theories, but they fail to address the dynamics of the problem -- that we are always encountering new words and objects, as well as new combinations thereof.

3.1.2 Seeing how learning algorithms, and in particular the black-box neural network techniques, actually get beyond the daunting formal results and can in practice be used to learn is quite an art in itself. There is a field of machine learning, with its own journal of that name, but it tends not to address even these issues, let alone the place of learning in cognitive science in general.

3.1.3 Traditional connectionist work is conducted in a paradigm which emphasizes associations at a single level and fails to consider how more complex learning systems -- e.g., a system for learning language rather than learning plurals -- can be formed. In fact, the impression we sometimes get is that people are throwing the black box at random problems; if it works, a paper gets published, and otherwise they look for another problem.

3.1.4 What we need is thus some sort of implementable theory of modularity. It is not sufficient just to introduce a couple of dozen more hidden layers and hope for the best. We need an approach which is both efficacious (in the sense that it is theoretically capable of learning what we want it to learn) and efficient (capable of learning it in the time we are willing to wait). Modularity offers an additional bonus: that parallel implementations can maintain a reasonable independence and a well-defined interdependence between modules.

3.2 INTERFERENCE

3.2.1 Another major problem is interference: once a system has learnt something and proceeds to something else, the new learning tends to cause unlearning of the established information. Modularity again has something to contribute here, in that different modules can have different learning coefficients. But part of the problem is the more complex use of existing learnt concepts, and the fact that the system has been receiving over-simplified data up to that point.

3.2.2 At a given level of complexity, some algorithms are not well behaved when items occur with an uneven probability distribution. The use of categorizing, self-organizing nets has distinct advantages in this respect. CALM introduces an additional arousal node which reacts to the surprise value of the input and facilitates learning. The external evidence of arousal due to unexpected events does not necessarily suggest the sort of direct feedback CALM provides: not all unexpected events are worthy of learning, that is, of transfer to long-term memory or incorporation into schemas.
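
As a concrete rendering of the arousal idea, here is a minimal sketch in Python -- my own illustration, not Murre's equations: "surprise" is approximated as the mismatch between an input and its best-matching prototype, and the arousal signal simply scales a winner-take-all Hebbian update.

    import numpy as np

    # Hypothetical surprise-modulated learning rate (illustration only).
    rng = np.random.default_rng(0)
    n_inputs, n_cells = 8, 4
    W = rng.uniform(0.0, 0.1, size=(n_cells, n_inputs))  # prototype weights

    def step(x, W, base_rate=0.05, arousal_gain=0.5):
        act = W @ x                          # match of each cell to input x
        winner = int(np.argmax(act))
        # Surprise: even the best cell matches poorly -> high arousal.
        cos = act[winner] / (np.linalg.norm(W[winner]) * np.linalg.norm(x) + 1e-9)
        surprise = 1.0 - cos
        rate = base_rate * (1.0 + arousal_gain * surprise)
        W[winner] += rate * (x - W[winner])  # move the winner toward x
        return winner, surprise

    x = rng.random(n_inputs)
    for _ in range(5):
        winner, surprise = step(x, W)
        print(winner, round(surprise, 3))    # surprise falls as x is learnt

Note that such a scheme boosts learning for every surprising input; the reservation above is precisely that not every surprising event deserves this treatment.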

3.2.3 The other feature of the arousal node in CALM is that it is not completely deterministic. In the original self-organizing systems (Turing 1952, von der Malsburg 1973), random initial values provided the ability to break ties. The most important property of such networks, though, was the allocation of neural resources in (not necessarily linear) proportion to the relative frequency of each input (or range of inputs) -- with correspondingly greater capacity for discrimination of the more frequent inputs.

3.2.4 An alternative way of controlling the learning coefficient (plasticity) is in terms of the relative saturation of the module, or by invoking maturational factors (e.g., it is harder to learn a language in one's thirties than as an infant). The saturation effects of a normalizing model can actually make it difficult to learn new material once the old is really well established. Thus, the effect of an arousal node may not be so much to prevent interference as to allow later learning at all (as Murre (p. 5) mentions in relation to ART).

3.2.5 The distinction between elaboration learning and activation learning is a useful one, but it would be interesting to investigate the behaviour of the system in the absence of the arousal node (and, in fact, some of Murre's genetic learning experiments did build just such systems, which still performed acceptably -- the main value of these genetic trials is seeing what use is made of the special features of CALM).

3.2.6 An important function of (all) self-organizing systems is that the allocation of many cells to the same class -- either in one column or in multiple columns -- allows discrimination at a finer level with continued learning. Murre's experiments provide exactly the same number of neuron pairs (cells) as the number of classes to be learned (ten); Murre then discusses the fact that the network doesn't always find all of them. He should be using at least twice as many cells as the number of classes, in a module that needn't have full connectivity but inhibits maximally within a radius of the order of that number of cells. For example, with a radius of 75% of the number of classes (which needn't be uniformly distributed), this allows a full cycle in which each class is recognized by an expected 1.5 cells.
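
The point about cell allocation can be sketched as follows (my construction, not a reimplementation of Murre's experiments): a competitive module with twice as many cells as classes, in which the winner inhibits only cells within a fixed radius, leaving distant cells free to specialize on other classes or finer distinctions.

    import numpy as np

    # Competitive module: 20 cells for 10 classes, ring topology,
    # inhibition only within a local radius (illustration only).
    rng = np.random.default_rng(1)
    n_classes, n_cells = 10, 20
    radius = int(0.75 * n_classes)            # ~7 cells, per the suggestion above
    W = rng.random((n_cells, n_classes))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    def present(c, rate=0.1):
        x = np.eye(n_classes)[c]              # one-hot input for class c
        winner = int(np.argmax(W @ x))
        for j in range(n_cells):
            dist = min(abs(j - winner), n_cells - abs(j - winner))
            if j == winner:
                W[j] += rate * (x - W[j])     # winner moves toward the input
            elif dist <= radius:
                W[j] = np.clip(W[j] - 0.1 * rate * x, 0.0, None)  # local inhibition
        W /= np.linalg.norm(W, axis=1, keepdims=True) + 1e-9

    for _ in range(500):
        present(int(rng.integers(n_classes)))

    # Which cell wins for each class: distinct classes should mostly
    # recruit distinct cells, with spare cells left over for finer splits.
    print(sorted(int(np.argmax(W @ np.eye(n_classes)[c])) for c in range(n_classes)))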

3.3 SUPERVISION

3.3.1 The biggest single problem with the majority of today's symbolic and connectionist learning techniques is that they require supervision. The question of supervision, however, is glossed over a bit too quickly by Murre, and admits of more detailed analysis.

3.3.2 Supervision has two components, providing in general both a source of examples and a source of criticism. This may take the form of providing positive and negative examples, each associated with a true/false rating. This is just a special case of the general categorization problem in which we provide a set of examples with a multicategory taxonomy (e.g., cats, dogs, cows, pigs). The criticism (whether true/false or multivalued) can be provided independently of the examples (e.g., the Marvin system of Sammut and Banerji, 1986, invents its own examples -- intelligently choosing one which tests the latest generalization it has proposed).

3.3.3 Supervision may also be explicit or implicit. When I try to open a door by just pulling on the handle, I get an implicit no -- if the correct procedure for opening the door involves turning, or pushing, or sliding. This is implicit learning provided by the external world in which the learning system is embedded. Similarly, the examples may occur naturally in the world. The difference between explicit and implicit examples is, in practice, largely a matter of the way in which the focus of the learner is brought to bear on the significant aspects of the world.

3.3.4 As an example, backpropagation systems are usually regarded as involving supervised learning, although at times they seem to be doing unsupervised learning (e.g., where input and output are the same and we are just interested in the hidden layers, or in what the net does with patterns which weren't present in the input). Conversely, Rumelhart and McClelland (1986) used a self-organizing net (usually thought of as an unsupervised learning system) to learn associations between the present and past tenses of English verbs.

3.3.5 Thus, it is not the system which determines the question of supervision per se, but the paradigm with which it is used. When a method is mostly used within one paradigm, however -- particularly with respect to the critical aspect of supervision -- it tends to become associated with that specific paradigm.

3.3.6 Yet a system is also supervised when it gets a highly consistent, selected set of examples which reflects only the set of concepts or associations to be learnt. Thus, von der Malsburg (1973) provides a set of images each of which consists of exactly one line, and Murre provides a set each of which consists of a single character. How is this different from providing a set which consists of a sequence of characters, whether it represents one word or two? In both cases, the teacher has done the job of focussing the strategy by filtering out extraneous information (the other letters on the page, the edges of the other pages, the book, the desk, the reader's fingers, etc.). There are paradigmatic examples in nature -- where, for example, different verb forms frequently occur in successive clauses ("Yesterday, you played in the sand, so just play with your toys! Maybe tomorrow you can go and play with Tom, who's playing football today.")

3.3.7 Nonetheless, I agree with Murre that self-organizing systems are the appropriate modular building blocks for at least the lowest levels of cognition. The reason is not so much that backpropagation could not be used as that the backpropagation paradigm amounts to overkill: the job in hand at these lower levels is classification on the basis of the internal structure of the input. In particular, self-organizing systems can do auto-focussing (e.g., in Powers & Turk 1989, Powers 1992): given just sentences as sequences of words or characters, respectively, without preselection, significant classes emerge automatically.

3.4 CONSTRAINTS

3.4.1 Constraints seem to fall naturally under the heading of problems, but I agree with Murre that the problem is actually lack of constraint. This is the main fallacy in the application of the theoretical results of Gold and Chomsky, Minsky and Papert (op. cit.) to cognitive questions: any constraints reflected in the presuppositions have more to do with particular, very simplistic classes of computers than with the constraints imposed by our cognitive mechanisms and our environment.

3.4.2 To clarify this further, there are two fallacies. One is that we are closed systems, and can't just "learn" new structure where there wasn't something "innate" before; the other is that results applying to classes allowing infinite recursion and infinitely long sentences, nestings, parse trees, etc. have anything to do with "languages" which are customized for our finite brains -- and the very finite constraints which show up empirically (cf. Miller's [1956] "magical number seven"). The trap is also related to a focus on the syntactic, and a formal syntactic idea of semantics, failing to recognize the function of syntactic structure or the need for grounding of and through an ontology.

3.4.3 Conversely, it is ironic that the results of Minsky and Papert (1969) were founded on a principle of locality that has largely gone by the board in mainstream connectionism today. It is not that Minsky and Papert were wrong, but that people have since changed the rules of the game, sweeping away any semblance of neurological validity. Minsky and Papert were in fact right, but what does it matter if local nets can't learn parity functions? We have to count things consciously to learn parity; it is not an unconscious or autonomous function of our visual cortex! Similarly, the result concerning connectivity shouldn't be a surprise, given our predilection for puzzles relating to this feat in our newspapers and our children's books and comics (the puzzles are known as mazes). On the other hand, the fact that nets can achieve the rather esoteric notion of convexity should be no surprise either -- we do detect holes and wholeness without conscious computation!

3.4.4 The use of delays and decays also introduces small constants which one would expect to relate to some of these magic-number-seven constraints. The advantage of modelling decay is not just that it provides a form of limited parallel-to-serial conversion (p. 70), but that it opens up a window of past tokens which can be taken into account (perhaps by other modules) in learning more complex, (temporally or sequentially) context-sensitive relations.
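
A toy illustration of the decay point (mine, not the book's implementation): an exponentially decaying activation trace turns a serial token stream into a graded pattern over the vocabulary, a window onto the recent past that other modules could read in parallel.

    import numpy as np

    # Exponentially decaying trace over a toy vocabulary (illustration only).
    vocab = {"the": 0, "cat": 1, "sat": 2}
    decay = 0.5                        # the small constant; sets window width

    trace = np.zeros(len(vocab))
    for token in ["the", "cat", "sat"]:
        trace *= decay                 # older tokens fade geometrically
        trace[vocab[token]] = 1.0      # current token at full strength

    print(trace)                       # [0.25 0.5 1. ]: order is recoverable

With a decay of 0.5 the trace sinks into the noise after a handful of tokens -- just the sort of small window one would expect to relate to the magical-number-seven constraints.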

IV. CHAOS

4.1 The main criticism I would make, as the above indirectly suggests, is that there is more work on, and more support for, Murre's approach than is reflected in his book.

4.2 A module needn't be so small as to require full connectivity. The modifications in Part III deal with a separate plausibility problem, the one-to-one association of inhibitory and excitatory neurons. In fact, a probabilistic connection model can be applied to both classes of neurons, and can model many different lateral interaction functions in an extremely simple way (Powers, 1983). Neither the CALM nor the CALSOM model reflects the usual sombrero (Mexican-hat) lateral interaction profile. Moreover, this deviation from empirical tradition is not justified (except implicitly on the basis of simplicity and sufficiency).
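
For reference, the sombrero is short-range excitation combined with broader, weaker inhibition, commonly modelled as a difference of Gaussians over inter-cell distance. The following generic illustration is mine, not the packing-based derivation of Powers (1983).

    import numpy as np

    # Mexican-hat (sombrero) lateral interaction as a difference of Gaussians.
    d = np.linspace(-10, 10, 21)                  # distance between cells
    excite = np.exp(-d**2 / (2 * 1.5**2))         # narrow excitatory Gaussian
    inhibit = 0.5 * np.exp(-d**2 / (2 * 4.0**2))  # wider, weaker inhibitory one
    sombrero = excite - inhibit

    # Positive near d = 0, negative at mid range, near zero far away --
    # so full connectivity within a module is unnecessary in principle.
    print(np.round(sombrero, 2))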

4.3 There is other work which seeks to model lesion effects and disorders, e.g., that of Gigley (1982) on aphasia.

4.4 The mixing of genetic algorithms and CALM is a brave step, but one which is somewhat poorly motivated: no case is made for either psychological or genetic plausibility. In particular (p. 107), there is no mechanism in view which could explain the connection that "in ontogenesis evolution guides learning, in phylogenesis learning guides evolution": appealing to survival of the fittest is a very vague way to support the latter part of this claim, and there are many missing links in the chain of mechanism. A better motivated example of the connection between innate specification and self-organization is Willshaw and von der Malsburg's (1976) model and their subsequent work.

4.5 The use of a randomized source (the arousal node) in CALM is not completely convincing: it is more convincing in its motivation than in the implementation Murre presents. However, there is evidence for such elements. Nicolis (1991), for example, discusses various phenomena (e.g., in relation to language) and relates the behaviour of such an element to the attractors of chaos theory.

V. SURPRISE

5.1 Many neural models are characterized by parameters expressed to four significant digits or more. The robustness of Murre's CALM simulation is indicated by the variety of different functional versions found (e.g. by genetic learning) as well as by the fact that the parameters in Appendix A are expressed to one significant figure or less!

5.2 The relationship between neural network approaches and statistical approaches is one that has not been adequately explored to date, and Murre is to be commended for his (albeit brief) treatment of it (p. 88). The successes and failures of different connectionist and stochastic approaches may be related to deeper considerations in information theory.

5.3 This issue is not unconnected with that of redundancy (p. 93) and novelty (p. 5). If we model the communication process as a tension between efficacy of communication (getting the message across without corruption) and efficiency of communication (maximizing speed and minimizing memory requirements and transmission time), we get a model in which there is an asymmetry, with a small closed class of very frequent symbols interacting with larger (potentially infinite) open classes of less frequent symbols with a much higher information content (viz. surprise value; Powers, 1991, 1992).
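
The asymmetry is easy to exhibit if "surprise value" is read as Shannon surprisal, -log2 p(w): the few very frequent closed-class symbols carry little information per occurrence, while the many rare open-class symbols carry much more. A toy computation (counts invented for illustration):

    import math
    from collections import Counter

    # Surprisal of each word under a toy unigram distribution.
    counts = Counter({"the": 500, "of": 300, "and": 250,
                      "cat": 5, "sombrero": 1, "morphogenesis": 1})
    total = sum(counts.values())

    for word, n in counts.most_common():
        surprisal = -math.log2(n / total)         # bits of information
        print(f"{word:14s} p={n/total:.4f}  {surprisal:5.2f} bits")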

5.4 These closed classes can be recognized with a totally unsupervised self-organizing, minimal-length-representation, or bigram-based learning algorithm; they have a minimum of semantic content, while pointing to the content "words" they consort with. They also show up in terms of prosodic effects, where the closed-class "words" tend to have less stress, or to be less reliably distinguishable.
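
A minimal bigram-based sketch of the recognition idea (my construction, in the spirit of the unsupervised methods just mentioned): words which combine with many distinct right-hand neighbours behave like closed-class function words, and a simple successor-diversity count already floats them to the top.

    from collections import defaultdict

    # Successor diversity over a toy corpus (illustration only).
    text = ("the cat sat on the mat and the dog lay on the rug "
            "and a bird sat on a branch near the door").split()

    successors = defaultdict(set)
    for w, nxt in zip(text, text[1:]):
        successors[w].add(nxt)

    # Function words ("the", "on", "a", "and") head the ranking.
    for w, s in sorted(successors.items(), key=lambda kv: -len(kv[1])):
        print(w, len(s))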

5.5 The role closed classes play in self-organizing modules also seems to turn the traditional syntactic concept of headship on its head: the closed class units seem to occupy the nuclear syntactic slot whereas traditional theories tend to allocate this slot to the primary carrier of information (viz., the most open class).

VI. CONCLUSIONS

6.1 Murre has put together an interesting and useful model based on a number of reasonable assumptions, some of which, though by no means widely accepted, seem incontrovertible. The assumptions are presented in the first chapter and the model in the second. The remainder of the book presents some small experiments showing that the model is useful, and has some psychological validity (though not an overwhelming amount).

6.2 The hints in the appendices about implementing parallel nets should be of considerable value, but I have not taken the time to consider them here. Some aspects of Murre's model, in particular the loosening of the RV pairs in 7.2.1 (p. 121), use techniques which are of theoretical importance in the field of parallelism per se: this halving technique is essential to the multibutterfly, the only known network structure with constant-degree nodes which allows arbitrary communication (permutations of messages from all nodes to all other nodes) in time proportional to the diameter of the network. Both this observation and the discussion in the appendices are evidence of the interrelationship between the field of parallel distributed processing and mainstream parallel computing. Murre is to be praised for the breadth of his examination of this area; the amount of space I have devoted to it here is no indication of the amount of time the reader should spend on it. To review that material would be to write a whole other review!

REFERENCES:

Chomsky, Noam (1963) Formal Properties of Grammars. In: Handbook of Mathematical Psychology, R. D. Luce, R. R. Bush and E. Galanter (eds.), vol. II, pp. 323-418. Wiley, New York.

Fodor, J. (1983) The Modularity of Mind. MA: MIT Press.

Fodor, J. (1988) Psychosemantics. MA: MIT Press.

Gigley, H. M. (1982) Neurolinguistically Constrained Simulation of Sentence Comprehension: Integrating Artificial Intelligence and Brain Theory. Ph.D. Thesis, University of Massachusetts, Amherst Massachusetts.

Gold, E. M. (1967) Language Identification in the Limit. Information and Control 10: 447-474.

Malsburg, C. von der (1973) Self-Organization of Orientation Selective Cells in the Striate Cortex. Kybernetik 14: 85-100.

Minsky, M. & Papert, S. (1969) Perceptrons. MIT Press.

Murre, J.M.J. (1992) Learning and Categorization in Modular Neural Networks. UK: Harvester/Wheatsheaf; US: Erlbaum

Murre, J.M.J. (1992) Precis of: Learning and Categorization in Modular Neural Networks. PSYCOLOQUY 3(68) categorization.1

Nicolis, John S. (1991) Chaos and Information Processing. World Scientific. Singapore/London.

Powers, David M. W. (1983) Lateral Interaction Behaviour Derived from Neural Packing Considerations, DCS Report No 8317, Department of Computer Science, University of NSW, Australia.

Powers, David M. W. & Turk, Christopher C. R. (1989) Machine Learning of Natural Language, Springer, London/Berlin, December.

Powers, David M. W. (1991) How far can self-organization go? Results in unsupervised language learning. pp.131-136, Proc. AAAI Spring Symposium on Machine Learning of Natural Language and Ontology, DFKI:Kaiserslautern FRG.

Powers, David M. W. (1992) A Basis for Compact Distributional Extraction. THINK 1(2): 51-63. ITK: Tilburg.

Rumelhart, D. & McClelland, J. (1986) On learning the past tenses of English verbs. In: McClelland, J. and Rumelhart, D. (eds.) Parallel Distributed Processing. Vol. 2: Psychological and Biological Models. Cambridge: MIT Press. 216-271.

Sammut, Claude & Banerji, R. (1986) Learning concepts by asking questions. In: R. S. Michalski, J. G. Carbonell and T. M. Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach, vol. 2. Morgan Kaufmann.

Turing, A. (1952) The chemical basis of morphogenesis. Phil. Trans. Roy. Soc. London Ser. B 237:37-72.

Willshaw, David J. & Malsburg, C. von der (1976) How patterned neural connections can be set up by self-organization. Proc. R. Soc. London Ser. B 194: 431-445.

