The computational building blocks of biological information processing systems are highly interconnected networks of simple units with graded overlapping receptive fields, arranged in maps. In view of this basic constraint, it is proposed that the present stage in the study of cognition should concentrate on gaining understanding of the cognitive system at the level of the distributed computational mechanism. The model of script understanding introduced in Miikkulainen's (1993) book appears promising, both because it treats seriously the question of the architecture of the language processor, and because its architectural features resemble those used in modeling other cognitive modalities such as vision.
1. Imagine that you meet an intelligent Martian whose anatomy allows you to peek into parts of its brain. After some experimenting, you discover that the alien's brain has a transparent "vision organ" whose workings you can explore and even understand. You congratulate yourself mentally in anticipation of an ultimate reductionist victory in the understanding of (alien) cognition, only to find out that the alien's "language organ" is, alas, opaque. Still, there is no need to despair: it turns out that, as any xenobiologist knows, the Martian brain is anatomically uniform, and there is good reason to believe that the same basic architecture supports all perceptual and cognitive (and motor!) functions.
2. Obvious parallels can be drawn between this imaginary scenario and the current situation in cognitive science. First, the similarities in the architecture of the various cortical areas seem to outweigh the differences between the different areas (Braitenberg, 1977). Second, thanks to multidisciplinary studies combining computational modeling with psychophysical research and animal neurophysiology, the current understanding of vision is at a stage where integrated theories of higher visual functions such as recognition can be advanced and tested. This makes the workings of the "vision organ" of primates clearer, at least in comparison with the workings of other cognitive faculties.
3. What are the implications of the uniformity of cortical architecture, and of the anatomical and functional knowledge gained through the advances in vision research, for the development of a theory of human language? It is important to realize that a parallel between vision and language is only useful insofar as it can be shown that both vision and language are supported DIRECTLY by the fine-grained distributed architecture found in the brain, and that neither of the two functions needs a "virtual machine" of the kind presumably involved in the human ability to do abstract logic.
4. In vision, the efforts to understand what has been regarded until recently as mere "implementational details" intensify as the architecture-oriented approach proves more and more productive and theories begin to emerge that are phrased in terms of receptive fields, maps, and other notions from the lexicon of neurobiology. In linguistics, the "connectionist" challenge is being met differently, if at all. Major drives towards conceptual change here are the difficulty in imagining how abstract rules can be embodied in a brain-like architecture and the lack of empirical evidence for the psychological reality of such rules (NOTE #1). The straightforward approach to the bridging of this gap, based on the notions of a virtual machine and of simulated symbolic computation (Ajjanagadde and Shastri, 1991), may well prevail in the end, but if it does, then few, if any, insights from vision would be applicable to the understanding of language.
5. Another, subsymbolic, approach, of which Miikkulainen's (1993) DISCERN system is an example, aims to show that complex linguistic tasks can be performed using the very same machinery used for years in models of vision: receptive fields, maps, etc. Despite the understandably limited extent of DISCERN's domain of operation, the success of the system in the task it faced bodes well for the idea of grounding a unified cognitive science in biological reality.
6. Biological considerations lead to analogies between vision and language that are quite different from those usually found in the literature. For example, Carroll and Bever (1976, p.339) stress the discrepancy between the two-dimensional nature of the retinal stimulus and the three-dimensional nature of the world, and liken the knowledge of perspective geometry (presumably necessary to recover the 3D world from the 2D images) to the knowledge of the syntactic rules (presumably necessary for the recovery of linguistic structure). A similar preoccupation with perspective geometry is found in Miller (1962), who writes that "just as the student of space perception must have a good understanding of projective geometry, so a student of psycholinguistics must have a good understanding of grammar" (p.756).
7. One may observe, however, that perspective geometry is the least of the problems encountered in space perception in biological vision, the real problem being the reduction of dimensionality of representation from the million or so dimensions in each optic nerve down to a manageable number (and, eventually, to 3D space, which is, in a sense, a construct and not a percept; cf. Poincare, 1913). The lesson of this observation is that insights should be sought in those domains where architectural constraints are likely to impose special limitations on the power of biological information processing. The present commentary raises three issues in connection with which analogies between theories of vision and language may prove especially fruitful. These issues -- similarity, binding, and generalization -- are discussed in the following three sections separately, although they can also be considered as different aspects of the same problem: the representation of structure.
8. The word "similarity" appears eight times in Miikkulainen's (1994) Precis, and many more times in the book itself. The prominence of this concept is well justified: as the author notes on p.19, the power of parallel distributed processing systems lies in their natural ability to treat similar inputs in a similar manner (note that this characteristic is not exclusive to the PDP approach; see Omohundro, 1987). There are, however, two main problems with invoking similarity as an explanatory concept.
9. First, similarity is vague in that it admits many equally valid definitions, each of which depends on a particular choice of features and feature weights. Fortunately, as far as simple ("surface-level") tasks such as lexical categorization of words are concerned, representations amenable to further similarity-based processing can be generated by statistical means from a stream of unparsed and not always grammatical English (e.g., the Internet newsgroup traffic; Finch and Chater, 1991).
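The statistical route to similarity-ready representations can be illustrated with a minimal sketch. It is not Finch and Chater's actual procedure: the toy corpus, the one-word context window, and the function names are all assumptions made for illustration. Each word is described by counts of its immediate left and right neighbors, and two words are compared by the cosine of their count vectors.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens):
    """Map each word to counts of its immediate left and right neighbors."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if i > 0:
            vectors[word][("L", tokens[i - 1])] += 1
        if i < len(tokens) - 1:
            vectors[word][("R", tokens[i + 1])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus; a real stream would be far larger and noisier.
corpus = "the dog sleeps . the cat sleeps . the dog eats . the cat eats .".split()
vecs = context_vectors(corpus)

sim_nouns = cosine(vecs["dog"], vecs["cat"])      # same distributional contexts
sim_mixed = cosine(vecs["dog"], vecs["sleeps"])   # no shared contexts
```

In this toy corpus "dog" and "cat" occur in identical contexts and so come out as maximally similar, whereas "dog" and "sleeps" share no context at all. The outcome depends entirely on the choice of context features, which is precisely the vagueness noted above.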
10. The second problem with similarity is that it is often very difficult to compute at the "deep" level, as opposed to the surface level of the input structure. For example, in visual recognition of human faces, surface (pixel-based) similarity between images of the same individual under different illuminations may be smaller than the similarity between images of different individuals under the same illumination (Adini et al., 1993). In this case, processing by a bank of center-surround receptive fields helps to compute the true deep-level similarity between faces (Weiss and Edelman, 1993).
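A toy illustration of this point follows. It is a sketch under strong assumptions, not the method of Weiss and Edelman: the "faces" are hypothetical one-dimensional luminance profiles, illumination is modeled as an additive linear ramp (real illumination effects are considerably more complex), and the center-surround field is reduced to a three-tap zero-sum kernel.

```python
def center_surround(signal):
    """Center minus the mean of its two neighbors (zero-sum kernel, valid region only)."""
    return [signal[i] - 0.5 * (signal[i - 1] + signal[i + 1])
            for i in range(1, len(signal) - 1)]

def distance(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical luminance profiles of two individuals.
face_a = [0, 2, 0, 3, 0, 2, 0, 1]
face_b = [0, 2, 0, 3, 0, 1, 0, 2]

# Assumed illumination model: an additive linear ramp (toy side lighting).
ramp = [float(i) for i in range(len(face_a))]
a_lit = [x + r for x, r in zip(face_a, ramp)]

# Surface (pixel-level) distances.
raw_same = distance(face_a, a_lit)    # same individual, different illumination
raw_diff = distance(face_a, face_b)   # different individuals, same illumination

# Distances after center-surround filtering.
cs_same = distance(center_surround(face_a), center_surround(a_lit))
cs_diff = distance(center_surround(face_a), center_surround(face_b))
```

At the pixel level the two images of the same individual are farther apart than the images of the two different individuals. After filtering the ordering is reversed, because a zero-sum symmetric kernel cancels the additive ramp exactly. That exact cancellation is a property of the assumed illumination model and should not be expected of real images.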
11. Similarity-based classification of isolated words, considered outside normal sentence context, may be likened to the recognition of isolated objects in vision, with inflection being to some extent analogous to the apparent deformation and partial self-occlusion of 3D shapes, precipitated, for example, by rotation in depth. There are indications that this visual task can be learned in a sufficiently simple architecture, based on interpolation from examples (i.e., familiar views) by a collection of receptive fields (Poggio and Edelman, 1990). Moreover, an extension of the same basic architecture may be capable of supporting CLASSIFICATION of complex objects (that is, making sense of novel instances of a number of object classes; see Edelman, 1994).
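The flavor of recognition from familiar views can be conveyed by a simplified sketch. It is not the Poggio and Edelman network itself: the weights of the view-tuned units are fixed at one here rather than learned, and the objects, the orthographic projection, and the width of the receptive fields are hypothetical choices.

```python
from math import cos, sin, exp, radians

def view(points, angle_deg):
    """Projected x-coordinates after rotating (x, z) vertices about the vertical axis."""
    t = radians(angle_deg)
    return [x * cos(t) + z * sin(t) for x, z in points]

def receptive_field(v, center, sigma=0.5):
    """Gaussian response, maximal when the input view equals the stored view."""
    d2 = sum((a - b) ** 2 for a, b in zip(v, center))
    return exp(-d2 / (2 * sigma ** 2))

def object_unit(v, stored_views):
    """Unit-weight sum over view-tuned fields (a full scheme would learn the weights)."""
    return sum(receptive_field(v, c) for c in stored_views)

# Hypothetical wire-like objects given by (x, z) vertex coordinates.
object_a = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
object_b = [(1.0, 0.0), (0.0, -1.0), (-1.0, -0.5)]

# Familiar views of object A at three rotations in depth.
familiar = [view(object_a, angle) for angle in (0, 30, 60)]

# Responses to a novel 45-degree view of A and of a different object B.
response_same = object_unit(view(object_a, 45), familiar)
response_other = object_unit(view(object_b, 45), familiar)
```

The novel view of the familiar object falls between two stored views and therefore activates their overlapping receptive fields, while the view of the other object lies far from all of them and evokes a much weaker response.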
12. In language processing, the object classification task parallels the parsing of sentences that are novel, yet simple (e.g., right-branching); the processing of complex sentences (e.g., those with center-embedded clauses) is more like the understanding of a scene in which objects may occlude each other or form juxtapositions that encourage false segmentation. Can the similarity-based approach to language cope with complex inputs as well? The solution to the problem of general scene understanding in vision can be greatly facilitated by cues not directly connected to shape, such as distinctive color, texture, or common motion of objects, all of which help to segment the scene into the proper constituent shapes. It would be interesting to consider the potential counterparts of scene segmentation cues in language processing. Some of the possible sources of segmentation information are prosody (in the spoken language), punctuation marks, lexical information associated with the individual words (such as knowledge of their syntactic categories), and semantic information, which acts in a top-down fashion.
13. Segmentation is especially useful if its outcome can be represented in a form that makes the various segments of the input explicit. One possibility is to represent the resulting information in a population of units, each of which is selective along several distinct dimensions. For example, units in the visual cortical area V4 in the monkey (a major way station in the shape processing pathway) respond selectively to location in the visual field (this corresponds to the standard notion of receptive field), to shape (Kobatake and Tanaka, 1994), and to the direction of stimulus motion, with the latter selectivity conjectured to play a role in Gestalt-like segmentation by common fate (Cheng et al., 1994). An alternative approach is to rely on physically separate populations to represent the segmented information, for example, by channeling it into distinct maps, each of which preserves information regarding the location of the stimulus in the visual world (Treisman, 1985). Map-like mechanisms are widely used in the DISCERN system, and are also discussed below in the context of binding by retinotopy.
14. A consideration of the deep-structure problem in a more general context leads naturally to the issue of binding. Somewhat paradoxically, it is segmentation that introduces the need for binding. For example, in multiple-map segmentation the binding problem arises because information pertinent to a given object is distributed across a number of distinct feature maps.
15. In visual recognition, the need for binding is especially severe in those theories that call for explicit representation of object parts and their spatial relationships. According to these theories, the represented object is first taken apart by describing it in terms of generic primitives. This step later gives rise to the problem of putting the whole object together again.
16. An analogy between shape vision and language which draws a parallel between generic shape parts on the one hand and phonemes in the spoken language on the other has been suggested by Biederman in his theory known as Recognition By Components or RBC (Biederman, 1987; it should not be surprising that an implementation of the RBC theory relies heavily on binding; see Hummel and Biederman, 1992). One can also imagine here an analogy between shape parts and words in a sentence. When a complex shape is represented in terms of its parts, the parts must be bound together according to the proper spatial relationships, not unlike words in a sentence that must observe certain syntactic rules for the sentence to be well formed.
17. In contrast with Biederman's model, it has recently been claimed that the representation of visual shape requires no special binding mechanism over and above the coactivation of shape-specific feature detectors (Edelman, 1994). This claim is based on the idea (whose roots may be traced to Putnam, 1988) that the visual world is its own best representation, and, in particular, that the perceived space is a convenient substrate to which visible entities may be bound by virtue of the retinotopy of their representation in the brain.
18. Retinotopy may help solve the binding problem as follows: a detector that is selective for a complex shape will signal the presence of the shape parts only when these are correctly arranged, simply because the response of such a detector depends on a conjunction of responses of the individual, properly positioned part detectors. The combinatorial problems usually cited in connection with this approach are much milder than they seem if channel coding (Snippe and Koenderink, 1992) is used instead of unit coding.
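The economy of channel coding can be sketched as follows. This is an illustration only and does not reproduce the analysis of Snippe and Koenderink: the five Gaussian channels, their width, and the center-of-mass readout are assumptions. A position is encoded by the graded activity of a few broadly tuned, overlapping channels and is read out with a precision much finer than the channel spacing, whereas unit coding would require a separate detector for every distinguishable position.

```python
from math import exp

# Assumed configuration: five broadly tuned channels spanning positions 0 to 10.
CENTERS = [0.0, 2.5, 5.0, 7.5, 10.0]
SIGMA = 1.5

def encode(x):
    """Graded responses of the overlapping Gaussian channels to position x."""
    return [exp(-(x - c) ** 2 / (2 * SIGMA ** 2)) for c in CENTERS]

def decode(activity):
    """Center-of-mass readout: activity-weighted mean of the channel centers."""
    total = sum(activity)
    return sum(a * c for a, c in zip(activity, CENTERS)) / total

# Positions in the interior of the range; none coincides with a channel center.
estimates = {x: decode(encode(x)) for x in (3.0, 4.2, 6.1)}
```

For these interior positions the estimates fall well within a tenth of a unit of the true values, although adjacent channels are 2.5 units apart. The center-of-mass readout becomes biased near the ends of the range, so the sketch says nothing about behavior there.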
19. In language processing, the need for binding may seem to be unavoidable: after all, sentences ARE strings of reusable generic entities (words) bound together in a certain temporal structure that obeys abstract rules of syntax. In view of this commonplace observation and of the difficulties in implementing binding in neural hardware, any indication that an analogue of retinotopy can help represent linguistic and discursive structure would be very welcome. The approach taken by Miikkulainen provides just such an indication: DISCERN does a kind of space-based binding, by transforming a temporally extended input (a sequence of sentences, each of which, in turn, is a grammatical sequence of words) into a spatially extended inner representation in which the entire temporal structure is represented as a timeless "snapshot". Conceptually, this approach is like the recognition of a temporal entity such as horse gait through a consideration of the shape of the world line of the animal in four-dimensional spacetime.
20. The concept of representation SPACE is used in DISCERN in more than one literal way. Whereas trading off time for space lets DISCERN bind a sequence of tokens together, storing instances of structures in a spatially organized map allows the system to bind a value (such as the name of an agent) to a slot (such as the role performed by the agent in the script). As noted in the book, this capability is not completely general, in that the slot itself can only be an instance of a number of predefined slot types, determined by the system's prior experience (i.e., by the stories to which it has been previously exposed).
21. Although this is considered to be a limitation of the system, it is not altogether clear whether full-scale language processing does require an unlimited capability of constructing abstract structures through binding (see the discussion in Henderson and Marcus, 1994). Moreover, simple experiments with the perception of visual shapes made up from unusual components indicate that the same limited but useful binding "trick" used by DISCERN may be related to basic perceptual principles. Specifically, people are able to perceive familiar shapes in visual patterns even when these are built of components that are neutral in that they do not really belong to the whole (as in the hammer and sickle formed by athlete bodies in a revolution anniversary parade in a Communist country), or parts that are downright bizarre (as in the painting "The Four Seasons" by Giuseppe Arcimboldo, showing four human heads composed of fruits, vegetables, tree trunks, etc.; NOTE #2).
22. The perception of a coherent whole in a collection of parts which do not naturally belong together is presumably possible because of the perceiver's prior familiarity with the entire shape, and because the parts do resemble, at some level, the constituents of the true whole, as it is represented in memory. This assumption is supported by considering a blurred version of, say, one of the portraits by Arcimboldo: in such an image, the face-like quality of the whole persists, although the details (fruits, tree trunks, etc.) become unrecognizable.
23. To summarize, the potential of the instance-based approach to structure processing in all branches of cognition is worth experimental exploration, as indicated by the developments in case-based reasoning (Stanfill and Waltz, 1986) and learning (Aha et al., 1991). In vision, there are promising preliminary results (Edelman, 1994), although much research is still needed to fulfill this promise. In language, too, this approach is still in its infancy. The greatest challenge for instance-based systems such as DISCERN lies in the processing of extensive corpora of normal text and in coping with nonsense sentences and sentences partly composed of "correctly" inflected nonwords. Hence this approach should ultimately be judged by its scalability up to real-life parsing tasks, and by its performance in various borderline cases of parsing which serve as the litmus test of the flexibility of the human language processor.
24. It is a matter of consensus that prior experience with strings like "I sleep quietly" and "In May the leaves are light green" affects our comprehension both of regular sentences about sleep and about leaves, and of nonsense like "Colorless green ideas sleep furiously". In orthodox transformational grammar (TG) linguistics, whose focus is on human language competence, the "prior experience" is believed to be represented in terms of explicit abstract rules. Developments in several branches of cognitive science (notably, in vision research) suggest alternative theories of competence in structure processing, some of which center on the idea of instance-based representations. Because one can decide which competence theory is a better theory of the brain only by embodying it in a model of performance, the really interesting question about language is, ultimately, empirical: to what extent can human performance be matched by a system that is neither a Universal Turing Machine (required by the TG theories) nor a connectionist approximation thereof.
25. Because they foster a view of the cognitive system as a black box that follows abstract rules, theories of the Turing Machine variety are notoriously difficult to test: the universality of the computational mechanism they postulate allows processing complexity to be traded off for the complexity of representation. When considered in isolation from issues of performance and of mechanism, the question "Are rules of syntax explicitly represented in the human language system?" can only lead to fruitless debates, such as the great imagery controversy that raged in cognitive science about twenty years ago (Pylyshyn, 1973). In fact, one of the consequences of the black box approach to cognition was the claim that the issue of representation is, in principle, empirically undecidable (Anderson, 1978).
26. The hope of resolving the issue of structure representation in the brain stems from combining theoretical (computational) considerations with hard data from experimental psychology and neurobiology. Indeed, some of today's more controversial issues in understanding brain function concern, appropriately enough, basic neurobiological questions (such as whether neurons are spike integrators or detectors of spike coincidences) that, at the same time, have direct and crucial implications for theories of cognition that rely on binding mediated by spike timing (Shadlen and Newsome, 1994).
27. In conclusion, I would like to propose that at the present stage in the study of cognition it is more productive to concentrate on unraveling the mechanisms that comprise the cognitive system and on finding theoretically meaningful patterns in their performance than to argue about disembodied competence. If your computational model performs right and if you know exactly how it does it (as is the case with DISCERN), you can develop it into a testable theory built on a solid foundation of facts. On the other hand, if you develop a theory of competence that feels right, but consistently runs into trouble under the constraints of the available computational substrate, then the theory is irrelevant as a model of performance, no matter how elegant it is. In biological information processing, the only available computational substrate is, basically, a highly interconnected network of simple units with graded overlapping receptive fields, arranged in maps linked by patchy projections; if a theory of competence cannot easily cope with this constraint, too bad for the theory.
#1. Following is an excerpt from a book chapter by L. Henderson expressing a similar view regarding rules and regularities in language:
Much of the linguistic ingenuity that has gone into constructing elegant and compact descriptions of the lexical knowledge the language user is required to have takes the form of law-like statements [...]. It seems unreasonable to deny that such organized systems of rules and their qualifiers offer a cogent description of the regularities of a language, but it is clearly a quite different matter to establish that this form of representation is one that is felicitously matched to the characteristics of the human processor [...]. People exhibit skill in solving problems in a variety of domains which are governed by formal, abstract rules. Until quite recently it has seemed natural, perhaps even unavoidable, to assume that they succeed to the extent that their behavior is controlled by mental representations of these abstract rules. However, attempts to establish the psychological reality of these rules have been largely unsuccessful in areas as diverse as syllogistic reasoning and transformational syntax (from Henderson, 1989, pp. 382-383).
#2. An electronic version of Arcimboldo's painting mentioned in paragraph 21 can be obtained by anonymous ftp from eris.wisdom.weizmann.ac.il (126.96.36.199), in directory /pub/Art. A preprint of (Edelman, 1994) can be found on the same host, in /pub/mam.ps.Z.
Adini, Y., Y. Moses and S. Ullman (1993) Face recognition: the problem of compensating for changes in illumination direction, Weizmann Institute of Science CS-TR 93-21.
Aha, D.W., D. Kibler and M.A. Albert (1991) Instance-based learning algorithms, Machine Learning, 6:37-66.
Ajjanagadde, V. and L. Shastri (1991) Rules and variables in neural nets, Neural Computation, 3:121-134.
Anderson, J.R. (1978) Arguments concerning representations for mental imagery, Psychological Review 85:249-277.
Biederman, I. (1987) Recognition by components: a theory of human image understanding, Psychological Review 94:115-147.
Braitenberg, V. (1977) On the texture of brains, Springer-Verlag, New York.
Carroll, J. and T. Bever (1976) Sentence comprehension: a case study in the relation of knowledge and perception, in Language and Speech, vol. 7 of Handbook of Perception, E. Carterette and M. Friedman, eds., New York: Academic Press.
Cheng, K., T. Hasegawa, K. Saleem, and K. Tanaka (1994) Comparison of neuronal selectivity for stimulus speed, length, and contrast in the prestriate visual cortical areas V4 and MT in the macaque monkey, J. of Neurophysiology, 71, in press.
Edelman, S. (1994) Representation, Similarity, and the Chorus of Prototypes, Minds and Machines, in press.
Finch, S. and N. Chater (1991) A hybrid approach to the automatic learning of linguistic categories, manuscript, available via ftp from archive.cis.ohio-state.edu as /pub/neuroprose/finch.hybrid.ps.Z
Henderson, J. and M. Marcus (1994) Description Based Parsing in a Connectionist Network, U. of Pennsylvania TR IRCS 94-12.
Henderson, L. (1989) On mental representation of morphology and its diagnosis by measures of visual access speed, pp. 357-391, in Lexical Representation and Process, W. Marslen-Wilson, ed., Cambridge, MA : MIT Press.
Hummel, J.E. and I. Biederman (1992) Dynamic binding in a neural network for shape recognition, Psychological Review 99:480-517.
Miller, G. (1962) Some psychological studies of grammar, American Psychologist 17:748-762.
Miikkulainen, R. (1993) Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon and Memory. Cambridge, MA: MIT Press.
Miikkulainen, R. (1994) Precis of: Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon and Memory. PSYCOLOQUY 5(46) language-network.1.miikkulainen.
Omohundro, S.M. (1987) Efficient algorithms with neural network behavior, UIUCDCS report R-87-1331, Univ. of Illinois at Urbana-Champaign.
Poggio, T. and S. Edelman (1990) A network that learns to recognize three-dimensional objects, Nature 343:263-266.
Poincare, H. (1913/1963) Mathematics and Science: Last Essays, translated by J. W. Bolduc, New York: Dover.
Putnam, H. (1988) Representation and reality, MIT Press, Cambridge, MA.
Pylyshyn, Z. (1973) What the mind's eye tells the mind's brain: a critique of mental imagery, Psychological Bulletin 80:1-24.
Shadlen, M.N. and W.T. Newsome (1994) Noise, neural codes and cortical organization, Current Opinion in Neurobiology 4:569-579.
Snippe, H.P. and J.J. Koenderink (1992) Discrimination thresholds for channel-coded systems, Biological Cybernetics 66:543-551.
Stanfill, C. and D. Waltz (1986) Toward memory-based reasoning, Comm. of the Assoc. for Computing Machinery 29:1213-1228.
Treisman, A. (1985) Preattentive processing in vision, Computer Vision, Graphics, and Image Processing 31:156-177.
Weiss, Y. and S. Edelman (1993) Representation with receptive fields: gearing up for recognition, Weizmann Institute of Science CS-TR 93-09.