In his review of Subsymbolic Natural Language Processing (Miikkulainen, 1993, 1994), Edelman (1994) makes several useful analogies between language processing and vision. His main argument is that approaches based on common information processing principles in the brain, such as maps and receptive fields, are more likely to lead to insights into human cognition. I very much agree with this idea, and discuss a few concrete ways in which language processing models can benefit from principles in use in current visual processing models.
1. At first glance, finding common principles and architectures for visual and linguistic processing models seems quite difficult. Although the organization, and to a large extent the function, of biological vision systems is well understood, especially at the low levels, the constraints on linguistic performance come mostly from the top down. It is difficult to ground high-level linguistic performance in low-level processing structures. For example, even if we know that semantic associations of words are helpful in understanding relative clause structure (Stolz, 1967), it is difficult to see how this constraint would translate into architectures based on maps and receptive fields. Eventually it may be possible to do this, but in the current state of the art, we often have to make do with intermediate-level models that we hope retain most of the functional properties of neural implementations, such as soft constraint satisfaction, noise tolerance, and information storage based on correlations in the input.
2. This is the case with the DISCERN story-processing system (Miikkulainen, 1993, 1994) as well. The memory modules of DISCERN (episodic and lexical) were most naturally implemented as maps, and they are perhaps architecturally the most plausible components of DISCERN. On the other hand, the language-processing modules (parsers, generators, and question-answering networks) were constrained by high-level language phenomena such as script-based inference, and must be seen as abstractions. Perhaps later it will be possible to redesign them in terms of more low-level architectures; however, even at this intermediate level we can gain valuable insights into how language processing may be carried out in the brain.
3. In any case, the insights provided by Edelman (1994) on using visual processing principles in natural language processing models are intriguing. I will briefly expand on three of them below: segmentation, deep structure, and retinotopy.
4. Edelman points out that segmenting a visual input image is similar to parsing complex sentences. Both are greatly aided by external cues such as color and common motion (in vision) and prosody and semantics (in language). Perhaps the clearest example in the language domain is processing relative clauses. It has been observed in normal adults as well as in children and aphasics that the interpretation is much easier when the constituents are strongly correlated semantically (Stolz, 1967; Huang, 1983; Caramazza and Zurif, 1976). For example, it is easier to understand "The dog that bit the girl barked" than "The boy who saw the girl laughed."
5. We have recently modeled this process in the SPEC parsing system (Miikkulainen and Bijwaard, 1994), which is an extended version of DISCERN's parser. In addition to the parser network, which maps sequences of words into a case-role representation, SPEC has a RAAM network (Pollack, 1990) that serves as a working memory for embedded constituents, and a controller network that has a high-level view of the parsing process and passes representations between the parser and the memory. When the performance of the system is degraded by adding noise to the memory, the semantic effects become clearly visible. The stronger the semantic coupling between constituents, the fewer errors SPEC produces. For example, SPEC would be much less likely to err on "The dog... barked" above than on "The boy... laughed". This result emerges directly and automatically from the tendency of distributed neural networks to store information in terms of correlations. The syntactic and semantic constraints are both regularities in the input, and the network learns to use both in disambiguating the sentence. Similarly in the visual system, it seems, the recognition process has encoded the fact that color, texture, and common motion are regularities that often indicate a common object, and uses them as soft constraints in segmentation and binding.
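As a rough illustration of how a semantic regularity can act as a soft constraint on a noisy memory, consider the following toy sketch. It is not SPEC: the two-dimensional codes, the compatibility scores, the additive scoring rule, and the names decode_agent and weight are all assumptions made for this example, and the constraint is hand-coded here, whereas a network would have to learn it from correlations in the input.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def decode_agent(trace, codes, compatibility, weight=1.0):
    """Pick the noun that best explains a noisy memory trace.

    score = -(distance to the trace) + weight * (semantic fit with the verb)
    All quantities are hypothetical, hand-picked values, not learned ones.
    """
    def score(word):
        fit = compatibility.get(word, 0.0)
        return -squared_distance(trace, codes[word]) + weight * fit
    return max(codes, key=score)


# The stored agent was the first noun, code (1, 0), but noise has pushed
# the trace slightly closer to the second noun, code (0, 1).
noisy_trace = (0.45, 0.55)

# "The dog that bit the girl barked": strong semantic coupling.
codes_1 = {"dog": (1.0, 0.0), "girl": (0.0, 1.0)}
barked = {"dog": 0.9, "girl": 0.1}

# "The boy who saw the girl laughed": no semantic preference.
codes_2 = {"boy": (1.0, 0.0), "girl": (0.0, 1.0)}
laughed = {"boy": 0.5, "girl": 0.5}

print(decode_agent(noisy_trace, codes_1, barked, weight=0.0))  # memory only
print(decode_agent(noisy_trace, codes_1, barked, weight=1.0))  # memory + semantics
print(decode_agent(noisy_trace, codes_2, laughed, weight=1.0))
```

With these hand-picked numbers, the memory trace alone selects the wrong noun. The semantic term corrects the choice for "barked" but cannot help for "laughed", where both candidates fit the verb equally well.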
6. Representing and learning structure is perhaps the hardest problem facing the current connectionist approaches. The networks can represent only surface-level correlations, such as pixel-by-pixel similarity of two images of faces, rather than whether they represent the same person under different illumination conditions, as Edelman points out.
7. In some cases it may be possible to hand code the network so that it learns to represent deeper structure as well. Edelman gives an example of using a bank of center-surround receptive fields in vision. Similarly, in the SPEC system described above, the task is broken into parsing one clause at a time, with external memory, thereby enabling the system to generalize to new input structures. However, still missing is a mechanism for discovering the high-level structure automatically. The current connectionist techniques learn regularities at the level of their input representations only; they cannot discover regularities in structure, build a representation for that structure, and then use it to recognize and represent future inputs.
8. One possible way out of this impasse, as Edelman implicitly seems to suggest, is through the study of self-organization in the visual system. Artificial neural network models already exist that show computationally how receptive fields, feature detectors, and higher-level structures such as gestalt principles can be discovered based on correlations in the visual input (Marshall, 1990; Sirosh and Miikkulainen, 1994). Perhaps soon it will be possible to use some of the same ideas in other domains of connectionist modeling such as natural language.
9. Edelman suggests that retinotopy, or, more generally, space-based representations, could be a key to representing linguistic structure. This idea is indeed utilized in the DISCERN model. The memory representations for stories are laid out on feature maps "retinotopically", based on similarities in their representation vectors. With hierarchies of such maps it is possible to represent similarities at different levels, and even abstractions to a limited extent.
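To make the idea of a similarity-based layout concrete, here is a minimal sketch of a one-dimensional self-organizing feature map in plain Python. It is a generic Kohonen-style map, not DISCERN's actual memory: the story vectors, map size, learning-rate and radius schedules, and all function names are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def update(weights, x, lr, radius):
    """One Kohonen step on a chain of units: pull the winner and its
    map neighbors toward the input x. Returns the winner's index."""
    winner = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        # Gaussian neighborhood: 1.0 at the winner, decaying along the map.
        h = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
        weights[i] = [wj + lr * h * (xj - wj) for wj, xj in zip(w, x)]
    return winner


def train(vectors, n_units=10, epochs=200, seed=0):
    """Train a 1-D map with a shrinking learning rate and neighborhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # goes from 1 toward 0
        lr = 0.05 + 0.45 * frac                # hypothetical schedule
        radius = 0.5 + (n_units / 2.0) * frac  # hypothetical schedule
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass
            update(weights, x, lr, radius)
    return weights


# Hypothetical story representation vectors (values invented for the example).
stories = {
    "restaurant-1": [0.9, 0.1, 0.1],
    "restaurant-2": [0.8, 0.2, 0.1],
    "travel-1": [0.1, 0.9, 0.8],
    "travel-2": [0.2, 0.8, 0.9],
}
story_map = train(list(stories.values()))
for name, vec in stories.items():
    print(name, "-> unit", best_matching_unit(story_map, vec))
```

Because neighboring units are updated together, stories with similar vectors tend to end up on the same or nearby units, which is the sense in which such a layout can be called "retinotopic". The exact unit indices depend on the random initialization.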
10. It might be possible to carry this idea even further in language processing. Perhaps the individual clauses constitute a level of primitives that could be represented this way on maps. Such a map would lay out the possible case-role representations, and a collection of active clauses on the map could represent the information expressed in a complex sentence. Mechanisms of binding and segmentation known from the visual system could be used in building coherent structures on the map, and higher-level maps could form abstractions from several sentences, similar to the way the visual system abstracts from low-level features. Such an architecture is highly speculative at this point, but it does form a possible avenue for future research in language processing, especially if one subscribes to Edelman's view that ultimately the visual and linguistic processes must be based on similar low-level implementations.
Caramazza, A. and Zurif, E.B. (1976). Dissociation of Algorithmic and Heuristic Processes in Language Comprehension: Evidence from Aphasia. Brain and Language, 3:572-582.
Edelman, S. (1994). Biological Constraints and the Representation of Structure in Vision and Language. PSYCOLOQUY 5(57) language-network.3.edelman.
Huang, M.S. (1983). A Developmental Study of Children's Comprehension of Embedded Sentences with and without Semantic Constraints. Journal of Psychology, 114:51-56.
Marshall, J.A. (1990). Self-Organizing Neural Networks for Perception of Visual Motion. Neural Networks, 3:45-74.
Miikkulainen, R. (1993). Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. Cambridge, MA: MIT Press.
Miikkulainen, R. (1994). Precis of: Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon and Memory. PSYCOLOQUY 5(46) language-network.1.miikkulainen.
Miikkulainen, R. and Bijwaard, D. (1994). Parsing Embedded Clauses with Distributed Neural Networks. In Proceedings of the 12th National Conference on Artificial Intelligence, 858-864. San Mateo, CA: Morgan Kaufmann. (expanded version available by anonymous ftp from cs.utexas.edu: pub/neural-nets/papers/miikkulainen.subsymbolic-caseroles.ps.Z)
Pollack, J.B. (1990). Recursive Distributed Representations. Artificial Intelligence, 46:77-105.
Sirosh, J. and Miikkulainen, R. (1994). Cooperative Self-Organization of Afferent and Lateral Connections in Cortical Maps. Biological Cybernetics, 71:65-78. (ftp: cs.utexas.edu:pub/neural-nets/papers/sirosh.cooperative-selforganization.tar)
Stolz, W.S. (1967). A Study of the Ability to Decode Grammatically Novel Sentences. Journal of Verbal Learning and Verbal Behavior, 6:867-873.