Kenneth M. Ford and Patrick J. Hayes (1996) The Turing Test Is Just as Bad When Inverted. Psycoloquy: 7(43) Turing Test (7)

PSYCOLOQUY (ISSN 1055-0143) is sponsored by the American Psychological Association (APA).

THE TURING TEST IS JUST AS BAD WHEN INVERTED
Commentary on Watt on Turing-Test

Kenneth M. Ford and Patrick J. Hayes
Institute for Human and Machine Cognition
University of West Florida
http://www.coginst.uwf.edu/~kford/
http://www.coginst.uwf.edu/~phayes/

kford@ai.uwf.edu phayes@cs.uiuc.edu

Abstract

Watt discusses some of the problems with the original Turing Test (Watt, 1996), but he misses the central ones. His "inverted" test (where the machine plays the role of the judge) is even more vulnerable to all the criticisms of the original Test, and provides no clear conceptual advantage or insight. Similarity to human behavior is not a sensible criterion for intelligence. As we have argued elsewhere in some detail (Hayes & Ford, 1995), although the Turing Test had a historical role in getting our subject started, it is now a burden, damaging AI's public reputation and its own intellectual coherence. It is time for AI to consciously reject the naive anthropomorphism implicit in all such "imitation games," and adopt a more mature description of its aims.

Keywords

False belief tests, folk psychology, naive psychology, the "other minds" problem, theory of mind, the Turing test.
1. Watt discusses some of the problems with the original Turing Test (Watt, 1996), but he misses the central ones. The Turing Test is fundamentally flawed for two reasons: it rests on a poor experimental design, and it tests for the wrong thing. A dog which played chess well enough to occasionally defeat Kasparov would be considered a remarkably intelligent dog, but it wouldn't stand a chance in the Turing Test. Similarity to human behavior is not a sensible criterion for intelligence. Watt's "inverted" test (where the machine plays the role of the judge) suffers from the same problems.

2. The basic snag is the use of any kind of human-imitation criterion for intelligence. To see why, note only that, as Turing (1950) emphasizes, in order to pass the test a machine would have to be careful not to seem too good at some things which humans are rather poor at, such as mental arithmetic. (Watt seems to assume that the machine is obliged to tell the truth about, for example, its bodily appearance, but this is clearly not what Turing had in mind. In fact, the imitation game is an explicit test of the ability to lie, cheat, and deceive.) The winner of the Loebner competition, for example, deliberately made spelling mistakes every now and then, "noticed" them a few keystrokes later, and backspaced to correct them, all at simulated human typing speeds. This kind of trickery would be central to AI if we really took Turing's test seriously. It would not be about artificial intelligence so much as mechanical sophistry. But apart from silliness-producing contexts like the Turing Test, someone who was especially good at mental arithmetic would normally be considered to be more, not less, intelligent. In fact, as an engineering goal, we want our machines to be more and differently talented than we are, so they can help us get things done. If a desk calculator lost track of long columns of numbers and muttered to itself about carrying tens, we would trade it in for a better one.
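To make the point concrete, here is a minimal sketch, in Python, of the sort of "humanizing" trickery described above. Every detail of it -- the error rate, the delay before a typo is "noticed," the typing speed -- is our own illustrative assumption, not a detail of the actual Loebner entry.

    import random
    import sys
    import time

    # Minimal sketch of deliberately human-like typing: occasionally mistype
    # a letter, carry on for a few keystrokes, then "notice" the mistake,
    # backspace over it, and retype it correctly, all at human speed.
    # TYPO_RATE, NOTICE_AFTER and KEY_DELAY are illustrative guesses.

    TYPO_RATE = 0.03          # chance of mistyping any given letter
    NOTICE_AFTER = (2, 5)     # notice the typo this many keystrokes later
    KEY_DELAY = (0.05, 0.25)  # seconds between keystrokes

    def tap(keys):
        """Emit one keystroke's worth of output, then pause like a typist."""
        sys.stdout.write(keys)
        sys.stdout.flush()
        time.sleep(random.uniform(*KEY_DELAY))

    def type_like_a_human(text):
        i = 0
        while i < len(text):
            if text[i].isalpha() and random.random() < TYPO_RATE:
                overshoot = min(random.randint(*NOTICE_AFTER), len(text) - i)
                tap(random.choice("abcdefghijklmnopqrstuvwxyz"))  # the typo
                for ch in text[i + 1:i + overshoot]:
                    tap(ch)                  # keep typing, typo unnoticed
                for _ in range(overshoot):
                    tap("\b \b")             # backspace to the mistake
                for ch in text[i:i + overshoot]:
                    tap(ch)                  # retype correctly
                i += overshoot
            else:
                tap(text[i])
                i += 1

    type_like_a_human("Machines should not have to feign human frailty.\n")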

3. Watt considers this issue in a rather diffuse way by talking of "alien intelligence," that is, intelligence of a kind not exhibited by human beings. His conclusion is that even if this existed, we wouldn't be able to recognize it. We think his conclusion is wrong (even if one could not recognize the methods used as intelligent, their effect -- the behavior -- would surely be recognizable as resulting from the application of intelligence, and so it could be measured); but the point here is not that the machine might be alien in some unrecognizable way, but rather that it might exhibit an inhuman disparity of intellectual talents. It might be superb at scientific induction from large bodies of evidence, but useless at fireside chitchat; or it might be better than anyone at making rapid judgements of likely creditworthiness, but unable to count to five. The machines we can already build are like idiot savants in this way, and they are remarkably useful. Computational machines of this kind are best understood as cognitive prostheses: they extend the power of our minds, rather in the way that tools like backhoes and steam shovels have extended and amplified the power of our muscles. This is where AI is being most useful and having the most important effects on society, and all manner of Turing-type tests are completely irrelevant to it.

4. Watt is correct to observe that our tendency to attribute intentionality to things that behave or look like us is a likely source of bias in the original Turing Test, at least in the simple version which is usually understood. Turing's own paper is rather ambiguous with respect to whether the Turing Test is meant to be taken as a "species test" -- where the judge is asked to distinguish members of their own species from mechanical imposters -- or the more subtle "gender test," which requires a computer to play the role of the man in the original imitation game. The gender version then has a woman and a machine each trying to convince the judge that they are a woman, and the judge's task is to decide which is the woman and which, therefore, is not. Notice that this judge is still thinking about the differences between women and men, not humans and machines. In this version, a machine can fail either by being discovered as inhuman or, more subtly, by being an unsuccessful imitator of a woman. The gender version of the Test escapes many of Watt's criticisms, since the judge here is not expected to be thinking about what makes behavior seem human or not.
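The contrast between the two readings can be made explicit. The toy rendering below (our own illustration, in Python; nothing like it appears in Turing, 1950) records what each judge is asked and how the machine can fail.

    from dataclasses import dataclass

    # Toy model of the two readings of the imitation game discussed above.
    # The names and fields are our own illustrative choices, not Turing's.

    @dataclass
    class Variant:
        judge_question: str    # what the judge is asked to decide
        contestants: tuple     # (genuine article, impostor)
        machine_fails_if: str  # the machine's failure conditions

    SPECIES_TEST = Variant(
        judge_question="Which of the two is the human?",
        contestants=("a human", "the machine"),
        machine_fails_if="it is unmasked as inhuman",
    )

    GENDER_TEST = Variant(
        judge_question="Which of the two is the woman?",
        contestants=("a woman", "the machine, playing the man's part"),
        machine_fails_if="it is unmasked as inhuman, or it imitates a "
                         "woman less convincingly than the woman presents "
                         "herself",
    )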

5. Finally, the utility of the proposed "inverted" test is simply unclear. As Watt states, it is completely impractical as an actual experiment. But what do we learn from considering it as a thought-experiment? Turing, writing almost half a century ago, suggested his Test as an actual research goal. One of his aims was to defuse pointless philosophical speculation about the exact nature of intelligence, and for that purpose the Test has been a notable failure. Understood as the simple "species" version, the Turing Test is now widely taken to be simply a rather fancy way of stating that the goal of AI is to make an artificial human being. But now, with the benefit of hindsight, this seems like a basic error whether one's aim is scientific or engineering. (In the first case, it amounts to a kind of Frankenstein experiment, a notoriously poor experimental design; in the second case, who needs it?) Watt's inverted test seems to amount only to the suggestion that we should take as our criterion and goal not the human ability to lie convincingly, but the human ability to recognize human behavior. Since this is one task at which humans are notoriously poor -- which indeed is why the Turing Test is faulty, according to Watt -- it seems that the "inverted" test is even more vulnerable to all these criticisms, and provides no clear conceptual advantage or insight.

6. In particular, Watt (correctly) points out that if the "species" version of the Test were run repeatedly, the criteria used by the judges to make their judgements of human-like intelligence would change. This has indeed already occurred: in the 1950s, mental language was often used of electronic computers simply by virtue of their ability (astonishing at the time) to perform arithmetic. However, given this fact, the inverted Turing Test would seem to be even less stably defined than the original Turing Test, since the criteria which the machines must use, as well as their behavioral goals, must be in a state of constant flux. Even if it could be attempted, it is not clear that the resulting scenario would have a stable limit.

7. Turing's (1950) paper, "Computing Machinery and Intelligence," inspired the creation of the field of Artificial Intelligence, gave it a vision, a philosophical charter, and its first great challenge. The Turing Test, in one form or another, has been with AI ever since, and is still often used to define the field. However, as we have argued elsewhere in some detail (Hayes & Ford, 1995), although the Turing Test had a historical role in getting our subject started, it is now a burden, damaging AI's public reputation and its own intellectual coherence. It is time for AI to consciously reject the naive anthropomorphism implicit in all such "imitation games," and adopt a more mature description of its aims.

REFERENCES

Hayes, P.J. & Ford, K.M. (1995) Turing Test Considered Harmful. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal, pp. 972-977.

Popple, A.V. (1996) The Turing Test as a Scientific Experiment. PSYCOLOQUY 7(15) turing-test.2.popple.

Turing, A.M. (1950) Computing Machinery and Intelligence. Mind, 59: 433-460.

Watt, S.N.K. (1996) Naive Psychology and the Inverted Turing Test. PSYCOLOQUY 7(14) turing-test.1.watt.

