NHST is a useful tool for communication among researchers at the frontiers of scientific knowledge. I contend that inferential statistics are used primarily as descriptive landmarks in negotiating uncertain terrain, and that fashioning suitable null hypotheses can be nontrivial. Momentum effects in tennis serve as an example. Research advances by both magnifying and isolating effects, and p values are valuable benchmarks in this endeavour. An increased focus on the locus of effects at the level of the individual is desirable. These recommendations are orthogonal to the abstract logic underlying NHST and do not diminish its utility.
2. Unlike Krueger, I do not perceive significance testing (ST) to be the main villain, although it is an unwitting accomplice. The arch enemy, in my view, is the inadequacy of null hypotheses (NH). Significance testing, it must be admitted, only operates on the sampling distribution derived from the given NH. Overall, there is much to commend NHST (e.g., the clarity with which it states a question by using a simple contrast); its abuse can usually be traced to inadequate hypotheses which are flawed for reasons that are orthogonal to the NHST procedure.
3. In the false consensus example, no one would dispute that the zero point illustrates very little, especially after the first dozen studies had established the phenomenon. The Bayesian induction model is more interesting, as it provides 33% as the maximum permissible difference in the self versus other consensus estimates (Dawes, 1989). At the very least, it draws attention to the possibility that some projection is better than none and led to empirical findings which demonstrate that there is such a thing as over-projection (Krueger & Clement, 1994).
4. Nevertheless, null hypotheses, even clearly false ones, are useful because in addition to conclusions restricted to the study in which they are used, they provide a common benchmark for comparisons across studies. In neuroimaging studies, researchers use z-statistics to compare image maps in experimental and control conditions, with zero-based NHST. It is well known that the interpretation of this statistic is problematic (e.g., because of spatial and temporal autocorrelation across voxels), but it is widely used, contributes to the discourse, and advances research in this community. Depending on the study, researchers choose to use a sufficiently high value of the threshold to separate signal from noise (e.g., z > 3.5). There are other measures, such as time course data, which can reinforce the validity of image maps obtained from repeated application of non-independent z tests (e.g., Chee et al., 2000). The point is that, in this context, a putative inferential statistic functions as a descriptive statistic, a guide to the interesting patterns in data.
5. One might counter that neuroimaging is the exception and that, in fact, most cutting-edge research involves proper application of formal statistics. I flatly deny this. Inferential statistics, in its purest form, can be relegated to population census, poll surveys, physical measurements by organisations such as the National Institute of Standards and Technology, and applied contexts such as those examining mean time between failures for electronic components and agricultural research. In these contexts, the constructs are highly constrained, so that there is no ambiguity about the statistic one wishes to measure to some pre-specified accuracy. The researcher estimating the expected life of AA Energizer batteries is not in the least concerned whether the results will generalise to a comparable Duracell.
6. In the physical sciences, results that diverge from mathematical models by small amounts can have disastrous consequences for the model. But, given the presence of such exquisitely constrained models, inferential statistics is, at best, a handmaiden and can hardly lay claim to being Delphi. We thus come to the curious conclusion that statistical machinery such as NHST is used least formally in precisely the very domains which are most in need of conceptual precision. The machinery of inferential statistics provides the researcher with valuable descriptive signposts in poorly defined territory, and also aids in communication among research workers. This is natural, but can also lead to misuse. An example from recent experience is used to illustrate the importance of adequately specified null hypotheses.
7. The hot hand in basketball phenomenon was undermined when it was found that, contrary to sporting folklore, the probability of a successful shot attempt was not positively related to the outcome of the previous attempt (Gilovich, Vallone, & Tversky, 1985). In 1998, I saw a conference poster that described the presence of "momentum" effects in tennis. Using a sample of 22 tennis matches played by juniors (on average, each match lasted for a total of 125 points), it was found that, on average, winners had 8.9 streaks versus 5.3 for the losers (p < .0001; streak was a run of 3 or more points in the match). Across matches, the number of winner and loser streaks were positively correlated (r's around .7) with the number of points won by both the winner and the loser. Winners typically win more points than losers. Thus, winner points was adopted as a statistical control in an ANCOVA contrasting winner and loser streaks (total points could also have been used as a covariate). The effect persisted, at the p < .05 level, leading to the implication that winners were more streaky than losers. I will show that this application of NHST is problematic, not because of the inadequacy of NHST per se, but due to the inadequacy of the specific null hypothesis which was used.
Figure 1. COMPUTER SIMULATION OF WINNER STREAKS. Mean number of red streaks as a function of the number of red balls present in an urn with 100 balls.
8. The researchers assumed that the number of streaks is linearly related to points won, and ignored other aspects of tennis match structure. The linearity assumption is generally false, as illustrated by a computer simulation of the number of red streaks which involved drawing 100 balls, without replacement, from an urn containing red and blue balls (see Figure 1). Each point is based on the mean of 1000 observations. The assumption is tenable if we restrict ourselves to the middle of the range of the values because most tennis matches tend not to be completely lop-sided in terms of points won and lost. Even then, it should be recognised that the slope and range of the linear part could vary across matches that have different numbers of total points won. In effect, across matches, it cannot be assumed that there is a common linear relationship relating winner (or total points) to the number of streaks. The above simulation is overly simplistic because it ignores the effects of the total number of points; in addition it ignores the actual structure of tennis; for example, a legal tennis game cannot be assigned exactly 7 points. The strategy I adopted involved constructing distinct null hypotheses that were tailored to the structure of each individual match.
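The urn simulation described in paragraph 8 can be sketched as follows. This is a minimal sketch and not the program actually used. It assumes that drawing all 100 balls without replacement is equivalent to shuffling the urn, and that a streak is a maximal run of 3 or more red balls (the definition given for tennis points in paragraph 7). The function names, the step size of 10 red balls, and the fixed random seeds are illustrative choices.

```python
import random


def count_streaks(seq, target, min_len=3):
    """Count maximal runs of `target` in `seq` that are at least `min_len` long."""
    streaks = 0
    run = 0
    for item in seq:
        if item == target:
            run += 1
        else:
            if run >= min_len:
                streaks += 1
            run = 0
    # A run that reaches the end of the sequence also counts.
    if run >= min_len:
        streaks += 1
    return streaks


def mean_red_streaks(n_red, n_total=100, n_reps=1000, rng=None):
    """Mean number of red streaks over `n_reps` shuffles of an urn with `n_red` red balls."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability only
    urn = ["R"] * n_red + ["B"] * (n_total - n_red)
    total = 0
    for _ in range(n_reps):
        rng.shuffle(urn)  # drawing every ball without replacement = a random ordering
        total += count_streaks(urn, "R")
    return total / n_reps


if __name__ == "__main__":
    rng = random.Random(1)
    for n_red in range(0, 101, 10):
        print(n_red, round(mean_red_streaks(n_red, rng=rng), 2))
```

Under these assumptions the curve cannot be linear over its whole range: an urn with no red balls yields no streaks, and an urn with 100 red balls yields exactly one streak, while intermediate proportions yield several.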
Figure 2. NULL HYPOTHESIS FOR WINNER STREAKS IN MATCH 5. Distribution of winner streaks across 10,000 simulated matches in which the total winner and loser points were matched to the corresponding statistics of match 5.
9. After obtaining the raw data from the researcher, I constructed models of tennis which derived possible matches consistent with the number of winner and loser points and the rules of tennis. For example, if there were 90 winner points and 75 loser points in a 2-set match, the program would construct possible 2-set matches that were consistent with this match. The sum of the absolute difference between the simulated winner and actual winner points, and the absolute difference between the simulated loser and actual loser points, was constrained to be less than 5. Ten thousand simulated matches were constructed for each real match. The sampling distributions of both the winner and loser streaks were ascertained for 20 matches (these discrete distributions were symmetric and bell-shaped; see Figure 2). For each match, the probabilities of results equal to or greater than the observed winner and loser streaks were ascertained. Table 1 displays these p values for the 20 matches. One would be hard put to claim that either winners or losers were streaky (considering the sampling distribution of the difference between winner and loser streaks, only the second match provides a p < .05 difference). On the contrary, it appears that quite a few winners and some losers were less streaky than expected.
Table 1. The tail probabilities of the null hypotheses for winner and loser streaks across 20 matches.
Match  Winner streaks  P(X >= Winner)  Loser streaks  P(X >= Loser)
1      10              0.8372          9              0.6150
2      12              0.3438          5              0.9946
3      9               0.5140          3              0.8527
4      8               0.7949          4              0.6345
5      6               0.9844          6              0.0913
6      9               0.7077          7              0.3157
7      10              0.2696          7              0.1983
8      8               0.7499          8              0.0528
9      8               0.7389          3              0.8013
10     8               0.9636          6              0.9562
11     7               0.9450          5              0.4306
12     9               0.5582          3              0.9405
13     8               0.8415          5              0.5034
14     9               0.3710          4              0.0864
15     6               0.9130          0              1.0000
16     7               0.9154          3              0.7089
17     10              0.4624          4              0.9204
18     9               0.9888          9              0.7148
19     10              0.2837          3              0.8852
20     11              0.2075          7              0.3362
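The idea of a match-specific null hypothesis can be illustrated with a much simpler model than the one described in paragraph 9. The sketch below is not that model: it ignores games, sets and the rules of tennis altogether, and simply shuffles a match's winner and loser points while holding both totals fixed. It then estimates the tail probability P(X >= observed) for the number of streaks, where a streak is a run of 3 or more points. The function names, the seed, and the observed streak count of 10 in the usage example are hypothetical; the point totals of 90 and 75 are the illustrative figures from paragraph 9.

```python
import random


def count_streaks(points, player, min_len=3):
    """Count maximal runs of points won by `player` that are at least `min_len` long."""
    streaks = 0
    run = 0
    for p in points:
        if p == player:
            run += 1
        else:
            if run >= min_len:
                streaks += 1
            run = 0
    if run >= min_len:
        streaks += 1
    return streaks


def streak_tail_probability(n_winner, n_loser, observed, player="W",
                            n_sims=10_000, seed=0):
    """Monte Carlo estimate of P(streaks >= observed) under a shuffled-points null.

    Assumption: points are exchangeable within a match, so tennis structure
    (games, sets, serve) is ignored. This is a simplification of the text's model.
    """
    rng = random.Random(seed)
    points = ["W"] * n_winner + ["L"] * n_loser
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(points)
        if count_streaks(points, player) >= observed:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical match: 90 winner points, 75 loser points, 10 observed winner streaks.
    p = streak_tail_probability(90, 75, observed=10)
    print(f"Estimated P(X >= 10 winner streaks) = {p:.4f}")
```

Because each match gets its own simulated distribution, no common linear relationship between points and streaks has to be assumed across matches. A model that respects the structure of tennis, as in the text, would give different tail probabilities from this shuffled-points version.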
10. We must conclude, contrary to the trend in the ANCOVA, that these data do not support the streak hypothesis. Matches with identical conditionals as the data can be extremely streaky, at least with reference to the null hypotheses. This absolves the null hypotheses from the criticism that they impose exacting standards which are difficult to exceed. By now, it should be apparent that group-level analyses are inappropriate to this problem.
11. It is possible that tennis may not be streaky in general but that a particular player might be so, consistently across matches, or under particular conditions. Some players who are not streaky in general tend to be so when playing each other. It is apparent that tennis can be streaky at the set-level, as exemplified by the various comeback wins (e.g., the Andre Agassi versus Andrei Medvedev 1999 French Open final), and sometimes at the game-level (e.g., a 0-6 set followed by 6-0, 6-0). Perhaps the most informative unit of analysis in tennis is neither the set nor the point but the game. A more sophisticated model could also take into account the natural advantage that accrues to the server in tennis. With the right model, one can explore data in a much richer way, using a variety of statistics and eventually construct a satisfying theory. Far more than is commonly realised, much of the meat of theory is encapsulated in null hypotheses.
12. We should, as many scientists do, use statistics as heuristics that reveal interesting landmarks meriting further study. As a scientist, I am relieved if only a couple of big F ratios stand out among the many possible Fs in a 6-way ANOVA. The Fs and their significance are obviously conditional on the (a) procedures and materials, (b) sample, and (c) operationalisation of the dependent variable and the resultant impact on sample statistics. If aspects of the procedure which are orthogonal to theory are responsible for effects, then these are much less interesting (unless the new variables themselves are revealing). Similarly, alternative operationalisations of the dependent measure can lead to varying effect sizes; clearly this is outside the scope of NHST. Given that researchers iterate their study designs by observing preliminary data patterns, it would be absurd to claim that the inferential statistics used are pristine. Nevertheless, this is a useful strategy. The ultimate test of a phenomenon is its reproducibility using well-specified procedures and methods. Like Greenwald and his colleagues (1996), my opinion is that the humble p value would be much more useful if researchers offered exact p values and eliminated the troublesome inequality. Ceteris paribus, a p value of 1 in 10 billion is far more convincing than a p value of 1 in 20.
13. In a preliminary study, the detection of an effect which barely crosses the .05 threshold may be cause for interest, but it is little use if the effect size is not magnified and made more consistent in subsequent studies. Even if 100 studies, each with a p value of .05, follow the first study, this is not necessarily conclusive. As the file-drawer problem indicates, publication bias in research cannot be wished away, and makes these probabilities highly suspect (Scargle, 1999). One should be critical and ask why further research did not eliminate sources of error (in participants, methods or materials) to obtain greater consistency. From my point of view, giving exact p values, together with F ratios and MSE (in the context of ANOVA) is quite adequate. The emphasis on the advantages of confidence intervals ignores the plain fact that, in the social sciences and other domains in which construct operationalisations tend to be fast and loose, p values are better tools for communication than confidence intervals. While researchers have not settled on Bayesian procedures for adjusting posteriors given the data, their actual behaviour may reflect just that. Surely the p values and effect sizes obtained influence the choice of task paradigms and questions pursued subsequently. Perhaps every researcher is a closet Bayesian, waiting for the day when it becomes acceptable to parade in the language of priors and posteriors.
14. Hence NHST as such is not the problem. If researchers accept pale null hypotheses and do not strive to sharpen their substantive tools, statistical practice is hardly to blame. Having said that, one aspect of NHST is usually overlooked in critiques. Rarely, if ever, does one come across critiques of group-level analyses, in which the sample mean is accorded the kind of reverence reserved for sacred phallic symbols. As researchers, should we not be curious about the individual? If an effect is obtained, say at p = .01, what percentage of the sample shows it conclusively? Many studies can support statistical tests within individuals, across items, especially when multiple observations are obtained from each person. If 36 of 40 people show a statistically significant effect in the same direction, then the overall F ratio is redundant. The one person who shows a significant effect in the opposite direction is cause for curiosity (Sriram, 1999). Researchers in neuroscience, where individual analyses are common, routinely treat the person as the focus of analysis. In a recent paper, we found that one right-handed person had inverted language activations, and that individual is now the subject of further investigation (Chee et al., 1999). The same approach could yield rich dividends in social psychology. Idiographic approaches, as practised by Krueger (1998b) and Stanovich and West (1998), are bound to be illuminating. Some of the reverence accorded to the mean could also be transferred to measures of variability within individuals (e.g., Sriram & Lee, 1999). Hall (1999) shows that the SD does not capture basic intuitions concerning variability, and this insight may eventually lead to the deployment of alternative sample statistics.
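One way to see why the overall F ratio adds little in the 36-of-40 case of paragraph 14 is a simple sign test on the direction of individual effects. The sketch below is an illustration under stated assumptions, not an analysis taken from the text: it treats the 40 people as independent and takes as its null hypothesis that each person's effect is equally likely to fall in either direction (p = 0.5). The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(n, k, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # 36 of 40 individuals show the effect in the same direction.
    p_value = binomial_upper_tail(40, 36)
    print(f"P(X >= 36 | n = 40, p = 0.5) = {p_value:.3g}")
```

Under these assumptions the tail probability is on the order of 1 in 10 million, and this counts direction alone. It does not yet use the fact that each of the 36 individual effects was itself statistically significant, which would make the result more extreme still.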
15. Turning away from NHST, it is debatable whether the excessive focus on negativism is desirable. Although one may not approve of the lack of an overall positive glow in all of this research, it is not difficult to defend it. As McCauley (1998) points out, the 20th century's record in man's inhumanity against man is well documented. Therefore, it is hardly surprising that social psychologists have been fascinated with evil, and more broadly, human failings. Most would agree that the knowledge painstakingly accrued by research programs such as Milgram's (1983) has added considerably to our view regarding human nature. The same could be said of the considerable body of work illustrating human frailties in judgement and decision making (e.g., Dawes, 1988).
16. Research into human abilities and all that is good in us is already available in abundance, although it may not be categorised as such. Leaving aside work on topics such as altruism and love, the enormous strides that are being made in the psychophysics and neuroscience of perception and cognition leave one in awe of the cognitive machinery which all of us have inherited and use at every moment (Baars, 1997). Consequently it is not surprising that workers in artificial intelligence (Rickert, 1998) tend to have a positive view of human abilities and choose to segregate rationality from intelligent behaviour. However, like Krueger (1998b), I find the construct of creativity to be slippery, and consider Rickert's equating creativity with intelligence unsatisfying. Whatever the limitations of rationality, it can at least be defined. Unlike McCauley, I do not consider an information processing viewpoint to be limiting; on the contrary it may be too powerful. Current information processing models accommodate both hot emotions and cold cognitions (Damasio, 1995). The decimation of millions of people in this century is an effect of the interaction of technology with ideology. Rational thinking may be orthogonal to genocidal evil, but this is an unsubstantiated hypothesis. For milder, chronic, evil, a strong case can be made for the normativeness of rational thinking.
Baars, B. (1997). In the theater of consciousness: The workspace of the mind. Oxford University Press.
Chee, M. W. L., Caplan, D., Soon, C. S., Sriram, N., Tan, E. W. L., Thiel, T., & Weekes, B. (1999). Processing of visually presented sentences in Mandarin and English studied with fMRI. Neuron, 23: 127-137.
Chee, M. W. L., Sriram, N., Soon, C. S., & Lee, K. M. (2000). Dorsolateral prefrontal cortex and the implicit association of concepts and attributes. Neuroreport (in press).
Damasio, A. (1995). Descartes' error: Emotion, reason, and the human brain. Avon Books.
Dawes, R. M. (1988). Rational Choice in an Uncertain World. Harcourt Brace.
Dawes, R. M. (1989) Statistical criteria for a truly false consensus effect. Journal of Experimental Social Psychology 25: 1-17.
Gilovich, T., Vallone, R. & Tversky, A. (1985). The hot hand in basketball: On misperceptions of random sequences. Cognitive Psychology, 17, 295-314.
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996) Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology 33: 175-183.
Hall, M. J. W. (1999). Universal geometric approach to uncertainty, entropy and information. Phys. Review. A. (in press) http://xxx.lanl.gov/abs/physics/9903045
Krueger, J., & Clement, R. W. (1994) The truly false consensus effect: An ineradicable and egocentric bias in social perception. Journal of Personality and Social Psychology 67: 596-610.
Krueger, J. (1998a) The bet on bias: A forgone conclusion? PSYCOLOQUY 9(046) ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/1998.volume.9/ psyc.98.9.46.social-bias.1.krueger http://www.cogsci.soton.ac.uk/cgi/psyc/newpsy?9.46
Krueger, J. (1998b) On the perception of social consensus. Advances in Experimental Social Psychology 30:163-240.
McCauley, C. (1998) The bet on bias is cockeyed optimism. PSYCOLOQUY 9(071) ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/1998.volume.9/ psyc.98.9.71.social-bias.9.krueger http://www.cogsci.soton.ac.uk/cgi/psyc/newpsy?9.71
Milgram, S. (1983). Obedience to Authority. HarperCollins.
Rickert, N. W. (1998). Intelligence is not rational. PSYCOLOQUY 9(051) ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/1998.volume.9/ psyc.98.9.51.social-bias.3.rickert http://www.cogsci.soton.ac.uk/cgi/psyc/newpsy?9.51
Scargle, J. D. (1999). Publication bias (the "file-drawer problem") in scientific inference. http://xxx.lanl.gov/abs/physics/9909033
Sriram, N., & Lee, I. (1999). The multiplicative effects of context switching, congruity, predictability and dominance in speeded semantic classification. (submitted for publication.)
Sriram, N. (1999). The human ingroup bias as revealed by the implicit association test. (Unpublished data.)
Stanovich, K. E., & West, R. F. (1998). Individual differences in rational thought. Journal of Experimental Psychology: General, 127: 161-188.