Mark Hallahan (1999) The Hazards of Mechanical Hypothesis Testing. Psycoloquy: 10(001) Social Bias (13)

PSYCOLOQUY (ISSN 1055-0143) is sponsored by the American Psychological Association (APA).
Psycoloquy 10(001): The Hazards of Mechanical Hypothesis Testing

Commentary on Krueger on Social-Bias

Mark Hallahan
Department of Psychology
Clemson University
Clemson, SC 29634-1511


Researchers often use null hypothesis significance testing without considering important issues such as statistical power and whether statistical tests' underlying assumptions fit their theory and data. This commentary discusses how these issues relate to research on lay perceptions of streak shooting. It is suggested that researchers may better understand their phenomena by explicitly trying to model them, consciously recognizing the inherent biases and limitations of their methods, and choosing methods flexibly to fit the specific attributes of their data and theory.


Bayes' rule, bias, hypothesis testing, individual differences, probability, rationality, significance testing, social cognition, statistical inference
1. Krueger (1998) has argued that existing research often portrays human judgment as biased and irrational and that this unflattering view is partially due to how null hypothesis significance testing [NHST] is used [e.g., see Chow 1998 -- ed.]. It is problematic to define irrationality in terms of statistical significance (namely, that statistically significant deviations from some normative value imply irrationality). NHST can lead to unproductive, dichotomous thinking about research results: the null hypothesis is or is not rejected, irrationality is or is not demonstrated. As Krueger notes, it is more informative to focus on the size of effects and to identify functions that explain how people make judgments rather than merely to demonstrate that bias exists.

2. Using NHST mechanically, without considering its underlying assumptions and inherent biases, exacerbates the problems Krueger (1998) discussed. Two troublesome practices associated with NHST are using statistics whose underlying assumptions do not fit the theoretical model for one's data (Gonzalez, 1994; Loftus, 1996) and neglecting statistical power (Cohen, 1988). For illustration, I will discuss how these issues could be handled differently in research on lay perceptions of "streak shooting" (e.g., Gilovich, Vallone & Tversky, 1985). Observers believe that basketball players have shooting "streaks," or sequences in which the percentage of shots made deviates from their typical performance. The subjective perception of streak shooting is thought to reflect biased judgment because the pattern in actual shot sequences rarely differs significantly from chance. However, if the assumptions that underlie NHST do not adequately model what happens when people shoot basketballs, NHST may not be well suited to assess how closely actual shot sequences approximate a random process.


3. Shot outcomes in a truly random sequence should be independent of each other. In a long random sequence, the serial correlation (r-ser), or the correlation of a shot's outcome with that of the preceding shot, should be zero. In shorter random sequences, the obtained values of r-ser will deviate substantially by chance and should be normally distributed around zero for a sample of shooters. However, I examined distributions of r-ser from two samples of actual shot sequences and both were distributed bimodally. Fewer shooters had near-zero values of r-ser and more had extreme positive and negative values than would be expected in a normal distribution. In one sample, twenty-six college basketball players each took 100 shots (Gilovich et al., 1985, study 4). If r-ser were normally distributed around zero, 40% of shooters would be expected to have r-ser between -.026 and .086, but only 23% actually fell in that range (chi-square(1) = 3.10, p = .08). In the other sample, the first round of the 1998 NBA long distance shooting competition [FOOTNOTE 1], eight players each took 25 shots. In a normal distribution around r-ser = .00, 40% should have r-ser between -.055 and .184, but only 13% did (chi-square(1) = 2.52, p = .11). This trend is statistically significant when these two small samples are combined meta-analytically (Z = 2.37, p = .009, 1-tailed). Also, in both samples, the majority of shooters had negative values of r-ser (54% and 63%, respectively). Apparently, r-ser is not normally distributed around zero, casting doubt on the previously unquestioned assumption that shooting basketballs is a random process.
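
The test statistics in this paragraph can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the commentary does not state: that the reported percentages correspond to 6 of 26 and 1 of 8 shooters falling inside the range, and that the two samples were combined with Stouffer's method (summing z scores and dividing by the square root of the number of samples). The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def chi_square_in_range(n_shooters, n_in_range, expected_share):
    """One-df chi-square: observed vs. expected counts inside and outside a range."""
    expected_in = n_shooters * expected_share
    expected_out = n_shooters - expected_in
    observed_out = n_shooters - n_in_range
    return ((n_in_range - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)


def stouffer_z(chi_squares):
    """Combine one-df chi-squares (same direction of effect) by Stouffer's method."""
    z_scores = [sqrt(x) for x in chi_squares]  # a 1-df chi-square is a squared z
    return sum(z_scores) / sqrt(len(z_scores))


# Assumed counts, inferred from the reported 23% of 26 and 13% of 8.
college = chi_square_in_range(26, 6, 0.40)
nba = chi_square_in_range(8, 1, 0.40)
combined_z = stouffer_z([college, nba])
p_one_tailed = 1 - STD_NORMAL.cdf(combined_z)

print(f"college chi-square(1) = {college:.2f}")
print(f"NBA chi-square(1) = {nba:.2f}")
print(f"combined Z = {combined_z:.2f}, one-tailed p = {p_one_tailed:.3f}")
```

Under these assumptions the sketch gives chi-square values of 3.10 and 2.52 and a combined Z of 2.37 (p = .009, 1-tailed), matching the figures reported above.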

4. Alternative models might better describe what actually happens when people shoot basketballs. For example, there may be systematically different types of shooters. Assume that approximately 60% of people are more likely to hit a shot following a missed shot (call them "deliberate shooters"); perhaps missing cues them to concentrate more on subsequent shots. The remaining 40% ("streak shooters") are more likely to hit following a successful shot. Further assume, for argument's sake, that r-ser is normally distributed within each of these groups, with deliberate shooters' average r-ser = -.10 and streak shooters' average r-ser = .15 [FOOTNOTE 2]. Although this hypothetical model of the shot process would be problematic for NHST, Bayes' Theorem can be used appropriately to estimate the conditional probability that an observed shot sequence came from either a streak or a deliberate shooter.
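
To make the two hypothetical shooter types concrete, the sketch below simulates each as a two-state Markov chain and computes r-ser from the resulting sequence. Everything here is an assumption for illustration: the commentary specifies only the average r-ser of each group, so the 50% hit rate, the Markov form of the dependence, and the function names are all hypothetical.

```python
import random
from math import sqrt
from statistics import fmean


def serial_correlation(shots):
    """Lag-1 correlation of each outcome (1 = hit, 0 = miss) with the preceding one."""
    prev, curr = shots[:-1], shots[1:]
    mean_prev, mean_curr = fmean(prev), fmean(curr)
    cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    var_prev = sum((a - mean_prev) ** 2 for a in prev)
    var_curr = sum((b - mean_curr) ** 2 for b in curr)
    return cov / sqrt(var_prev * var_curr)


def simulate_shooter(n_shots, hit_rate, r_ser, rng):
    """Markov-chain shooter with long-run hit rate `hit_rate` and lag-1 correlation `r_ser`."""
    p_after_hit = hit_rate + r_ser * (1 - hit_rate)
    p_after_miss = hit_rate * (1 - r_ser)
    shots = [1 if rng.random() < hit_rate else 0]
    for _ in range(n_shots - 1):
        p_hit = p_after_hit if shots[-1] else p_after_miss
        shots.append(1 if rng.random() < p_hit else 0)
    return shots


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed 50% hit rate; the population r-ser values are those of the model above.
deliberate = simulate_shooter(20000, 0.5, -0.10, rng)
streak = simulate_shooter(20000, 0.5, 0.15, rng)

print(f"deliberate shooter r-ser: {serial_correlation(deliberate):.3f}")
print(f"streak shooter r-ser: {serial_correlation(streak):.3f}")
```

With very long sequences the sample r-ser settles close to each type's population value. Shortening `n_shots` to 25 or 100 shows how widely r-ser scatters in sequences of the length actually observed.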

5. Imagine a hypothetical sequence of 26 shots in which a shooter made 67% of shots following a made shot and 40% following a missed shot; r-ser = .27 for this sequence, which is not significant, p = .19 (2-tailed). Bayes' Theorem requires the prior probability that a given shooter is a streak or deliberate shooter. Based on our model, there is a 40% chance that any given shooter is a streak shooter, P(S) = .40, and a 60% chance that any given shooter is a deliberate shooter, P(D) = .60. Also needed is the probability of obtaining the observed shot sequence conditional on what type of person is shooting, P(r-ser >= .27|S) and P(r-ser >= .27|D). A sequence with r-ser >= .27 would occur 27.8% of the time from a streak shooter and 3.8% of the time from a deliberate shooter (i.e., 1-tailed significance levels for the extent to which a sample r-ser = .27 differs from our assumed population values, r-ser = .15 and r-ser = -.10, respectively). The conditional probability that a person shooting a 26 shot sequence in which r-ser >= .27 comes from the streak shooter sub-population, P(S|r-ser >= .27), is obtained with Bayes' Theorem:

    P(S|r-ser >= .27)
    = [P(S)P(r-ser >= .27|S)] / [P(S)P(r-ser >= .27|S) + P(D)P(r-ser >= .27|D)]
    = (.40)(.278) / [(.40)(.278) + (.60)(.038)]
    = .828   (Equation 1)

Similarly, P(D|r-ser >= .27) is obtained by

    P(D|r-ser >= .27)
    = [P(D)P(r-ser >= .27|D)] / [P(S)P(r-ser >= .27|S) + P(D)P(r-ser >= .27|D)]
    = (.60)(.038) / [(.40)(.278) + (.60)(.038)]
    = .172   (Equation 2)

In other words, a 26 shot sequence in which r-ser >= .27 is nearly five times more likely to come from a streak shooter than a deliberate shooter. By NHST's standards, the degree of dependence between shots is not large enough to be statistically significant. However, a strong preponderance of evidence supports the judgment of streak shooting based on Bayesian analysis and a different model of the shooting process. People certainly do not perceive chance flawlessly. However, our understanding of when and to what extent lay judgments are inaccurate may partially depend on our choice of methods and the (often unrecognized) biases inherent in those methods. By appropriately considering alternatives to NHST, researchers may better understand how lay judgments are made.
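
The Bayesian calculation can be sketched in code. The commentary does not say how the two tail probabilities were computed; the version below assumes Fisher's z approximation with n taken as the 25 consecutive shot pairs in a 26 shot sequence. The function names and default arguments are illustrative, with the priors and population r-ser values taken from the model above.

```python
from math import atanh, sqrt
from statistics import NormalDist


def upper_tail(r_observed, r_population, n_pairs):
    """P(sample r >= r_observed) given a population r, via Fisher's z approximation."""
    standard_error = 1 / sqrt(n_pairs - 3)
    z = (atanh(r_observed) - atanh(r_population)) / standard_error
    return 1 - NormalDist().cdf(z)


def posterior_streak(r_observed, n_pairs, prior_streak=0.40,
                     r_streak=0.15, r_deliberate=-0.10):
    """Bayes' Theorem: P(streak shooter | r-ser >= r_observed)."""
    like_streak = upper_tail(r_observed, r_streak, n_pairs)
    like_deliberate = upper_tail(r_observed, r_deliberate, n_pairs)
    numerator = prior_streak * like_streak
    return numerator / (numerator + (1 - prior_streak) * like_deliberate)


# Assumption: 26 shots yield 25 (previous shot, current shot) pairs.
p_given_streak = upper_tail(0.27, 0.15, 25)
p_given_deliberate = upper_tail(0.27, -0.10, 25)
p_streak = posterior_streak(0.27, n_pairs=25)

print(f"P(r-ser >= .27 | S) = {p_given_streak:.3f}")
print(f"P(r-ser >= .27 | D) = {p_given_deliberate:.3f}")
print(f"P(S | r-ser >= .27) = {p_streak:.3f}")
print(f"P(D | r-ser >= .27) = {1 - p_streak:.3f}")
```

Under these assumptions the tail probabilities come out at roughly .278 and .038, and the posterior at roughly .828, in line with Equations 1 and 2. Changing `prior_streak` shows how strongly the conclusion depends on the assumed 40/60 split.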


6. Another important difference between NHST and alternative methods is the criterion used to determine that a shot sequence is a nonrandom streak. Though arbitrary, NHST's ubiquitous alpha = .05 convention is often used without consideration for how it affects statistical power. Unless sequences are long or the degree of dependence between shots is strong, conventional NHST will probably have poor statistical power. For example, in a population where the actual degree of dependence between shots is r-ser = .27, a significance test for r-ser from a 26 shot sequence would have power = .265 (assuming alpha = .05, 2-tailed). Here NHST fails to detect a truly nonrandom dependence 73.5% of the time.
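
A rough check on this power figure is possible with a normal approximation. The sketch below assumes Fisher's z transformation, which may not be the method behind the reported .265 (the commentary does not say, and Cohen's 1988 tables are another possibility). The function name is illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist


def power_for_r(r_population, n, alpha=0.05):
    """Approximate 2-tailed power to detect a population correlation with n observations."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)        # 1.96 for alpha = .05
    shift = atanh(r_population) * sqrt(n - 3)   # expected test statistic under the alternative
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# n can be read as the 26 shots or as the 25 consecutive pairs; both are shown.
power_26 = power_for_r(0.27, 26)
power_25 = power_for_r(0.27, 25)

print(f"approximate power, n = 26: {power_26:.3f}")
print(f"approximate power, n = 25: {power_25:.3f}")
```

Either reading of n puts the approximate power near one in four, in the same neighborhood as the reported value. Raising `alpha` in the call shows how a more lenient criterion trades type I error for power.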

7. The fact that people perceive a nonrandom streak even though the observed degree of dependence between shots in a sequence is not large enough to be "statistically significant" may reflect their use of a more lenient inferential standard than NHST's alpha = .05 -- one that more evenly balances type I and type II error. Because the alpha = .05 convention has no inherently rational basis, using a more lenient standard cannot be considered irrational. It is important to remember that NHST can be strongly biased in favor of the null hypothesis. The fact that human judgment differs from conclusions based on NHST should not automatically be taken to imply irrationality.


8. This commentary does not claim that people perceive chance with perfect accuracy. There is substantial evidence that they do not (Gilovich et al., 1985; Tversky & Kahneman, 1971; Wagenaar, 1972). Instead, it suggests that using NHST mechanically may lead researchers to overlook its inherent limitations and biases, thereby impairing conceptual understanding. Rather than blindly assuming that all data fit NHST's underlying assumptions, researchers should explicitly try to model the phenomena under investigation. Such efforts will provide insight about the phenomena and may help to identify when alternative methodologies are appropriate. Besides being assumption-dependent, conclusions based on significance levels are often erroneous and biased in the nature of their errors (e.g., the prevalence of type II error over type I error). Researchers should recognize that NHST, with its alpha = .05 convention, is biased against rejecting the null hypothesis, often failing to obtain statistical significance when real effects actually exist. To define irrationality in terms of statistical significance is to assume without justification that lay judgment uses a similarly strict criterion.

9. Some points made in this commentary are speculative and put forth for the sake of argument. For example, based on so few data, it would be impossible to know whether the suggested alternative model of the shooting process (i.e., 40% streak shooters and 60% deliberate shooters) better represents what happens when people actually shoot basketballs. Similarly, no direct evidence supports the claim that people use a more lenient criterion than alpha = .05 to declare a sequence a nonrandom streak. Nevertheless, there is conceptual value in trying to model substantive phenomena and consciously choosing methods to fit those models, rather than mechanically using NHST without considering its appropriateness for one's data.


[1] These data are available at: Thanks to Alan Reifman for finding this website.

[2] The near-perfect correlation between the first and second round r-ser's for the four players advancing to the second round of the 1998 long distance shooting contest (r = .986) supports the idea that "streak" and "deliberate" shooting may be a stable individual tendency.


Chow, S. L. (1998). Precis of "Statistical significance: Rationale, validity and utility." Behavioral and Brain Sciences, 21, 169-240.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Gilovich, T., Vallone, R. & Tversky, A. (1985). The hot hand in basketball: On misperceptions of random sequences. Cognitive Psychology, 17, 295-314.

Gonzalez, R. (1994). The statistics ritual in psychological research. Psychological Science, 6, 321, 325-328.

Krueger, J. (1998). The bet on bias: A foregone conclusion? PSYCOLOQUY 9(46).

Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161-171.

Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

Wagenaar, W. A. (1972). Generation of random sequences by human subjects: A critical survey of the literature. Psychological Bulletin, 77, 65-72.
