Philip A. Higham (1996) Measuring Recall Performance. Psycoloquy: 7(38) Witness Memory (13)

PSYCOLOQUY (ISSN 1055-0143) is sponsored by the American Psychological Association (APA).
Psycoloquy 7(38): Measuring Recall Performance

Commentary on Memon & Stevenage on Witness-Memory

Philip A. Higham
Department of Psychology
University of Northern British Columbia
3333 University Way
Prince George, B.C.

Wayne T. Roberts
Royal Canadian Mounted Police Detachment
999 Brunswick Street
Prince George, B.C.


Higham and Roberts's (1996) position on measures of performance in the cognitive interview (CI) is clarified in light of Fisher's (1996) commentary. Also, percent correct and a measure of sensitivity derived from signal detection theory are compared for two hypothetical interviewees whose response output is varied.


Cognitive interview, errors, eyewitness memory, facilitated recall, police procedures, questioning, recovered memories, structured interview.


1. Fisher's (1996) reply to Memon and Stevenage's (1996) target article on the cognitive interview addresses the important issue of how best to measure recall performance. The purpose of this commentary is to clarify our (Higham & Roberts, 1996) position on this issue and to contrast percent correct and other accuracy measures of memory performance using a simulation.


2. According to Fisher, both Memon and Stevenage (1996) and Higham and Roberts (1996) have claimed that "...the CI leads to less accurate recollection than do the Structured or Standard Interviews" (paragraph 6). He states that these claims are erroneous because they are based on the "wrong statistic" (paragraph 7); namely, we examined the number of errors rather than the proportion of errors. However, contrary to Fisher's assertion, we did not state that the cognitive interview leads to "less accurate recollection" than either the Structured Interview (SI) or the Standard Interview. Our position was, and still is, that there is no meaningful difference in accuracy between the SI and the CI, but that there are significantly more statements (accurate and inaccurate) with the CI. This is stated clearly in paragraph 2 of our commentary: the "increase in overall output, without an associated increase in accuracy, is reminiscent of the effect that hypnosis is theorized to have on memory recall." We did, however, claim that the interview was of "low efficacy," which we still believe to be true. In our view, the sheer fact that the accuracy rate of the CI differs so little from the SI, given that the CI was developed presumably to enhance recall, substantiates this claim.


3. Fisher argues that the only useful measure of recall performance is percent correct because it equates for differences in the total number of responses (paragraphs 8, 9 and 10). We disagree with Fisher's claims on both qualitative and quantitative grounds. On the qualitative side, interview performance can be assessed at various levels with various statistics, some of which are sensitive to overall output without concern for accuracy (see Goldsmith & Koriat, 1996, for discussion). Which measure or measures are most appropriate depends on the goal of the interview and the dimensions on which the interview is being assessed. For example, during the initial part of an investigation of a crime, an interview that produces many statements may be preferable to one with only a few, even though the accuracy is the same (see Goldsmith & Koriat, 1996, for a similar observation). At this stage, obtaining as many leads as possible might be the goal of the interview. Conversely, in a courtroom setting, an interview containing a large absolute number of errors might be used to argue that a particular witness or victim is unreliable, despite the fact that the accuracy of the interview might be high. Under these circumstances, a low output interview might be preferable to one with high output. Clearly, there are many factors to be considered when assessing the worth of a particular interviewing technique. To the extent that these factors are to be measured, indices other than percent correct must be considered.


4. We concur with Fisher that determining accuracy, corrected for overall output, is important. However, we are not convinced that percent correct is the only, or even the best, metric for this purpose. As a thought experiment, consider a signal-detection model in which a person experiences certain mental events that vary along a dimension of familiarity while trying to retrieve information during an interview. The person might then set a criterion along this familiarity dimension such that mental events that are above it are reported, whereas those below it are not. Accordingly, a reported event that is accurate is a "hit," whereas a reported incorrect statement is a "false alarm" (fa). Percent correct, then, is calculated as (hits/[hits + fas]).

5. With recall data, it is impossible to determine the number of "correct rejections" (crs; mental events below the criterion that would have been inaccurate if reported) or the number of "misses" (mental events below the criterion that would have been correct if reported). Consequently, one cannot determine hit and fa rates (hits/[hits + misses]; fas/[fas + crs], respectively), which are normally necessary to calculate measures of sensitivity and bias in signal detection theory. Nonetheless, it might be informative to generate values for misses and crs for two hypothetical subjects, one with a liberal criterion (CI) and another with a conservative criterion (SI), and to compare percent correct scores with a sensitivity measure. A' (Donaldson, 1992), a nonparametric measure of sensitivity, was chosen for the simulation, with the corresponding nonparametric measure of bias (B"). The total number of mental events each subject experienced was held constant (40), as was the total number of mental events that were true memories (hits + misses = 25) and the total number that were not (fas + crs = 15). However, the number of mental events that were reported (i.e., above the criterion) was greater for the CI (30 statements; 20 correct, 10 incorrect) than for the SI (9 statements; 6 correct, 3 incorrect). Data from these hypothetical subjects are shown below.

Criterion            hits   fas   misses   crs   %Correct    A'     B"
Liberal (CI)          20     10      5       5     66.7      .64   -.78
Conservative (SI)      6      3     19      12     66.7      .55   +.85
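
The tabled values can be checked with a short calculation. The sketch below is not taken from the commentary, which does not print its formulas. It assumes that A' is computed with the usual nonparametric formula for a hit rate at least as large as the fa rate, and that B" is the bias index proposed by Donaldson (1992), often written B"D. Under those assumptions the two hypothetical subjects come out at the tabled values to the precision shown. The function names are illustrative only.

```python
def percent_correct(hits, fas):
    """Percentage of reported statements that are accurate: hits / (hits + fas)."""
    return 100.0 * hits / (hits + fas)


def a_prime(h, f):
    """Nonparametric sensitivity A' (assumed formula; this sketch needs 0 < f <= h < 1)."""
    if not 0 < f <= h < 1:
        raise ValueError("this sketch only handles 0 < f <= h < 1")
    return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))


def b_double_prime_d(h, f):
    """Bias index assumed to be Donaldson's B''D; negative values mean a liberal criterion."""
    unreported = (1 - h) * (1 - f)  # miss rate times correct-rejection rate
    reported = h * f                # hit rate times fa rate
    return (unreported - reported) / (unreported + reported)


def summarize(hits, fas, misses, crs):
    """Compute the three tabled measures from the four cell counts."""
    h = hits / (hits + misses)  # hit rate
    f = fas / (fas + crs)       # fa rate
    return {
        "percent_correct": round(percent_correct(hits, fas), 1),
        "a_prime": round(a_prime(h, f), 2),
        "b_double_prime": round(b_double_prime_d(h, f), 2),
    }


# Cell counts for the two hypothetical subjects described in paragraph 5.
ci = summarize(hits=20, fas=10, misses=5, crs=5)
si = summarize(hits=6, fas=3, misses=19, crs=12)
print("CI:", ci)
print("SI:", si)
```

Note that percent correct uses only the two reported cells (hits and fas), whereas A' and B" also depend on the unreported cells (misses and crs), which recall data do not supply.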

6. A few things are worth mentioning about these data. First, note that the bias measure (B") reflects a more liberal criterion for the CI subject (-.78) than for the SI subject (+.85). Note also that the percent correct measure is identical for the two types of interview (66.7%). This pattern of differing bias with a fairly constant percent correct between the SI and CI is typical of results in the literature. However, despite the comparability of the percent correct scores, the measure of sensitivity (A') indicates that the CI subject was better able to discriminate memories from other kinds of mental events (.64) than was the SI subject (.55). In other words, the CI and SI differed not only in terms of overall output, but also in effectiveness. However, percent correct did not reflect this difference in sensitivity.

7. Note that the purpose of the simulation is not to make any statements about what is generally true with the CI and/or the SI. Depending on what kind, and how many, unreported mental events an interviewee experiences (i.e., misses and crs), different values of sensitivity will be obtained. Nor is the purpose of the simulation to suggest A' as a measure of recall sensitivity rather than percent correct; as already mentioned, recall data do not provide enough information to perform a traditional signal detection analysis. Rather, the simulation was designed to demonstrate that percent correct should not necessarily be considered the "gold standard"; it is only one of many measures that can be chosen to index memory performance. Given the differences between percent correct and sensitivity measures in the simulation, we believe that there is good reason to try to develop alternative measures of recall performance, and that Fisher's emphasis on percent correct is unnecessarily limiting.
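
The dependence of sensitivity on the unreported events can be illustrated with a small sketch. It holds the reported statements fixed at the CI subject's values (20 hits, 10 fas) and varies the misses and crs. Only the first pair of unreported counts (5 misses, 5 crs) comes from the simulation above; the other two pairs are hypothetical values chosen here purely for illustration, and they do not keep the total number of mental events at 40. The A' formula is the same assumed nonparametric formula for a hit rate at least as large as the fa rate; it is not printed in the commentary.

```python
def a_prime(h, f):
    """Nonparametric sensitivity A' (assumed formula; this sketch needs 0 < f <= h < 1)."""
    if not 0 < f <= h < 1:
        raise ValueError("this sketch only handles 0 < f <= h < 1")
    return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))


# Reported statements, fixed at the CI subject's values from the simulation.
REPORTED_HITS, REPORTED_FAS = 20, 10

# (misses, crs) pairs. Only (5, 5) is from the commentary; the rest are
# hypothetical illustrations of different unreported mental events.
scenarios = [(5, 5), (5, 20), (20, 20)]

results = []
for misses, crs in scenarios:
    h = REPORTED_HITS / (REPORTED_HITS + misses)  # hit rate
    f = REPORTED_FAS / (REPORTED_FAS + crs)       # fa rate
    pc = 100.0 * REPORTED_HITS / (REPORTED_HITS + REPORTED_FAS)
    results.append((misses, crs, round(pc, 1), round(a_prime(h, f), 2)))

for misses, crs, pc, ap in results:
    print(f"misses={misses:2d} crs={crs:2d} %correct={pc} A'={ap}")
```

Percent correct is the same in every row because it ignores the unreported cells, whereas A' changes with them. This is why a sensitivity value cannot be recovered from recall data alone.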


Donaldson, W. (1992). Measuring recognition memory. JOURNAL OF EXPERIMENTAL PSYCHOLOGY: GENERAL, 121, 275-277.

Fisher, R.P. (1996). Misconceptions in Design and Analysis of Research with the Cognitive Interview. PSYCOLOQUY 7(35) witness-memory.12.fisher.

Goldsmith, M., & Koriat, A. (1996). The Assessment and Control of Memory Accuracy. PSYCOLOQUY 7(23) witness-memory.9.goldsmith.

Higham, P.A., & Roberts, W.T. (1996). Analyzing States of Consciousness during Retrieval as a way to Improve the Cognitive Interview. PSYCOLOQUY 7(17) witness-memory.4.higham.

Memon, A., & Stevenage, S.V. (1996). Interviewing Witnesses: What Works and What Doesn't? PSYCOLOQUY 7(6) witness-memory.1.memon.
