is not clear whether congruence in itself (and hence ease of association)
influences reliability.
Somewhat surprisingly, very few trials seem to be needed to obtain
a minimum reliability of .60 (e.g., approximately 30 trials appear
to suffice for a well-designed GNAT). There is, however, some variability
between blocks in each design, and the weakest blocks often
required 10–20 more trials than the average to match the target
reliability. Interestingly, there is no effect of target-attribute congruence:
congruent blocks are not noticeably more reliable. Given the
observed variability, researchers need to consider carefully whether
overall or blockwise reliability is important in their designs.
General discussion
What can we say about the reliability of the GNAT in general? It
must be emphasized that a test's reliability is a function of its structure,
format, and content. It would seem absurd to expect a pencil
and paper personality test of randomly generated items to have reliability
as good as, say, the NEO-PI-R (Costa & McCrae, 1992). Similarly,
we cannot expect a GNAT for a randomly selected construct
comprising untested items to have a high reliability simply because
it is a GNAT. However, in contrast to the gloomy view of GNAT reliability
held by some, the present paper shows that GNATs can achieve
good reliability and, given the wide range of constructs examined, we
feel confident that our results represent likely reliability values for
GNATs measuring other constructs. Based on our results, 30–40 trials
per block is a rough starting point for creating a GNAT with
"acceptable" reliability (r > .60–.70), and 80–90 trials per block is
likely to yield very good reliability (r > .80). It seems that reliabilities
of .90 are very hard to achieve; the number of trials required is
extremely large and likely to be burdensome to participants. A
GNAT could be more precisely designed by collecting pilot data with
40–50 trial blocks and then using the MCALC procedure to estimate
the appropriate block length for a chosen level of reliability.
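To make this concrete, the following R sketch illustrates the general logic of such a pilot simulation. It is a simplified illustration rather than the full MCALC implementation: the data layout (a list, pilot_data, of per-participant data frames with logical columns signal and resp), the log-linear correction, and all function names are assumed for illustration only.

# Illustrative sketch of a Monte Carlo block-length simulation.
# Assumes pilot_data is a list with one data frame per participant,
# each with logical columns `signal` (go trial?) and `resp` (go response?).

dprime <- function(signal, resp) {
  # Log-linear correction avoids infinite z-scores for perfect rates
  hit <- (sum(resp[signal]) + 0.5) / (sum(signal) + 1)
  fa  <- (sum(resp[!signal]) + 0.5) / (sum(!signal) + 1)
  qnorm(hit) - qnorm(fa)
}

simulate_reliability <- function(pilot, n_trials, n_sims = 1000) {
  r <- replicate(n_sims, {
    halves <- sapply(pilot, function(p) {
      idx  <- sample(nrow(p), n_trials, replace = TRUE)  # resample a block
      half <- sample(rep(1:2, length.out = n_trials))    # random split
      c(dprime(p$signal[idx[half == 1]], p$resp[idx[half == 1]]),
        dprime(p$signal[idx[half == 2]], p$resp[idx[half == 2]]))
    })
    cor(halves[1, ], halves[2, ])    # split-half correlation across participants
  })
  mean(2 * r / (1 + r))              # Spearman-Brown step-up: half to full block
}

# Estimated full-block reliability for several candidate block lengths
sapply(c(30, 40, 60, 80), simulate_reliability, pilot = pilot_data)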
The results of Study 2 indicate that reliability coefficients from
split-half estimates (cf. Study 1) underestimate the reliability of full-length
blocks. Comparison of the results of Studies 1 and 2 suggests that
split-half estimates should be revised upwards by 10–20%, depending
on the construct and block length. While there is no reason that
the Spearman–Brown prediction formula should apply to correcting
split-half reliability estimates of GNAT d′ scores, our results indicate
that in practice it is a good empirical approximation. However,
using pilot data to simulate the reliability of different block lengths,
as we did in Study 2, is preferable and far more defensible. R
source code that researchers can use in their own studies is available
from the authors.
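For reference, the Spearman–Brown step-up for doubling test length is r_full = 2r / (1 + r). The snippet below applies it to a hypothetical split-half estimate; the input value is illustrative only, though the size of the correction is consistent with the 10–20% revision noted above.

# Spearman-Brown step-up of a split-half estimate to full block length
# (applied heuristically here; see text for why it lacks a formal
# justification for d' scores)
sb_full <- function(r_half) 2 * r_half / (1 + r_half)
sb_full(0.70)   # approximately 0.82, an upward revision of about 18%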
We have presented a logical argument for why widely used
methods (such as the otherwise desirable Cronbach's alpha) cannot be
directly applied to GNATs, reviewed the relative merits of existing
methods, and explained why standard corrections cannot be applied to
split-half estimates of GNAT reliability. We have advanced a conceptual
argument for a statistic that should be a good reliability indicator and
have also shown how empirical distributions can be used to interpret
reliability estimates derived from various split-half methods.
Reliability considerations in designing GNATs and diagnosing GNAT
problems
Until a more complete solution to the problem of GNAT reliability
is discovered, we recommend the following approach for designing
GNATs and assessing their reliability. Firstly, calculate split-half reliabilities
using odd/even and first-half/second-half splits, and obtain
the distribution of a large number of random split-half reliability
estimates (e.g., RaSSH). If the odd/even reliability and the RaSSH mean
are similar, researchers should have confidence that the reliability of
the GNAT is not unduly influenced by practice effects, but must
recognize that both statistics underestimate the true reliability.
Researchers can then use the MCALC or, at a pinch, the Spearman–
Brown formula as a guide to the true reliability (i.e., corrected for
test length). Researchers designing new GNATs could use the
MCALC on pilot data to estimate the block length required to achieve
a given level of reliability. A tight RaSSH or MCALC distribution can be
considered evidence against GNAT sensitivity to both sequence and
item-sampling effects and is also consistent with, though it does not
guarantee, the GNAT measuring a single construct. We have not yet
investigated "rational" GNAT design, that is, calculating reliability statistics
for different item combinations to identify poor versus good items.
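The following R sketch illustrates these estimators for a single block. It reuses the illustrative dprime() function from the earlier sketch and again assumes a list of per-participant data frames (here called block) with logical signal and resp columns; it is a simplified illustration, not our analysis code.

# Split-half reliability under an arbitrary splitting rule.
# `pick_half` maps a trial count n to a logical vector (TRUE = half 1);
# the rule is applied per participant.
split_half_r <- function(block, pick_half) {
  halves <- sapply(block, function(p) {
    h <- pick_half(nrow(p))
    c(dprime(p$signal[h],  p$resp[h]),
      dprime(p$signal[!h], p$resp[!h]))
  })
  cor(halves[1, ], halves[2, ])
}

odd_even   <- split_half_r(block, function(n) seq_len(n) %% 2 == 1)
first_last <- split_half_r(block, function(n) seq_len(n) <= n / 2)

# RaSSH: distribution of reliabilities over many random splits
# (a fresh random split is drawn for each participant and replication)
rassh <- replicate(1000,
  split_half_r(block, function(n) sample(n) <= n / 2))

c(odd_even = odd_even, first_last = first_last,
  rassh_mean = mean(rassh), rassh_sd = sd(rassh))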
Should the above conditions not hold, GNAT designers can consider
the following points in diagnosing GNAT problems:
1. Investigate carefully instances where the odd/even and RaSSH mean
reliability estimates differ markedly. Since the odd/even split balances
practice and fatigue across the two halves, one might expect
these halves to show higher correlations than randomly chosen
splits. In the GNATs we studied, the odd/even estimate was
generally slightly lower than the RaSSH mean. Because the RaSSH
averages out random fluctuations and item-sampling effects, a
RaSSH mean larger than the odd/even correlation most likely indicates
that random and item-sampling effects have a bigger impact
on GNAT score consistency than learning effects. The reverse
would be true where the odd/even reliability is greater than the
RaSSH mean.
2. Use the same block length for all blocks that will be compared. Choose
block length carefully, particularly when large differences in d′ are
observed between the first and second halves of a block or when the
first-half/second-half reliability differs markedly from the RaSSH
mean. In all our designs, d′ for the second half of GNAT blocks was
generally higher than for the first half, indicating learning
effects. Large differences between d′ for the first and second halves
of a block, or large differences between reliability estimates derived
from first/second-half splits and those from other methods, should
alert researchers to a critical dependency of d′ magnitude and
consistency on block length. These undesirable effects should
generally decrease with increasing block length. Although we
have not seen designs that use unequal block lengths, the
present study provides a strong case for always using equal block
lengths to avoid spurious between-block differences in d′, particularly
for short blocks.
3. Examine RaSSH variability for an indication of item quality and
sequence effects (see the sketch following this list). If the RaSSH
distribution shows high variability or is platykurtic, block length
and item characteristics need to be examined. Even if the odd/even
estimate and the RaSSH mean agree, high RaSSH variability for
short blocks (e.g., fewer than 40 trials) indicates that blocks may be
too short to yield stable scores. Wide RaSSH dispersion for long
blocks, or a platykurtic RaSSH distribution, indicates undesirable
levels of item-sampling variability and suggests that items are of
poor quality, are heterogeneous, or are tapping multiple constructs.
Strong local sequence effects could also be indicated by high RaSSH
variability coupled with a leptokurtic RaSSH distribution, where a
few particular trial combinations yield extreme values. RaSSH
distributional anomalies coupled with large discrepancies between
different reliability estimators should be considered particularly
problematic.
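As a simple illustration of points 1 and 3, the sketch below summarizes the RaSSH distribution obtained earlier. The variable names (odd_even, rassh) carry over from the previous sketch, and the kurtosis measure is the ordinary moment-based excess kurtosis; the specific code is illustrative, not our analysis pipeline.

# Moment-based excess kurtosis of the RaSSH distribution
excess_kurtosis <- function(x) {
  m2 <- mean((x - mean(x))^2)
  mean((x - mean(x))^4) / m2^2 - 3   # > 0 leptokurtic, < 0 platykurtic
}

c(oe_minus_rassh = odd_even - mean(rassh),  # point 1: learning vs. sampling effects
  rassh_sd       = sd(rassh),               # point 3: item/sequence variability
  rassh_kurtosis = excess_kurtosis(rassh))  # point 3: distribution shape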
Summary and conclusion
Tests used in social and clinical psychology are required to come
with some statement about their reliability, without which they are
interpreted with suspicion at best, and simply not used at worst.
Our results indicate that GNATs can be reliable and that a simple
alternating-item split-half correlation provides a usable estimate of