Furthermore, the consistently larger scores on the second half of each
block are evidence of substantial learning effects. Researchers are
urged to consider practice and block length effects in their GNAT
designs.
While some very high and very low correlations were observed at
the extremes of the RaSSH distributions, the distributions were close
to symmetric about the mean (and median) and displayed relatively
small spread. This pattern of results indicates that while using a single
split to calculate reliability could result in substantial over- or underestimates,
a single random split will generally yield an unbiased estimate
of the mean of all split halves.
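The behaviour of a distribution of random split halves can be illustrated with a minimal sketch. It is not the analysis code used for the studies reported here. It assumes simulated data, a per-participant score equal to the mean of the trials in each half, and an uncorrected Pearson correlation between halves; whether a correction such as Spearman-Brown should be applied is left open. The names `simulate_scores`, `pearson`, and `random_split_half_correlations` are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_scores(n_participants=40, n_trials=60, seed=1):
    """Simulated data (an assumption): a stable individual difference plus trial noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_participants):
        ability = rng.gauss(0, 1)
        data.append([ability + rng.gauss(0, 2) for _ in range(n_trials)])
    return data


def random_split_half_correlations(trials, n_splits=1000, seed=0):
    """Correlate participants' half scores over many random splits of the trials."""
    rng = random.Random(seed)
    n_trials = len(trials[0])
    indices = list(range(n_trials))
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(indices)  # a new random assignment of trials to halves
        half_a = indices[: n_trials // 2]
        half_b = indices[n_trials // 2:]
        a = [statistics.fmean(p[i] for i in half_a) for p in trials]
        b = [statistics.fmean(p[i] for i in half_b) for p in trials]
        correlations.append(pearson(a, b))
    return correlations


scores = simulate_scores()
correlations = random_split_half_correlations(scores, n_splits=500)
summary = {
    "mean": statistics.fmean(correlations),
    "median": statistics.median(correlations),
    "min": min(correlations),
    "max": max(correlations),
}
print(summary)
```

Comparing the mean and median with the minimum and maximum shows how far a single split can fall from the centre of the distribution, which is the risk described above.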
Perhaps surprisingly, the odd/even splitting method yielded very
similar reliability estimates to the RaSSH averages, although overall
the RaSSH estimates were marginally higher and varied less within
a study. While an odd/even split could result in substantial error (as
evidenced by the tails of the RaSSH distribution), these results suggest
that this has not occurred for the datasets under consideration.
It seems that the odd/even split is an acceptable method, but the
RaSSH should be preferred, being less variable and more robust to
sampling anomalies.
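One way to check where an odd/even split falls relative to random splits is sketched below. The data are simulated, the half scores are simple trial means, and the correlations are uncorrected; all three are assumptions made for illustration, and the function names are hypothetical rather than taken from the studies discussed.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def half_correlation(trials, half_a, half_b):
    """Correlate participants' mean scores on two sets of trial indices."""
    a = [statistics.fmean(p[i] for i in half_a) for p in trials]
    b = [statistics.fmean(p[i] for i in half_b) for p in trials]
    return pearson(a, b)


def odd_even_correlation(trials):
    """Split-half correlation using alternating trials (positions 0, 2, ... vs 1, 3, ...)."""
    n = len(trials[0])
    return half_correlation(trials, range(0, n, 2), range(1, n, 2))


def share_of_random_splits_below(trials, value, n_splits=500, seed=0):
    """Proportion of random split-half correlations that fall below `value`."""
    rng = random.Random(seed)
    n = len(trials[0])
    indices = list(range(n))
    below = 0
    for _ in range(n_splits):
        rng.shuffle(indices)
        r = half_correlation(trials, indices[: n // 2], indices[n // 2:])
        below += r < value
    return below / n_splits


# Simulated data (an assumption): 40 participants, 60 trials each.
data_rng = random.Random(1)
scores = []
for _ in range(40):
    ability = data_rng.gauss(0, 1)
    scores.append([ability + data_rng.gauss(0, 2) for _ in range(60)])

odd_even_r = odd_even_correlation(scores)
share_below = share_of_random_splits_below(scores, odd_even_r)
print(odd_even_r, share_below)
```

A share near 0 or 1 would place the odd/even estimate in a tail of the random-split distribution; a share near .5 would place it close to the centre.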
As expected, reliability varied as a function of GNAT content and
block length. With the exception of the “fruit–good” and “old–good”
blocks, the RaSSH reliability estimates were all above the
“acceptable” .60 cutoff and the average reliability across all datasets
was good, with some specific designs having very good reliability. It
is worth noting that the bugs–fruit data yielded one of the lowest
overall reliabilities, which leads to the rather unusual situation of
the paradigm's exemplar being weaker than its applications. This
may be because other studies assessed categories (e.g., black and
white faces) which both belong to a higher-level category (e.g.,
race). Sharing a higher-level category simplifies the task because it involves a judgment on
only one relevant dimension (De Houwer & De Bruycker, 2007).
Whatever the reason, the results indicate that topic and stimuli
contribute considerably to the reliability of a GNAT. These results
give confidence that a well-designed GNAT for measuring a clearly
defined construct can have good reliability.
Two block difference scores
It has long been known (e.g., Lord, 1963) that difference scores
generally have lower reliabilities than their component scores,