Measures of the Magnitude of DIF (Effect size)
Two points are noteworthy at this juncture. First, as per usual in statistical
hypothesis testing, the test statistic should be accompanied by some measure of the
magnitude of the effect. This is necessary because small sample sizes can hide interesting
statistical effects whereas large sample sizes (like the ones found in typical psychometric
studies) can point to statistically significant findings where the effect is quite small and
meaningless (Kirk, 1996). Second, I endorse the advice of Zumbo and Hubley (1998)
who urge researchers to report effect sizes for both statistically significant and for
statistically non-significant results. Following this practice, with time the psychometric
community will have amassed an archive of effects for both statistically significant and
non-significant DIF and therefore we can eventually move away from the somewhat
arbitrary standards set by Cohen (1992).
Measuring the magnitude of DIF follows, as it should, the same strategy as the
statistical hypothesis testing except that one only works with the R-squared values at each
step. Zumbo and Thomas (1997) indicate that an examination of both the 2-df Chi-square
test (of the likelihood ratio statistics) in logistic regression and a measure of effect size is
needed to identify DIF. Without an examination of effect size, trivial effects could be
statistically significant when the DIF test is based on a large sample size (i.e., too much
statistical power). The Zumbo-Thomas measure of effect size for R-squared parallels effect size
measures available for other statistics (see Cohen, 1992).
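
As a minimal sketch of the two quantities just described, the following Python fragment computes the 2-df likelihood-ratio Chi-square and the R-squared effect size from the output of already-fitted models. It assumes the usual three-step hierarchy (step 1 enters the conditioning variable only; step 3 adds the grouping variable and the interaction), and the function names and numeric inputs are hypothetical illustrations, not values from any data set discussed here.

```python
import math

def lr_chi_square_2df(loglik_step1, loglik_step3):
    """2-df likelihood-ratio Chi-square comparing step 1 with step 3.

    Assumes both log-likelihoods come from nested logistic regression
    models fitted to the same cases.
    """
    chi_sq = max(0.0, 2.0 * (loglik_step3 - loglik_step1))
    # With exactly 2 degrees of freedom, the upper-tail probability of
    # the Chi-square distribution has the closed form exp(-x / 2).
    p_value = math.exp(-chi_sq / 2.0)
    return chi_sq, p_value

def dif_effect_size(r2_step1, r2_step3):
    """R-squared change from step 1 to step 3 (the effect size)."""
    return r2_step3 - r2_step1

# Hypothetical model output, for illustration only.
chi_sq, p_value = lr_chi_square_2df(loglik_step1=-612.4, loglik_step3=-601.9)
effect = dif_effect_size(r2_step1=0.412, r2_step3=0.431)
print(f"Chi-square(2) = {chi_sq:.1f}, p = {p_value:.6f}, R-squared change = {effect:.3f}")
```

With these made-up inputs the test is highly significant while the R-squared change is small, which is exactly the situation in which the effect size guards against over-interpreting a large-sample result.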
For an item to be classified as displaying DIF, the two-degree-of-freedom Chi-squared
test in logistic regression had to have a p-value less than or equal to 0.01 (set
at this level because of the multiple hypotheses tested) and the Zumbo-Thomas effect size
measure had to be at least an R-squared of 0.130. Pope (1997) has applied a similar
criterion to binary personality items. It should be noted that Gierl and his colleagues
(Gierl & McEwen, 1998; Gierl, Rogers, & Klinger, 1999) have adopted a more
conservative criterion (i.e., the requisite R-squared for DIF is smaller) for the
Zumbo-Thomas effect size in the context of educational measurement. They have also shown
that the Zumbo-Thomas effect size measure is correlated with other DIF techniques such as
the Mantel-Haenszel and SIBTEST, hence lending validity to the method.
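
The two-part classification rule stated above (a p-value of at most 0.01 together with an R-squared effect size of at least 0.130) can be written as a small decision function. This is only an illustrative sketch: the function name and the example inputs are hypothetical, and the thresholds are exposed as parameters because, as noted, other researchers have adopted a different R-squared cut-off.

```python
def flags_dif(p_value, r2_change, alpha=0.01, min_r2_change=0.130):
    """Return True only when BOTH criteria are met.

    p_value       -- p-value of the 2-df Chi-squared test
    r2_change     -- R-squared effect size for the item
    alpha         -- significance cut-off (0.01 in the text)
    min_r2_change -- minimum R-squared effect size (0.130 in the text)
    """
    return p_value <= alpha and r2_change >= min_r2_change

# Hypothetical items, for illustration only.
print(flags_dif(p_value=0.00003, r2_change=0.019))  # significant but trivial effect
print(flags_dif(p_value=0.001, r2_change=0.150))    # significant and sizeable effect
print(flags_dif(p_value=0.040, r2_change=0.150))    # sizeable effect, not significant
```

Only the second hypothetical item satisfies both conditions and would be classified as displaying DIF.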
In summary, I have found that a useful practice is to compute the R-squared effect
for both (a) uniform DIF, and (b) a simultaneous test of uniform and non-uniform DIF.
This strategy is useful because one is able to take advantage of the hierarchical nature of
DIF modeling and therefore compare the R-squared for uniform DIF with the
simultaneous uniform and non-uniform DIF to gauge the magnitude of non-uniform
DIF. The examples will demonstrate this approach.
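
One way to sketch the comparison described in this summary is to take the R-squared values from the three hierarchical steps and difference them. The sketch assumes that step 1 contains the conditioning variable only, step 2 adds the grouping variable, and step 3 adds the interaction; treating the remainder as the non-uniform portion is an interpretive convenience, and the function name and input values are hypothetical.

```python
def decompose_dif_effect(r2_step1, r2_step2, r2_step3):
    """Split the R-squared effect into uniform and non-uniform parts."""
    uniform = r2_step2 - r2_step1    # (a) uniform DIF
    combined = r2_step3 - r2_step1   # (b) uniform and non-uniform DIF together
    non_uniform = combined - uniform # what the interaction adds beyond (a)
    return {"uniform": uniform, "combined": combined, "non_uniform": non_uniform}

# Hypothetical R-squared values, for illustration only.
parts = decompose_dif_effect(r2_step1=0.412, r2_step2=0.427, r2_step3=0.431)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case most of the combined effect is attributable to uniform DIF, with the interaction term adding little.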