5. Results and Discussion
In this study, three IRT-based DIF methods were used to identify items that function differently with respect to student gender in the TIMSS 2011 mathematics subtest, and their results were compared. In addition, item purification was performed for each method to see how it affected the number of DIF items and the DIF statistics, and to compare the results of the methods. Comparing findings from different methods can provide insight into whether differences are due to the different assumptions and criteria embedded within the methods. Moreover, convergent findings across methods are more likely to prompt content experts to modify or remove items with consistent DIF of high magnitude (Yang et al., 2011).
The results indicated that two items (m2, m12) were identified as DIF items by all three methods, whereas 12 other items were never identified as such. Four items (m2, m8, m12 and m16) were flagged by both Lord's chi-square and Raju's area methods; of these, m8 and m16 were not flagged by the LRT method. In contrast, item m19 was flagged only by the LRT method. Although almost all items flagged by the three methods favored male students, Raju's signed area method with item purification indicated that items 8 and 21 favored female students in mathematics.
Performing item purification with Lord's chi-square and Raju's area methods affected both the number of DIF items and which items were flagged. However, item purification with the LRT method did not affect the number of items detected as DIF. According to the results, Lord's chi-square method tended to be more sensitive than the other two methods in detecting DIF items. By contrast, even when item purification was performed, the LRT method failed to detect many of the items flagged as DIF by the other methods.
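The reason purification can change both the number and the identity of flagged items can be illustrated with a minimal sketch of an iterative purification loop. This is not the procedure used in the study: the function `purify`, the callable `detect`, and the toy rule `toy_detect` are hypothetical names introduced for illustration, and the item labels are invented. The sketch assumes purification works by re-running a DIF test with previously flagged items removed from the anchor set until the flagged set stops changing.

```python
from typing import Callable, FrozenSet, Iterable, Set


def purify(items: Iterable[str],
           detect: Callable[[FrozenSet[str]], Set[str]],
           max_iter: int = 10) -> Set[str]:
    """Iteratively re-run a DIF test, excluding flagged items from the anchors.

    `detect` is a hypothetical callable: given the current anchor items,
    it returns the set of items flagged as DIF.
    """
    all_items = frozenset(items)
    flagged: Set[str] = set()
    for _ in range(max_iter):
        anchors = all_items - flagged       # anchors = items not yet flagged
        new_flagged = set(detect(anchors))  # re-test with purified anchors
        if new_flagged == flagged:          # stable set: purification converged
            break
        flagged = new_flagged
    return flagged


def toy_detect(anchors: FrozenSet[str]) -> Set[str]:
    """Invented rule: "i3" is always flagged; "i5" is flagged only once
    "i3" no longer contaminates the anchor set."""
    result = {"i3"}
    if "i3" not in anchors:
        result.add("i5")
    return result


result = purify(["i1", "i2", "i3", "i4", "i5"], toy_detect)
```

In this toy case the first pass flags one item and the purified pass flags two, which mirrors how purification may alter the flagged set for some methods while leaving it unchanged for others.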
As expected, the three IRT-based techniques showed substantial agreement in detecting DIF within the same set of mathematics subtest items, but they varied in the number of items flagged because of the different assumptions and criteria they use. This study has served as a review of IRT-based DIF methods that can be used with a dichotomously scored large-scale mathematics test. Although the number of items displaying DIF differed across methods because of their different criteria, it is also important to examine each flagged item carefully to try to explain why it displays DIF.
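Agreement among methods of the kind discussed above can be summarized with simple set operations. The following sketch is illustrative only: `agreement_summary` is a hypothetical helper, the method names and item labels are invented, and the flagged sets are not the study's results. Pairwise agreement is computed here as the proportion of items on which two methods reach the same flag or no-flag decision, which is one of several possible agreement measures.

```python
from itertools import combinations
from typing import Dict, List, Set


def agreement_summary(flags: Dict[str, Set[str]], items: List[str]) -> dict:
    """Summarize overlap between the DIF flags of several methods."""
    sets = list(flags.values())
    by_all = set.intersection(*sets)         # flagged by every method
    never = set(items) - set.union(*sets)    # flagged by no method
    only = {
        name: s - set.union(set(), *(t for n, t in flags.items() if n != name))
        for name, s in flags.items()
    }                                        # flagged by exactly this method
    pairwise = {
        (a, b): (len(items) - len(flags[a] ^ flags[b])) / len(items)
        for a, b in combinations(flags, 2)
    }                                        # share of identical decisions
    return {"flagged_by_all": by_all, "never_flagged": never,
            "flagged_only_by": only, "pairwise_agreement": pairwise}


# Invented example data, not taken from the study.
items = ["q1", "q2", "q3", "q4", "q5", "q6"]
flags = {
    "method_A": {"q1", "q2", "q3"},
    "method_B": {"q1", "q2"},
    "method_C": {"q1", "q5"},
}
summary = agreement_summary(flags, items)
```

Items that appear in `flagged_by_all` are natural first candidates for content review, while items in `flagged_only_by` point to method-specific assumptions worth inspecting.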