DISCUSSION
The last research question examines the relationship between peer-peer interaction outcomes and different contexts and research methods. With respect to context variables, language setting, institution, and the number of L1s were found to moderate the effect. It should be noted that studies generally obtained larger effects in universities than in intensive language programs, which might be partially attributed to proficiency differences: university-level English learners may generally perform better in English interaction assessment than learners studying English in a language program. Also noteworthy is that studies that sampled interaction only from candidates sharing the same L1 obtained larger effects. This result indicates that L1 variation should also be considered when designing peer-peer interaction studies in English speaking assessment.
Several method variables were also found to moderate the effect. As discussed in the literature review, it is necessary to examine the relationship between study quality and outcome (Plonsky, 2011b). One important observation concerns the proficiency standards adopted by primary studies. Studies that combined multiple proficiency measures (e.g., standardized test results, in-house test results, length of language learning, and impressionistic judgments by researchers or instructors) obtained larger effects than those that relied solely on standardized test scores. This suggests that combined proficiency measures can place participants into appropriate levels more reliably and thus reflect the real magnitude of the difference between two groups. These findings indicate that it is probably more reliable to use multiple means to determine candidates’ proficiency levels. Additionally, the results of research question three suggest that proficiency affects peer-peer interaction, which further accentuates the importance of placing candidates into appropriate proficiency levels.
Another noteworthy observation concerning method variables relates to reliability reporting practices. The findings contrast with the common assumption that studies reporting coder reliability, rater reliability, or rubrics produce larger effect sizes than studies that report no reliability measures. One possible reason is the contribution of other factors that affect statistical power (e.g., sample size and the choice of significance level). For instance, only a limited number of studies in this meta-analysis reported coder or rater reliability, and their sample sizes were relatively small. Nevertheless, the results of studies with reliability estimates showed low dispersion, which demonstrates the effect of reliability control on study quality. In brief, these findings provide evidence of a potential relationship between the methodological features of primary studies and the interaction outcomes they observe.