To measure the degree to which our two codings of student responses about gravity agreed, we used a matrix
similar to Table 1 to calculate the Cohen's kappa inter-rater reliability statistic, κ, for each question. Inter-rater
reliabilities greater than 0.80 are generally accepted as good agreement (Landis and Koch 1977). If the categorizations
for any question resulted in a κ below 0.80, the category definitions were discussed and clarified in an iterative process
until acceptable agreement was reached; however, no explicit information was exchanged about which responses
each researcher had placed in each category.
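For reference, κ compares the observed proportion of agreement with the agreement expected by chance from each rater's marginal category frequencies, κ = (p_o − p_e)/(1 − p_e). The sketch below is not the analysis code used in this study, and the codings shown are hypothetical; it simply illustrates how κ can be computed from two researchers' category assignments for the same set of responses.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Compute Cohen's kappa for two raters' codings of the same items."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Observed agreement: fraction of items both raters placed in the same category.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: product of each rater's marginal frequencies, summed over categories.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of ten student responses into categories A-C.
rater1 = ["A", "A", "B", "C", "B", "A", "C", "C", "B", "A"]
rater2 = ["A", "A", "B", "C", "A", "A", "C", "B", "B", "A"]
print(f"kappa = {cohens_kappa(rater1, rater2):.2f}")  # 0.69 for these hypothetical data
```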