Qualitative content analysis of the interview data
The qualitative content analysis (Mayring, 2000) of the interview material shows that, according to the teachers, the “mathematical reasoning” competence developed rather heterogeneously, across different dimensions and levels, during the one-month project. Nevertheless, every student learned that claims must be supported by reasoning and arguments. Students stated that reasoning was more than simply making a calculation because an explanation was needed, too. Students described the reasoning and argumentation competence by the use of certain expressions, such as “because it is”, “therefore it is” or “why it is”. According to the students, an argument appears sound when it is logical, simple and easy to follow. It also has to be extensive. Further, the teachers believed that the students realised that the mathematical reasoning competence is complex and that achievement can differ depending on dimension and level. The students understood that judging reasoning tasks as “false” or “right” is not adequate and that “less understandable” or “well comprehended” is far more appropriate. Most of the students, but not all, claimed to have become more competent in reasoning during the project period. The confounding of task difficulty and competence development remains an unsolved problem. Sometimes a student thinks he or she gained competence because he or she learned to tackle more difficult reasoning tasks. Analysing the students' answers revealed that learning to reason is not easy for the children, but it is feasible. In addition, some of the difficulty lies in the format of the task and is not part of mathematical reasoning in general. Additionally, the teachers believed that using the rubric led them to a deeper understanding of the topic and helped with content knowledge and the clarification of the learning goals for the students. This relates positively to our research question about the transparency of the rubric.
According to the teachers, every student made progress in self- and peer-assessment in mathematical reasoning with the help of the rubric. The teachers stated that self- and peer-assessment was easier for the students than improving their own work. The project time was too short for some children to learn how to use the rubric for self-regulated working. While adaptations in the dimensions “correct computations” and “illustrations” were easily undertaken, those for the “procedure” and the “argumentation” dimensions caused more problems, especially for children with lower language skills. Both students and teachers stated that the rubric served as a checklist during and after working on a task. The students also viewed the rubric as a time-consuming tool. They realised, however, that this drawback was primarily due to the reasoning tasks and not to the rubric itself. Further, students understood that rubrics are not needed for every type of task, yet they indicated a desire to use rubrics for appropriate tasks in other subjects as well. Rubrics seem to enhance motivation, according to the children, because they encourage students to aim at reaching a higher level, or even the highest level. Students indicated feeling more secure using the scoring grid because they knew which competence level they had achieved and what knowledge and skills they possessed.
The criteria of the rubric based on the standard were clear and valid to the teachers. The teachers felt the instrument supported their feedback to students. Compared to usual mathematical tasks, it took more time to assess student work because the teachers had to follow the reasoning of each child to evaluate competence. This extra time was due to the mathematical reasoning competence, not caused by the rubric as a tool. The rubric was helpful in teacher and student discussions about the student's work and encouraged the teacher to ask the student to explain his or her argument. This cannot be done for every piece of work because there is not enough time during regular lessons. However, when the children were busy doing seatwork, some teachers found time to do this for every child once during the period of the study. Sharing the criteria of the evaluation with students made it easier for the children to accept the teacher's score. Some teachers reported having discussed the criteria of the evaluation with the class as a whole, based on prepared examples. Overall, the teachers claimed to be able to construct such a standards-oriented rubric themselves if they were given enough time. According to the teachers, every teacher should try to develop a rubric at least once. A teacher could learn to adapt pre-constructed rubrics from schoolbooks or the school community. Ideally, according to teachers, constructing a rubric should be done in collaboration with experts in the field.
Discussion
One of the aims of our project was to construct a quality assessment tool for standards-oriented competencies. As an example, we chose the mathematical reasoning competence, which is difficult to measure. If teachers have an instrument to assess such a complex competence, they might put more weight on mathematical reasoning in the classroom, and the competence might become more significant for the students because they recognise that they will be tested on mathematical reasoning. To evaluate our instrument, we looked at three important criteria of quality: reliability, validity and transparency. In addition to these three criteria, Moskal (2003) adds that a rubric should be based on clearly defined learning goals, which in our case were the national standards. She notes that performance assessments allow students to demonstrate complex competencies. Although we used primarily written work, it would be possible to apply the rubric to oral work as well. Regarding the reliability and validity of the rubric, we focused more on the achievement test than on the occasional formative assessments that the teachers embedded in the ordinary classroom.
Concerning reliability, we achieved good measures for our items and the construction of the instrument's four levels. Because the reasoning tasks are time-consuming, our test did not yet yield reliable measures of students’ competence; more solved items are needed. However, decisions in the classroom made on the basis of an assessment can easily be changed if they appear to be wrong. Consequently, reliability for the students’ competence measures is not as crucial as it is for large-scale assessments, where there is no turning back (Black, 1998). Therefore, when the assessment has relatively low stakes, lower levels of reliability may be acceptable (Jonsson & Svingby, 2007). Further, the two raters who judged the test items experienced no problems in reaching sufficient interrater reliability employing the scoring grid. Additionally, intrarater reliability was supported by the rubric when the teacher was using formative assessment to enhance student learning. The instrument allowed the teacher to maintain consistency while assessing different students’ work. Teachers should be aware that qualities of a student's work sometimes go unrecognised when the assessor is too focused on the scoring criteria of the rubric (see e.g. Hull, Kuo, Gupta, & Elby, 2013). A student might then receive an invalidly low score that reflects the traits of the rubric rather than the performance.
The analysis of the interview data showed a high agreement between experts, teachers and students about the content of mathematical reasoning as presented in our rubric, which supports the validity of the instrument. The levels of reasoning, and how to use reasoning appropriately, are clear to all students. Hence, the rubric contributes considerably to transparency in teaching and evaluations. Some students needed nearly the entire study period to achieve a full understanding of the rubric.
Working with open-ended tasks seems to be appropriate to obtain valid measures for complex competencies, such as mathematical reasoning. This is concordant with experiences from the Trends in International Mathematics and Science Study (TIMSS) (Adams & Gonzalez, 1996). If we are interested in criterion-based statements about a student's reasoning competence, more than one task is required. As we learned from the interviews, the format of the item affects the difficulty; therefore, several items are necessary for a valid assessment of the mathematical reasoning concept. Considering the questions of Moskal and Leydens (2000) regarding the examination of the validity of the rubric itself, we note that the interpretation of the student's performance still requires the teacher's judgement. As an example, the difficulty of a task's format is not included in our rubric because this would exceed the scope of the instrument. Teachers need to explain why some tasks are more difficult, such as those with longer or more complex texts, those that require more than one logical step to solve the problem, or those set in unfamiliar contexts. Some of these aspects are part of the rubric's competence descriptions. In summary, and in relation to our first research question, we believe our tool for the assessment of a complex competence is of substantial quality. A final remark regarding the consistency of the ratings between the teachers in our project: we did not include this in our analysis. However, results from previous research (Rezaei & Lovorn, 2010; Stuhlmann et al., 1999) indicate that training teachers on the use of rubrics enhances intra-judge reliability.
Although the results of our study are based on a limited pilot study, the interviews with the four teachers offered valuable insights into their experiences with the rubric as a tool for classroom assessment. Referring to our second research question, we conclude that the instrument served as a foundation for planning lessons that included sequences of formative assessment. The instrument c