This study aimed to gain knowledge of students' beliefs and difficulties in understanding p-values, and to use this knowledge to develop improved teaching programs. The study took place over four consecutive teaching semesters of a one-semester tertiary statistics unit. The study was cyclical, in that the results of each semester were used to inform the instructional design for the following semester. Over the semesters, the following instructional techniques were introduced: computer simulation, the introduction of hypothetical probabilistic reasoning using a familiar context, and the use of alternative representations. The students were also encouraged to write about their work. As the interventions progressed, a higher proportion of students successfully defined and used p-values in Null Hypothesis Testing procedures.
1. INTRODUCTION
This study examined students' problems in understanding p-values, and the results of an intervention that aimed to improve this understanding. Null Hypothesis Testing (NHT) is one of the main techniques in inferential statistics, yet previous research has shown that the concept of the p-value can be problematic for students (Batanero, 2008; Gliner, Leech, & Morgan, 2002; Nickerson, 2000).
P-values have come about from the desire to estimate the likelihood that a sample was drawn from a population with a specified value for the population parameter. When a venous blood sample has been taken correctly, the sample will be like the blood in the rest of the venous system. In most sampling situations, however, it is extremely unlikely that a sample will be exactly representative of the population. If another sample were taken, it too would be unlikely to be exactly representative of the population and, in addition, unlikely to be exactly like the first sample. Despite this, researchers know that the sample will in some way tend to resemble the population, and that it is still possible to draw conclusions about the population, even if it is not possible to be absolutely certain about the accuracy of those conclusions.
One way around the problem of uncertainty is to perform NHT. With this process, a proposition (the null hypothesis) is made about a population parameter. A sample is then collected, the relevant sample statistic is calculated, and a judgment is made as to how likely the sample statistic (or one even more extreme) would be if the proposition about the parameter were true. In the NHT process, this judgment is made by calculating a conditional probability: the probability of obtaining a sample with the given or a more extreme statistic, if the population has the parameter proposed in the null hypothesis. It is this probability that is known as the p-value. One way to interpret this p-value is to compare it to a pre-set value. If the p-value is below the pre-set value, it is concluded that it is unlikely that the sample came from a population with the parameter value stated in the null hypothesis, and the null hypothesis is rejected. If the p-value is above the pre-set value, then it is concluded that the sample could have come from a population with the proposed value, and one fails to reject the null hypothesis.
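The procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only, not material from the teaching program: the data, the null value of 50, the assumed known population standard deviation of 3, and the function names `exact_p_value` and `simulated_p_value` are all hypothetical. The p-value is computed twice, once from the normal distribution and once by simulating many samples from a population in which the null hypothesis is true.

```python
import random
from statistics import NormalDist, mean


def exact_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean = mu0, with sigma assumed known."""
    standard_error = sigma / len(sample) ** 0.5
    z = (mean(sample) - mu0) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_p_value(sample, mu0, sigma, reps=20000, seed=1):
    """Estimate the same p-value by drawing many samples from the null population."""
    rng = random.Random(seed)
    n = len(sample)
    observed_gap = abs(mean(sample) - mu0)
    as_extreme = 0
    for _ in range(reps):
        null_sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        # Count simulated samples whose mean is at least as far from mu0
        # as the observed sample mean.
        if abs(mean(null_sample) - mu0) >= observed_gap:
            as_extreme += 1
    return as_extreme / reps


# Hypothetical data: ten measurements, tested against H0: mean = 50 (sigma = 3).
sample = [52, 48, 55, 51, 54, 50, 53, 49, 56, 52]
alpha = 0.05  # pre-set value, chosen before looking at the data

p_exact = exact_p_value(sample, mu0=50, sigma=3)
p_sim = simulated_p_value(sample, mu0=50, sigma=3)
decision = "reject H0" if p_exact < alpha else "fail to reject H0"
print(f"exact p = {p_exact:.4f}, simulated p = {p_sim:.4f}, decision: {decision}")
```

The simulated value is the proportion of null-population samples that are at least as extreme as the observed one, which makes the conditional nature of the p-value explicit: every simulated sample is drawn under the assumption that the null hypothesis is true.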
Previous research shows that students of statistics can have problems understanding this process, and this lack of understanding can go undetected by their instructors because the students may follow the procedures accurately (Garfield & Ahlgren, 1988). It is only when questions are asked that require students to describe their reasoning that this lack of understanding is detected. The aim of this study was to gain knowledge of students' beliefs and difficulties in understanding p-values, and to use this knowledge to develop teaching programs to enhance students' understanding of this concept. The research questions were: What are students' understandings of p-values? What misconceptions may they hold? And can teaching methods be developed to improve students' understandings?
1.1. LITERATURE REVIEW
A null hypothesis test starts with the statement of the null hypothesis containing the proposed value of the population parameter. Previous research shows that students may believe that this hypothesis refers to both the sample and the population, and are therefore confused about NHT from the very start of the process (Sotos, Vanhoof, Van den Noortgate, 2007). It has also been found that students may carry out the procedures for NHT correctly, but then misinterpret the results through a lack of understanding of what rejecting and failing to reject the null hypothesis really indicate. This problem was investigated by Haller and Krauss (2002), who conducted a survey of staff and students, some of whom were statistics instructors, from the psychology departments of six universities. In this survey, an example of an independent samples t-test was given where the p-value was 0.01. Approximately 26% of the participants (including a small number of statistics methodology instructors) agreed with the statement: "You have found the probability of the null hypothesis being true." Approximately 69% of the participants (including approximately one third of the statistics methodology instructors) agreed with the statement: "You know, if you decide to reject the hypothesis, the probability that you are making the wrong decision." Those who agreed with this statement did not seem to be aware of the conditional nature of the probability the p-value represents. That is, the p-value is the probability of making the wrong decision if the null hypothesis is true.
The belief that the p-value is the probability that the null hypothesis is true appears to be a commonly held misconception. A related misconception is that 1 - p is the probability that the alternative hypothesis is true. It may also be believed that rejecting a null hypothesis proves the underlying theory that predicted the rejection, or that a low p-value suggests that the results are replicable (Nickerson, 2000).
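The gap between the p-value and the probability that a rejection is the wrong decision can be illustrated with a small simulation. The sketch below is not drawn from Haller and Krauss (2002) or Nickerson (2000); it rests on assumptions made purely for illustration: that 90% of the null hypotheses tested are in fact true, that the remaining studies have a standardised effect of 0.5 with 32 observations each, and that the test statistic is a z statistic. The function name is hypothetical.

```python
import random
from statistics import NormalDist


def share_of_true_nulls_among_rejections(prior_null, effect, n, alpha, studies, seed=0):
    """Among studies that reject H0, return the fraction in which H0 was actually true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejected = 0
    rejected_true_null = 0
    for _ in range(studies):
        null_is_true = rng.random() < prior_null
        # The z statistic is centred on 0 when H0 is true,
        # and on effect * sqrt(n) when it is false.
        centre = 0.0 if null_is_true else effect * n ** 0.5
        z = rng.gauss(centre, 1.0)
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            rejected += 1
            if null_is_true:
                rejected_true_null += 1
    return rejected_true_null / rejected


alpha = 0.05
wrong_rejections = share_of_true_nulls_among_rejections(
    prior_null=0.9,  # assumed share of true null hypotheses (illustrative only)
    effect=0.5,      # assumed standardised effect when H0 is false
    n=32,
    alpha=alpha,
    studies=50000,
)
print(f"share of rejections where H0 was true: {wrong_rejections:.2f}")
```

Under these assumptions the share of rejections that are wrong decisions is well above the 0.05 cut-off, because that share depends on how often the null hypothesis is true in the first place, which the p-value does not measure.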
1.2. WHY USE P-VALUES?
The use of the null-hypothesis test is widespread and p-values are reported widely in the literature. The way a p-value is used differs and is the subject of debate (Cumming, 2010; Gliner, Leech, & Morgan, 2002; Hubbard & Lindsay, 2008). One way p-values can be used, attributed to Neyman and Pearson, is that a pre-existing level of significance is chosen, and the null hypothesis is rejected if the p-value is less than this level of significance. This form of analysis leads to the possible calculation of Type I and Type II error rates. An alternative (advocated by Fisher) is to look at the level of support a p-value gives to a null hypothesis. As the p-value decreases, the level of support given for the null hypothesis is also considered to decrease (Wagenmakers, 2007). Recently, however, the question has been asked: should p-values be used at all?
One tenet of a scientific experiment is that it should be replicable. Therefore, it would seem not unreasonable to assume that if an experiment should be repeatable then the p-value would also be replicable. Cumming (2010) has shown that in fact p-values vary much more from sample to sample than many researchers realise. Hubbard and Lindsay (2008) show that p-values can vary even with the same data, depending on the method of analysis chosen by the researcher and on whether the researcher has chosen a one- or two-tailed test.
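The sample-to-sample variation in p-values can be seen by repeating the same hypothetical experiment many times. The sketch below is not a reproduction of Cumming's (2010) demonstration; it assumes a z-test with known standard deviation, a true standardised effect of 0.5, and 32 observations per experiment, all chosen only for illustration.

```python
import random
from statistics import NormalDist, mean, quantiles


def replicate_p_values(true_mean, mu0, sigma, n, reps, seed=0):
    """Run the same experiment `reps` times and return the two-sided p-value of each run."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / n ** 0.5)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


# Identical design every time: only the random sample changes.
p_values = replicate_p_values(true_mean=0.5, mu0=0.0, sigma=1.0, n=32, reps=2000)

deciles = quantiles(p_values, n=10)  # nine cut points: 10th, 20th, ..., 90th percentile
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

print(f"smallest p = {min(p_values):.6f}, largest p = {max(p_values):.3f}")
print(f"10th percentile = {deciles[0]:.4f}, median = {deciles[4]:.4f}, "
      f"90th percentile = {deciles[8]:.4f}")
print(f"share of replications with p < 0.05: {share_significant:.2f}")
```

Even though every replication samples from the same population with the same design, the p-values span several orders of magnitude, and a noticeable share of replications fail to reach p < 0.05.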
Another problem with p-values is that they do not indicate the effect size. A small study with a large effect size can yield the same p-value as a large study with a small effect size (Hubbard & Lindsay, 2008; Wagenmakers, 2007). In addition, there is concern about the validity of the way p-values are calculated. Assuming the null hypothesis is true, a p-value is the probability of the observed data plus the probability of more extreme data, yet these more extreme data are not actually observed. It is questionable whether decisions should be made on unobserved data (Hubbard & Lindsay, 2008).
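A short calculation shows how two very different studies can share a p-value. The example assumes a one-sample z-test, for which the test statistic equals the standardised effect size multiplied by the square root of the sample size; the effect sizes and sample sizes are hypothetical and were chosen so that the two products match.

```python
from statistics import NormalDist


def p_value_from_effect(effect_size, n):
    """Two-sided p-value of a one-sample z-test with standardised effect `effect_size`."""
    z = effect_size * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical studies: the effect is ten times smaller in the large study,
# but its sample is one hundred times bigger, so z is the same in both.
p_small_study = p_value_from_effect(effect_size=0.8, n=10)
p_large_study = p_value_from_effect(effect_size=0.08, n=1000)

print(f"small study, large effect: p = {p_small_study:.4f}")
print(f"large study, small effect: p = {p_large_study:.4f}")
```

The two p-values agree, so the p-value alone cannot tell a reader whether the underlying effect is large or small.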
It is for these reasons that it has been suggested that the results of scientific experiments should instead be presented as confidence interval estimates of the parameters. Confidence intervals have the advantage that they are in the same units as the point estimate, and make it easier for the reader to determine whether an effect is important, rather than just whether it is statistically significant. Of even more consequence is that confidence intervals give an idea of the precision of an estimate via the width of the interval. In addition, the width of the interval gives an idea of what the infinite set of possible results may look like (Cumming, 2010; Wagenmakers, 2007). The contrast between the variation in p-values and the variation in confidence intervals is graphically and amusingly illustrated by the "Dance of the p-values