That is, we would probably be wary of treating the difference between the two averages as something like the “gender effect on height” unless we could build a model of why the groups’ histograms differed in shape and, from that model, conclude that the different shapes were just two versions of random error.
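To make the worry concrete, here is a minimal sketch using simulated data. The two groups, their sizes, and their distribution parameters are all invented for illustration and are not taken from any real height data. One group is drawn from a symmetric distribution and the other from a right-skewed one, and we compare the groups at each decile as well as at the mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical heights in cm; every parameter here is an assumption
# chosen only to give the two groups visibly different shapes.
group_a = [random.gauss(165, 6) for _ in range(5000)]              # symmetric
group_b = [160 + random.expovariate(1 / 15) for _ in range(5000)]  # right-skewed

# The single number we might be tempted to call "the effect".
mean_diff = statistics.mean(group_b) - statistics.mean(group_a)

# Compare the groups at each decile, not only at the mean.
deciles_a = statistics.quantiles(group_a, n=10)
deciles_b = statistics.quantiles(group_b, n=10)
decile_diffs = [b - a for a, b in zip(deciles_a, deciles_b)]

print(f"difference in means: {mean_diff:.1f}")
print("difference at each decile:", [round(d, 1) for d in decile_diffs])
```

If the two histograms had the same shape, the decile differences would all sit close to the difference in means. In this simulation they vary widely across deciles, so no single number describes how far apart the groups are, which is why the difference in averages alone is hard to read as one "effect."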