Validity[edit]
Validity refers to how well a tool measures what it intends to measure. With each user rating a product only once, for example in a category from 1 to 10, there is no means of evaluating internal reliability using an index such as Cronbach's alpha. It is therefore impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic and/or statistical procedures: a measurement procedure is valid to the degree that it measures what it proposes to measure.
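To illustrate why a single rating per user precludes reliability analysis, the following sketch computes Cronbach's alpha for a hypothetical panel in which each respondent answers several related items; the data and item structure are invented for illustration. With only one item per respondent (k = 1), the k/(k − 1) factor in the formula is undefined, so internal consistency simply cannot be assessed.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: three respondents rating one product on four items.
ratings = [[8, 7, 9, 8],
           [3, 4, 2, 3],
           [6, 6, 7, 5]]
print(round(cronbach_alpha(ratings), 2))

# With a single overall rating per user the array would have k = 1,
# and the k / (k - 1) term divides by zero: alpha is undefined.
```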
Another fundamental issue is that online ratings usually involve convenience sampling much like television polls, i.e. they represent only the opinions of those inclined to submit ratings.
Validity is concerned with different aspects of the measurement process. Types of validity include content validity, predictive validity, and construct validity. Each uses logic, statistical verification, or both to determine the degree of validity, and each has special value under certain conditions.
Sampling[edit]
Sampling errors can lead to results which have a specific bias, or are only relevant to a specific subgroup. Consider this example: suppose that a film only appeals to a specialist audience—90% of them are devotees of this genre, and only 10% are people with a general interest in movies. Assume the film is very popular among the audience that views it, and that only those who feel most strongly about the film are inclined to rate the film online; hence the raters are all drawn from the devotees. This combination may lead to very high ratings of the film, which do not generalize beyond the people who actually see the film (or possibly even beyond those who actually rate it).
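The self-selection effect described above can be sketched with a small simulation. The population proportions, score distributions, and the assumption that only devotees submit ratings are all taken from the hypothetical example in the paragraph; the specific numbers are illustrative, not empirical.

```python
import random

random.seed(0)

# Hypothetical population: 90% genre devotees, 10% general viewers.
# Devotees' opinions cluster near 9/10, general viewers' near 5/10,
# but (by assumption) only the devotees bother to submit a rating.
population = ([("devotee", random.gauss(9, 0.5)) for _ in range(900)]
              + [("general", random.gauss(5, 1.0)) for _ in range(100)])

submitted = [score for group, score in population if group == "devotee"]

true_mean = sum(score for _, score in population) / len(population)
rated_mean = sum(submitted) / len(submitted)

print(f"mean opinion of all viewers: {true_mean:.1f}")
print(f"mean of submitted ratings:   {rated_mean:.1f}")
# The submitted ratings overstate the film's appeal to the full population.
```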
Qualitative description[edit]
Qualitative description of categories improves the usefulness of a rating scale. For example, if only the points 1–10 are given without description, some people may select 10 rarely, whereas others may select it often. If, instead, "10" is described as "near flawless", the category is more likely to mean the same thing to different people. This applies to all categories, not just the extreme points.
The above issues are compounded when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified, because calculating averages requires equal intervals, i.e. that the difference between adjacent levels represents the same difference in perceived quality throughout the scale. The key issues with aggregate data based on the kinds of rating scales commonly used online are as follows:
Averages should not be calculated for data of the kind collected.
It is usually impossible to evaluate the reliability or validity of user ratings.
Products are not compared with respect to explicit, let alone common[clarification needed], criteria.
Only users inclined to submit a rating for a product do so.
Data are not usually published in a form that permits evaluation of the product ratings.
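A short sketch of the averaging problem: two hypothetical products can have identical mean ratings on a 1–10 ordinal scale while their distributions, and hence their medians, differ sharply. The product data here are invented for illustration.

```python
from statistics import mean, median

# Two hypothetical products rated on a 1-10 ordinal scale.
polarising = [1, 1, 1, 10, 10]   # love-it-or-hate-it reception
consistent = [4, 4, 5, 5, 5]     # uniformly middling reception

print(mean(polarising), mean(consistent))      # identical means
print(median(polarising), median(consistent))  # very different medians

# The mean treats the step from 1 to 2 as equal to the step from 9 to 10,
# an interval assumption that ordinal rating data do not support.
```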
More developed methodologies include Choice Modelling or Maximum Difference methods, the latter being related to the Rasch model due to the connection between Thurstone's law of comparative judgement[clarification needed] and the Rasch model.