What is test validity and test validation?
Tests themselves are not valid or invalid. Instead, we validate the use of a test score.
Tests are pervasive in our world. Tests can take the form of written responses to a series of questions, such as the paper-and-pencil SAT™, or of judgments by experts about behavior, such as those for gymnastic trials or for a work performance appraisal. The form of test results also varies, from pass/fail, to holistic judgments, to a complex series of numbers meant to convey minute differences in behavior.
Regardless of the form a test takes, its most important aspect is how the results are used and the way those results impact individual persons and society as a whole. Tests used for admission to schools or programs or for educational diagnosis not only affect individuals, but also assign value to the content being tested. A test that is perfectly appropriate and useful in one situation may be inappropriate or insufficient in another. For example, a test that may be sufficient for use in educational diagnosis may be completely insufficient for use in determining graduation from high school.
Test validity, or the validation of a test, explicitly means validating the use of a test in a specific context, such as college admission or placement into a course. Therefore, when determining the validity of a test, it is important to study the test results in the setting in which they are used. In the previous example, in order to use the same test for educational diagnosis as for high school graduation, each use would need to be validated separately, even though the same test is used for both purposes.
Validity is a matter of degree, not all or none.
Samuel Messick, a renowned psychometrician, defines validity as "...an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationale support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment." Messick points out that validity is a matter of degree: a test use is not absolutely valid or absolutely invalid. He argues that validity evidence continues to accumulate over time, either supporting or contradicting previous findings.
Tests sample behavior; they don't measure it directly.
Most, but not all, tests are designed to measure skills, abilities, or traits that are not directly observable. For example, scores on the SAT measure developed critical reading, writing, and mathematical ability. The score that an examinee obtains on the SAT is not a direct measure of critical reading ability in the way that degrees centigrade are a direct measure of the temperature of an object. The amount of an examinee's developed critical reading ability must be inferred from the examinee's SAT critical reading score.
The process of using a test score as a sample of behavior in order to draw conclusions about a larger domain of behaviors is characteristic of most educational and psychological tests. Responsible test developers and publishers must be able to demonstrate that it is possible to use the sample of behaviors measured by a test to make valid inferences about an examinee's ability to perform tasks that represent the larger domain of interest.
Reliability is not enough; a test must also be valid for its use.
If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently. However, tests can be highly reliable and still not be valid for a particular purpose. Crocker and Algina (1986, page 217) demonstrate the difference between reliability and validity with the following analogy.
Consider the analogy of a car's fuel gauge which systematically registers one-quarter higher than the actual level of fuel in the gas tank. If repeated readings are taken under the same conditions, the gauge will yield consistent (reliable) measurements, but the inference about the amount of fuel in the tank is faulty.
This analogy makes it clear that determining the reliability of a test is an important first step, but not the defining step, in determining the validity of a test.
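The fuel gauge analogy can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the tank level is expressed as a fraction of a full tank, "one-quarter higher" is read as a fixed offset of 0.25, and the small random noise, the function names, and all numeric values are hypothetical choices made for illustration, not something taken from Crocker and Algina.

```python
import random
import statistics


def read_gauge(true_level, rng, bias=0.25, noise_sd=0.01):
    """Return one gauge reading: true level + fixed bias + small random noise.

    bias and noise_sd are assumed values chosen for illustration.
    """
    return true_level + bias + rng.gauss(0.0, noise_sd)


def summarize(true_level, n_readings=100, seed=0):
    """Take repeated readings and report their spread and average error."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    readings = [read_gauge(true_level, rng) for _ in range(n_readings)]
    spread = statistics.stdev(readings)                   # small -> consistent (reliable)
    mean_error = statistics.mean(readings) - true_level   # large -> faulty inference
    return spread, mean_error


spread, mean_error = summarize(true_level=0.5)
print(f"spread of repeated readings: {spread:.3f}")
print(f"average error versus true level: {mean_error:.3f}")
```

In this sketch the spread of the readings is small, so the gauge is consistent, while the average error stays close to the assumed bias of one quarter of a tank. Consistency alone therefore says nothing about whether the inference drawn from the readings is correct.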
There are many different methods that can be used to establish the validity of a test's use.
Crocker and Algina (1986) point to three major types of validity studies: content validity, criterion-related validity, and construct validity. More recently, consequential validity has increasingly been discussed as a fourth major type of validity.
These four types of validity studies include, and sometimes employ, additional concepts of validity. To establish the content validity of a test, both a face validity study and a curricular validity study should be completed. To establish criterion-related validity, either a predictive validity study or a concurrent validity study can be used. To establish construct validity, convergent validity and/or discriminant validity studies are used. Evidence from content and criterion-related validity studies can also be used to establish construct validity. Consequential validity requires an inquiry into the social consequences of the test use that are unrelated to the construct being tested but that affect one or more groups.
Several types of evidence should be used to build a case for valid test use. For example, in building a case for the use of the CLEP® exams for college course placement, the college may want to:
• Compare the test specifications with course requirements to see if there is sufficient overlap to be comfortable using evidence from the test in place of completion of a course.
• Complete a concurrent criterion-related validity study to determine the relationship between course grades and test scores.
• Compare results from the CLEP exams with results from classroom tests of the same topics to establish convergent validity evidence.
• Follow up with surveys of the students enrolled in subsequent classes, who tested out of prerequisite classes using CLEP, to determine whether they felt their preparation to be adequate.
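One common way to summarize the relationship between course grades and test scores in a concurrent criterion-related study is a correlation coefficient. The sketch below computes a Pearson correlation by hand. It is only an illustration: the choice of Pearson's r is an assumption rather than a method named in this article, the function name is hypothetical, and the scores and grades are made-up numbers, not real CLEP or course data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, made-up values for illustration only.
test_scores = [48, 52, 55, 60, 63, 67, 70, 74]              # one score per student
course_grades = [2.0, 2.3, 2.7, 2.7, 3.0, 3.3, 3.7, 3.7]    # same students, same order

r = pearson_r(test_scores, course_grades)
print(f"correlation between test scores and course grades: r = {r:.2f}")
```

A coefficient near 1 would suggest that students who score higher on the test also tend to earn higher course grades, while a coefficient near 0 would suggest little relationship. Because validity is a matter of degree, such a number is one piece of evidence to weigh alongside the other studies listed above, not a verdict on its own.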
Learn more about different types of validity evidence.