While organizations are, and should be, interested in selecting the highest quality workforce
possible, many are also concerned with selecting a diverse workforce and with avoiding measures that will systematically produce adverse impact against protected
groups. At the same time, if there are important job requirements, such as sufficient
upper body strength to perform a firefighter job, an organization would be remiss in
not considering this factor in making selection decisions, even if doing so means that a
disproportionately small share of women will be hired.
If an assessment method is shown to produce adverse impact and the organization
wishes to continue using that assessment, there are legal requirements that the
method have demonstrated validity. If an organization uses an assessment
that produces adverse impact without such validity evidence, the organization will
be vulnerable to legal challenges against which it will not be able to prevail. While evidence
of validity can be used to justify and defend the use of measures that produce
adverse impact, many organizations nonetheless attempt to mitigate the adverse impact
produced by their assessment methods to the extent possible in order to minimize both
potential legal exposure and concerns about a lack of diversity.
Because adverse impact analyses reflect the proportion of majority versus protected
group members who are ultimately selected for a job, they cannot be computed until
after the assessment process is complete and final selection decisions have been made. This is
obviously a very late point for organizational decision makers to discover that the assessment
may produce undesirable levels of adverse impact.
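To make the analysis itself concrete, the sketch below shows, in Python, how selection rates are typically compared once final decisions are known. The applicant and hiring counts are hypothetical, and the four-fifths (80%) benchmark used to flag potential adverse impact is a commonly cited rule of thumb rather than a figure taken from the example in the text.

```python
# Illustrative adverse impact calculation with hypothetical numbers.
# Selection rate = number hired / number assessed for each group.

applicants = {"male": 50, "female": 50}   # hypothetical applicant counts
hired = {"male": 20, "female": 8}         # hypothetical final hiring decisions

# Compute each group's selection rate.
rates = {group: hired[group] / applicants[group] for group in applicants}

# Compare the protected group's selection rate to the majority group's rate.
impact_ratio = rates["female"] / rates["male"]

print(f"Male selection rate:   {rates['male']:.2f}")    # 0.40
print(f"Female selection rate: {rates['female']:.2f}")  # 0.16
print(f"Impact ratio:          {impact_ratio:.2f}")     # 0.40

# A commonly used benchmark (the "four-fifths rule") flags potential
# adverse impact when the ratio falls below 0.80.
if impact_ratio < 0.80:
    print("Potential adverse impact flagged.")
```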
For this reason, researchers and practitioners often examine other statistics that can be
calculated much earlier in the process to determine the likelihood that an assessment
method will produce adverse impact. Specifically, one can compare the average scores
that different demographic group members receive on an assessment that is being considered
for implementation. Continuing with our upper body strength test example,
this would be accomplished by calculating the average score for the group of 50 females
who took the test and the average score for the group of 50 males who took the test.
The difference between these two average scores would then be divided by the standard
deviation of the test scores, yielding a statistic that expresses the difference in how the two groups
performed on the test. This statistic is commonly
referred to as either an “effect size” or a “group difference in standard deviation units.”
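A minimal sketch of this calculation is shown below. The scores are hypothetical (and shortened to 10 per group for readability), and the pooled standard deviation used here is one common way, often associated with Cohen's d, of expressing the mean difference in standard deviation units.

```python
import statistics

def effect_size(group_a_scores, group_b_scores):
    """Standardized mean difference between two groups,
    using a pooled standard deviation (Cohen's d)."""
    mean_a = statistics.mean(group_a_scores)
    mean_b = statistics.mean(group_b_scores)
    var_a = statistics.variance(group_a_scores)   # sample variance
    var_b = statistics.variance(group_b_scores)
    n_a, n_b = len(group_a_scores), len(group_b_scores)

    # Pooled standard deviation across the two groups.
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5

    # Difference between group means, expressed in standard deviation units.
    return (mean_a - mean_b) / pooled_sd

# Hypothetical upper body strength test scores (a subset of each group of 50).
male_scores = [72, 58, 75, 80, 66, 61, 77, 69, 54, 70]
female_scores = [61, 48, 65, 73, 59, 66, 50, 62, 74, 57]

d = effect_size(male_scores, female_scores)
print(f"Effect size (standard deviation units): {d:.2f}")  # roughly 0.8 here
```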
Typical “effect sizes” range from 0, indicating no difference on average in how two
groups performed on an assessment, to 1.00 or more, indicating a very large difference
in how the two groups performed. Effect sizes in the .10 to .30 range are considered
small, those in the .30 to .70 range are considered moderate, and those above .70 are
considered large. All else being equal, an effect size of .70 to 1.00 or more on an assessment
can be expected to produce a large adverse impact in the final selection decisions.
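One way to see why is to model both groups' scores as normal distributions whose means differ by the effect size and compare the proportion of each group that clears a given cut score. The sketch below does this for a cut placed at the higher-scoring group's mean; the normality assumption and the choice of cut score are simplifications for illustration, not part of the example in the text.

```python
from statistics import NormalDist

def selection_rates(effect_size, cut_score=0.0):
    """Proportion of each group clearing the cut score, assuming scores in
    both groups are normally distributed with SD = 1 and means that differ
    by the effect size (higher-scoring group centered at 0)."""
    std_normal = NormalDist()                                # mean 0, SD 1
    rate_high = 1 - std_normal.cdf(cut_score)                # higher-scoring group
    rate_low = 1 - std_normal.cdf(cut_score + effect_size)   # lower-scoring group
    return rate_high, rate_low

for d in (0.35, 0.70, 1.00):
    high, low = selection_rates(d)
    print(f"d = {d:.2f}: selection rates {high:.2f} vs {low:.2f}, "
          f"ratio = {low / high:.2f}")

# With the cut at the higher-scoring group's mean, the selection-rate ratio
# falls from roughly 0.73 at d = 0.35 to roughly 0.32 at d = 1.00.
```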
Even smaller effect sizes (e.g., in the .30 to .40 range) can produce adverse impact in
final selection decisions. It is important to understand how to interpret an effect size