Analysis of Covariance
Background
The analysis of covariance, a special case of regression analysis (Chapters 2-4), involves a continuous dependent variable and both categorical and continuous independent variables. The central purpose of an analysis of covariance is to compensate for influences on the dependent variable that interfere with direct comparisons among a series of categories. Situations arise, for example, where mean values differ to some extent because of the influence of one or more extraneous continuous variables.
The analysis of covariance seeks to remove the influence of these variables, called covariables, that bias direct comparisons among a series of categories. For example, the mean birth weights of a sample of infants differ when classified by mother's smoking habits (never smoked, past smokers, and present smokers), but the three groups also differ by maternal age (covariable). The contributions to the observed differences caused by maternal age, though of little interest, bias the evaluation of possible effects of maternal smoking exposure on birth weight. The analysis adjusts the mean birth weights among the three smoking categories so that comparisons are "free" from the influences of maternal age. The word free should not be taken too literally. Statistical adjustment, which is at the center of the analysis of covariance, depends on the validity (or, at least, the goodness-of-fit) of a specific statistical model. When this underlying model is appropriate, it is possible to remove effectively the influence of a covariable by regression techniques. Differences observed among the adjusted mean values are then no longer influenced by the effects of one or more extraneous variables. Similar to regression analysis, goodness-of-fit is addressed as part of the analysis. The following sections concentrate on the simplest analysis of covariance, followed by a less detailed description of more general and more complicated approaches.
Criterion
The basic analysis of covariance combines simple linear regression (Chapter 2) with one-way analysis of variance (reviewed in Chapter 1). The data are collected in pairs (a covariable and a dependent variable) and classified into a series of categories (illustrated in the accompanying table). The symbol x represents a continuous covariable and the sampled value of the dependent variable is denoted, as before, by y. The notation $x_{ij}$, $y_{ij}$ indicates the $j^{th}$ observation from the $i^{th}$ group. Furthermore, the covariable x is related to the dependent variable y and, on the average, differs among the g categories. The direct comparison of the sample mean values, therefore, reflects any effects from the categorical classification as well as influences from the covariable, making direct evaluation of the role of the categorical variable difficult without compensating for the influence of the covariable. If the variable x is unrelated to the dependent variable or does not differ among the g categories, then the comparison of the mean values is not affected by x and an analysis of covariance is unnecessary.
An analysis of covariance can be viewed as a sequential investigation of the fit of three nested statistical models. As before, the criterion to examine the goodness-of-fit of these competing models is an F-statistic, used to contrast residual sums of squares.
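To make this criterion concrete, the following minimal Python sketch contrasts the residual sums of squares of two nested models with an F-statistic. The function name and arguments are illustrative, not from the text; it assumes numpy-style inputs and that scipy is available.

```python
import numpy as np
from scipy import stats

def nested_f_test(res_reduced, res_full, df_reduced, df_full):
    """F-statistic contrasting the residual sums of squares of two nested
    models, where the reduced model is a special case of the full model."""
    numerator = (res_reduced - res_full) / (df_reduced - df_full)
    denominator = res_full / df_full
    f = numerator / denominator
    p = stats.f.sf(f, df_reduced - df_full, df_full)  # right-tail p-value
    return f, p
```

A small p-value indicates that the restriction imposed by the reduced model fits the data substantially worse than the full model.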
Model I: Interaction
The key to an analysis of covariance is the feasibility of a specific statistical model, called for simplicity model II. The effectiveness of model II is evaluated by a comparison with a more general statistical structure called model I (depicted in Figure 5.1). Model I postulates that the covariable is related to the dependent variable by a series of simple linear regression equations that differ among some or all of the g groups. The property that the relationship between the covariable and the dependent variable differs among groups is an interaction. Like the previous description of an interaction (expression 4.26), the relationship between variables x and Y depends on the value of a third variable (categories).
Specifically, a series of straight lines with different intercepts ($a_i$) and different slopes ($b_i$) describes the sample data within each group. In symbols, the expected value of the dependent variable for each group is given by the simple linear equation

$$y_{ij} = a_i + b_i x_{ij},$$

where $i$ indicates group membership and $j$ indicates the specific observation within the group. This statistical model allows the parameters of each regression line ($a_i$ and $b_i$) to differ among the $g$ categories ($i = 1, 2, \ldots, g$).
A simple linear regression analysis applied to the $n_i$ observed pairs ($x_{ij}$, $y_{ij}$) in each of the $g$ categories yields estimates of $a_i$ and $b_i$ denoted, as before, $\hat{a}_i$ and $\hat{b}_i$. The process involves $g$ separate regression analyses, requires $2g$ estimates, and produces $g$ estimated linear regression lines. Furthermore, each estimated regression line certainly fails to fit the data perfectly, and the lack of fit is measured by the residual sum of squares within each group. The residual sum of squares for the $i^{th}$ category is the sum of squared deviations of the points on the estimated regression line from each corresponding observation or, for category $i$,

$$Res_i = \sum_j (y_{ij} - \hat{y}_{ij})^2, \quad \text{where } \hat{y}_{ij} = \hat{a}_i + \hat{b}_i x_{ij},$$

is the residual sum of squares for a specific group. This residual sum of squares divided by the variance of the dependent variable has a chi-square distribution with $n_i - 2$ degrees of freedom when the requirements hold for a regression analysis (independence, normality, equal variance, and linearity). The estimation and evaluation of each of these $g$ regression equations is identical to the simple linear regression analysis described in Chapter 2; the process is just repeated $g$ times, once for each group.
An overall assessment of model I comes from summarizing the total fit of the $g$ regression equations by

$$Res(I) = \sum_{i=1}^{g} Res_i = \sum_i \sum_j (y_{ij} - \hat{y}_{ij})^2.$$

The value $Res(I)$ divided by the variance of $Y$ also has a chi-square distribution under model I with $N - 2g$ degrees of freedom, where $N = \sum n_i$ is the total number of observations. It is necessary to make $2g$ independent estimates to establish the $N$ estimated values under model I, yielding $N - 2g$ degrees of freedom.
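A minimal sketch of the model I computation, assuming the data are held in numpy arrays x, y, and group (hypothetical names): each category gets its own simple linear regression, and the within-group residual sums of squares are pooled into Res(I).

```python
import numpy as np

def fit_model_one(x, y, group):
    """Fit a separate simple linear regression within each of the g groups
    (model I) and return the pooled residual sum of squares Res(I)."""
    res_one = 0.0
    for g_label in np.unique(group):
        xg, yg = x[group == g_label], y[group == g_label]
        b, a = np.polyfit(xg, yg, 1)  # slope b_i and intercept a_i for group i
        res_one += np.sum((yg - (a + b * xg)) ** 2)
    return res_one
```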
Model II: No Interaction
Model II allows the expected values among the $g$ categories to differ, but the relationship between the covariable $x$ and the dependent variable $Y$ within each category is the same, namely linear with equal slopes (Figure 5.2). In symbols, model II is

$$y_{ij} = a_i + b' x_{ij}.$$
Model II is a special case of model I and provides the specific structure to statistically remove the influence of the covariable. A set of $g + 1$ parameters describes model II: the intercepts $a_i$ differ among categories, but the lines have the same slope, denoted $b'$. Because the relationship between the independent and dependent variable is the same in each group, no interaction exists (no-interaction model). In other words, the relationship between variables x and y does not depend on the values of a third variable (categories).
The fit of model II is also measured by a residual sum of squares. For each of the $g$ groups, the estimated linear model for the $i^{th}$ group is $\hat{y}_{ij} = \hat{a}_i + \hat{b}' x_{ij}$, producing the summary measure of fit of the $g$ regression equations

$$Res(II) = \sum_i \sum_j (y_{ij} - \hat{y}_{ij})^2$$

or, similar to model I, the sum over the $g$ categories of the within-group residual sums of squares.
The quantity represented by $Res(II)$ divided by the variance of $Y$ has a chi-square distribution under model II with $N - (g + 1)$ degrees of freedom. Because $g + 1$ estimates are required to produce the $N$ estimated values $\hat{y}_{ij}$, the degrees of freedom are $N - (g + 1)$. The residual sum of squares is related to a chi-square distribution when, in addition to the four usual regression analysis requirements, the slopes of the regression lines within each group are identical. Note that the intercept values $\hat{a}_i$ generated by model II are not equal to the $\hat{a}_i$-values used to define model I.
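A companion sketch for model II, under the same hypothetical data layout: the design matrix carries one indicator column per group (the g intercepts) plus a single column for x (the common slope b'), and Res(II) is the residual sum of squares of the least-squares fit.

```python
import numpy as np

def fit_model_two(x, y, group):
    """Fit model II: one intercept per group plus a single common slope b'.
    Returns the residual sum of squares Res(II)."""
    labels = np.unique(group)
    dummies = (group[:, None] == labels[None, :]).astype(float)  # g intercept columns
    design = np.column_stack([dummies, x])                       # plus common-slope column
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)       # g + 1 estimates
    fitted = design @ coef
    return np.sum((y - fitted) ** 2)
```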
Because model II is a restriction of model I, an F-statistic serves to assess whether model II is consistent with the data. The formal F-statistic is

$$F = \frac{[Res(II) - Res(I)] / (g - 1)}{Res(I) / (N - 2g)},$$

which has an F-distribution with $g - 1$ and $N - 2g$ degrees of freedom when the $g$ slopes are equal (no interaction).
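Putting the two residual sums of squares together, a sketch of the interaction test, reusing the hypothetical fit_model_one and fit_model_two functions from the earlier sketches:

```python
import numpy as np
from scipy import stats

# x, y, and group are hypothetical numpy arrays of equal length N
res_one = fit_model_one(x, y, group)   # Res(I): separate slopes per group
res_two = fit_model_two(x, y, group)   # Res(II): single common slope
g, N = len(np.unique(group)), len(y)

f_stat = ((res_two - res_one) / (g - 1)) / (res_one / (N - 2 * g))
p_value = stats.f.sf(f_stat, g - 1, N - 2 * g)  # small p suggests an interaction
```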
If the value of F is unlikely to have occurred by chance, then model II is not a tractable representation of the data and adjustment by linear regression techniques is not meaningful. The rejection of model II indicates a statistical interaction: the relationship between x and Y differs among some or all of the g categories. Geometrically, a large difference between the fit of model I and model II indicates that the relationship between x and Y is not adequately represented by a set of parallel straight lines. A direct consequence of rejecting model II (accepting model I) is that a comparison of the mean values depends on the value of x (Figure 5.1). A comparison of the mean values at one choice of x produces a difference $\bar{y}_i - \bar{y}_{i'}$ that differs for the same comparison at other choices of x; therefore, no general comparison can be made among the categories free of the influence of the covariable x.
When an F-statistic indicates that model II differs from model I by no more than chance variation, the no-interaction model provides a basis to remove the influence of the covariable x, yielding a set of adjusted mean values. Model II requires the differences between mean values to be the same for any choice of x. Adjusted mean values based on model II, free from the influence of the x-variate, are then directly compared to detect possible differences associated with the g categories. Clearly, the validity or the close fit of model II is crucial because the computation of the adjusted mean values depends on the requirement that the regression lines be parallel within all groups compared. Failure to reject model II does not unequivocally imply that the model represents the relationships within the data. It simply means that, in the face of no evidence to the contrary, the analysis proceeds as if the no-interaction model accurately describes the data.
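The adjusted mean values themselves are typically computed by evaluating each group's model II line at the overall mean of the covariable, that is, $\bar{y}_i - \hat{b}'(\bar{x}_i - \bar{x})$; this standard form is assumed here rather than quoted from the text. A minimal sketch, assuming the common slope estimate is already in hand:

```python
import numpy as np

def adjusted_means(x, y, group, b_common):
    """Adjusted mean of y for each group under model II: the observed group
    mean shifted along the common slope to the overall mean of x."""
    x_bar = x.mean()
    return {g_label: y[group == g_label].mean()
                     - b_common * (x[group == g_label].mean() - x_bar)
            for g_label in np.unique(group)}
```

Because the lines are parallel under model II, the differences among these adjusted means are the same at every value of x, which is precisely what makes the comparison free of the covariable.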