Genome-wide association studies (GWAS) have been remarkably successful at identifying the genomic locations of variants involved in a variety of complex diseases [2]–[7]. In spite of this success, some researchers have expressed disquiet at the issue of the ‘missing heritability’ [8], namely the fact that the disease-associated single nucleotide polymorphisms (SNPs) identified through GWAS often account for only a small proportion of the observed correlations in phenotype between relatives. This suggests that additional genetic factors remain to be found. Several explanations for this phenomenon have been suggested. Firstly, the SNPs identified through GWAS are likely to be surrogates in (imperfect) linkage disequilibrium (LD) with the true causal variants, and thus cannot be expected to fully account for their effects, particularly if the true causal variants are rare. Secondly, the low power of GWAS to detect loci of small effect means that many specific true loci remain undiscovered, even though the fact of their (combined) existence may be detectable from the observed genetic data [9], [10]. Finally, and the main focus of this communication, the single-locus (SNP by SNP) testing strategy generally undertaken as the primary analysis tool in a GWAS may be underpowered to detect loci that interact with other genetic or environmental factors, since effects at such loci may not be visible unless the contributing interacting factors are also taken into account.
The relationship between biological and statistical interaction has been hotly debated over many years [11]–[19]. It is now generally accepted that the lack of direct correspondence between statistical and biological interaction makes it difficult to make strong inferences concerning biological mechanism from the existence of interaction terms in a statistical model. Nevertheless, the existence of such terms does imply that the interacting factors should at least both be ‘involved’ in disease in some way. Detection of statistical interaction thus provides a good starting point for a more focussed investigation of the joint involvement of the relevant factors, which can perhaps be better addressed through other types of experimental data. In addition, the increased detection power provided by statistical models that include interaction terms, when such terms do in fact operate [20], motivates the development of improved methods for detecting and modelling statistical interaction, particularly in the context of GWAS. The hope is that such methods will be useful for detecting effects that may be missed in standard single-locus analysis, thus providing a complementary strategy to standard GWAS analysis approaches for detecting loci involved in disease.
In case/control studies, statistical interaction is generally modelled as departure from a simple linear model describing the individual (main) effects of predictor variables on the predicted log odds of disease [17]. Consider two binary variables, $x_1$ and $x_2$, whose presence/absence (coded 0/1) is believed to be associated with a disease outcome. Logistic regression models the main effects ($\beta_1$ and $\beta_2$) and interaction term ($\beta_{12}$) between the variables via the linear model