There are five main stages in an association study: (1) selection of population samples; (2) determination of the level and influence of population structure in the sample; (3) phenotyping the population sample for traits of interest; (4) genotyping the population, either for candidate genes/regions or as a genome-wide scan; and (5) testing the genotypes and phenotypes for association (Box 1).
The choice of association test is the last step of the study and depends largely on the previous steps, in particular on the characteristics of the population from which the genotypic and phenotypic data were collected (Breseghello and Sorrells 2006a; Breseghello and Sorrells 2006b; Lewis 2002). Furthermore, complications due to population structure in the study sample may adversely affect the association test results. The influence of population structure on an association study depends on the relatedness between the sampled individuals. Therefore, the populations amenable to association studies may be classified according to the level of relatedness between the individuals forming the association population.
In the following subsections, we will first discuss the influences of population structure on various association study designs, followed by examples of control for its influences by accounting for the relatedness between individuals forming the association population.
The most important constraint on the use of association mapping in crop plants is unidentified population substructure and admixture due to factors such as adaptation or domestication (Thornsberry et al. 2001; Wright and Gaut 2005). Population structure creates genome-wide linkage disequilibrium between unlinked loci. When allele frequencies differ significantly between sub-populations of a species, due to factors such as genetic drift, domestication or background selection, genetic loci that have no effect whatsoever on the trait may show statistically significant co-segregation with a trait of interest. Provided that a large number of neutral markers is available for estimating the genome-wide effects of structure, it is possible to account for such effects statistically in association data analysis (Yu et al. 2006b).
In cases where the population structure is mostly due to population stratification (Bamshad et al. 2004; Pritchard 2001), three methods are generally considered suitable for statistically controlling its effects on association tests: (1) genomic control (GC) (Devlin et al. 2004; Devlin and Roeder 1999; Devlin et al. 2001); (2) structured association (SA), with two variants that depend on the type of study, one for case-control studies (SA-model) (Pritchard et al. 2000b) and one for quantitative trait association studies (Q-model) (Camus-Kulandaivelu et al. 2006; Thornsberry et al. 2001); and (3) the unified mixed model approach (Q+K) (Yu et al. 2006b).
The first method suggested for statistically controlling population structure was GC, which assumes that population structure has an equivalent effect on all loci genome-wide. In the GC method, a small random set of markers (e.g., polymorphisms unlikely to affect the trait of interest) is used to estimate the influence of population structure on the association test statistics (the inflation factor), and the significance of the association statistic (P value) is then adjusted to account for population structure. The general principle of GC is to use the individual genomes in the sample to estimate the level of confounding due to substructure and to more direct relatedness, such as familial relationships, and to scale the final significance level of the reported association accordingly (Devlin et al. 2001).
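The adjustment described above can be sketched as follows. This is a minimal illustration rather than the published procedure in full: it assumes 1-degree-of-freedom chi-square statistics, estimates the inflation factor as the median statistic at presumed-neutral markers divided by the median of the null chi-square distribution, and uses made-up input values. The function names are hypothetical.

```python
import math
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549

def inflation_factor(null_stats):
    """Estimate the inflation factor from statistics at presumed-neutral markers."""
    lam = median(null_stats) / CHI2_1DF_MEDIAN
    # Common convention (an assumption here): never deflate the statistics.
    return max(lam, 1.0)

def chi2_1df_pvalue(stat):
    """Upper-tail P value of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(stat / 2.0))

def gc_adjust(candidate_stat, null_stats):
    """Scale a candidate-marker statistic by the inflation factor."""
    adjusted = candidate_stat / inflation_factor(null_stats)
    return adjusted, chi2_1df_pvalue(adjusted)

# Illustrative (invented) values: three neutral markers, one candidate marker.
null_stats = [0.2, 0.9098, 3.1]
adjusted, p_value = gc_adjust(8.0, null_stats)
print(adjusted, p_value)
```

With real data the null set would contain many more markers, since the median of a handful of statistics is a noisy estimate of the inflation factor.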
Structured association methodology utilizes marker loci unlinked to the candidate genes under investigation to infer subpopulation membership. Structured association is applied to qualitative and quantitative traits using the model appropriate to the trait and population type, that is, the SA or Q model respectively. When SA is applied to quantitative trait association (Q-model), a two-stage procedure is used: in the first stage, each subject's probability of membership in each subpopulation is estimated (Pritchard et al. 2000a; Pritchard et al. 2000b), and in the second stage, a test of association is conducted using subpopulation membership as a variable in the association model tested (Pritchard et al. 2000b). In case-control studies, the probability of the SNP frequency distribution based on population structure is compared between the case and control samples. For quantitative traits, the population structure estimates are used as covariates in the regression model that defines the correlation of the genotype with the phenotype (Camus-Kulandaivelu et al. 2006; Thornsberry et al. 2001).
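The second stage of the Q-model can be illustrated with a toy regression. The sketch below is not the implementation used in the cited studies: it assumes that the membership estimates (Q) are already available, uses invented data for two subpopulations that differ in both trait mean and allele frequency, and reports only the marker coefficient, without a significance test. All names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def marker_effect(genotypes, Q, phenotypes, with_structure=True):
    """Marker coefficient from y ~ 1 + genotype (+ Q covariates)."""
    X = []
    for g, q in zip(genotypes, Q):
        row = [1.0, float(g)]
        if with_structure:
            # Drop one Q column: memberships sum to 1 and would be
            # collinear with the intercept otherwise.
            row += list(q[:-1])
        X.append(row)
    return ols(X, phenotypes)[1]

# Invented data: the marker has no effect within either subpopulation.
genotypes = [1, 1, 1, 0, 0, 0, 0, 1]
Q = [(1.0, 0.0)] * 4 + [(0.0, 1.0)] * 4
phenotypes = [10.0, 11.0, 9.0, 10.0, 20.0, 21.0, 19.0, 20.0]

naive = marker_effect(genotypes, Q, phenotypes, with_structure=False)
adjusted = marker_effect(genotypes, Q, phenotypes, with_structure=True)
print(naive, adjusted)
```

In this toy data set the unadjusted regression attributes the difference between subpopulations to the marker, whereas the coefficient essentially vanishes once the membership covariate is included.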
In the unified mixed model approach (also known as the Q+K model) of Yu et al. (2006b), a large set of random markers providing genome-wide coverage is used to estimate population structure (Q) and a relative kinship matrix (K), which are then fitted in a mixed-model framework to test for marker-trait association. In this approach, the two factors that may confound association analysis, namely familial relatedness between individuals (K) and relatedness due to population structure (Q), are treated as separate terms. To account for their combined effects, both are included in the model that relates genotype to phenotype during association testing: Q as fixed-effect covariates and K as the covariance structure of a random polygenic effect.
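For orientation, the Q+K model is commonly written in the following form. The notation is given here as a reading aid and should be checked against Yu et al. (2006b); the symbols are not defined elsewhere in this text.

```latex
\[
  y = X\beta + S\alpha + Qv + Zu + e,
  \qquad
  \operatorname{Var}(u) = 2K V_g,
  \qquad
  \operatorname{Var}(e) = R V_R
\]
% y      : vector of phenotypic observations
% \beta  : fixed effects other than marker or population effects
% \alpha : marker (SNP) effects under test
% v      : subpopulation effects, related to y through the membership matrix Q
% u      : random polygenic background effects, with covariance defined by K
% e      : residual effects
% X, S, Z: incidence matrices relating y to \beta, \alpha and u
```

Setting the Qv term aside gives a K-only model, and dropping the Zu term reduces the expression to the Q-model regression described earlier.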
The genetic makeup of the study population that was used to collect genotype and phenotype data defines the model and type of association statistics to be used for association tests. This will be discussed further in the next section.
If the individuals forming the study population are effectively unrelated, the study population may be considered a random sample of individuals from the species population and is therefore equivalent to any natural population. The relatedness among the individuals forming the population can either be estimated using pedigrees (Emik and Terrill 1949) or inferred using molecular markers (Blouin 2003; Lynch and Ritland 1999; Oliehoek et al. 2006; Wang 2002). To form a classic association population, these individuals can either be selected from natural populations or subselected from material included in breeding programs. Selecting individuals from breeding programs offers the advantage of easy incorporation into future breeding programs; however, the number of lineages represented in the association study becomes limited (Breseghello and Sorrells 2006a; Breseghello and Sorrells 2006b).
All the previously mentioned statistical methods for inferring population structure are applicable to classic association populations; however, the Q+K model has the widest applicability across structured association study designs in natural populations.
In plants, the focus so far has been on quantitative traits in natural populations. In maize, sampling diverse inbred lines from across the world's breeding programs made it possible to select 102 lines with relatively few closely related individuals (Remington et al. 2001a; Thornsberry et al. 2001). However, as samples were enlarged to over 300 maize lines to increase statistical power, it became extremely difficult to find samples that match the structure expected in natural populations (Flint-Garcia et al. 2005). These are the cases in which combined natural and family-based approaches are most powerful (Yu et al. 2006a). In Arabidopsis (Nordborg et al. 2005), natural samples were collected from around the world, but because of strong population structure and selfing, these samples in many respects behave more like families for association mapping purposes (Aranzana et al. 2005). Association studies with some tree species are more likely to fit the model of effectively unrelated individuals (Gonzalez-Martinez et al. 2006b; Thumma et al. 2005). Most crop plant studies will probably fall on a continuum between natural and family-based association populations.
If the association population is a collection of unrelated families, instead of single unrelated individuals, it is possible to perform a joint linkage and association analysis on the population, which can potentially be more informative about the trait of interest than either approach alone (Holte et al. 1997; Karayiorgou et al. 1999). For instance, in human genetics, where the association populations are collections of parent-offspring trios, two types of study design are considered: transmission disequilibrium tests (TDTs) (Allison 1997; Fulker et al. 1999; Monks et al. 1998; Rabinowitz 1997; Spielman et al. 1993) and family-based association tests (FBATs) (Herbert et al. 2006; Horvath et al. 2001; Laird et al. 2000; Laird and Lange 2006; Lake et al. 2000; Lange et al. 2003). Stich et al. (2006) modified the QTDT algorithm to test its applicability to inbred plant populations and developed a model named the Quantitative Inbred Pedigree Disequilibrium Test (QIPDT) for the analysis of joint linkage and association data from crop plant populations. Another family-based design, developed essentially for crop and livestock breeding, is Henderson's mixed model approach (Henderson 1975), generally known for its application in Best Linear Unbiased Predictors (BLUPs). Family-based association study designs investigate co-segregation and linkage simultaneously (Spielman et al. 1994).
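The basic TDT for a biallelic marker in parent-offspring trios can be sketched as below. This is a minimal illustration of the original qualitative-trait test (a McNemar-type statistic on transmissions from heterozygous parents), not of the quantitative extensions or of QIPDT. The counts are invented and the function name is hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic marker.

    transmitted   : times heterozygous parents passed the allele of
                    interest to an affected offspring
    untransmitted : times heterozygous parents passed the other allele
    Returns the 1-df chi-square statistic and its P value.
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (b - c) ** 2 / (b + c)
    # Upper-tail P value of a chi-square statistic with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Invented counts: 60 transmissions versus 40 non-transmissions.
stat, p_value = tdt(60, 40)
print(stat, p_value)
```

Only heterozygous parents contribute to the statistic, which is why the test is robust to population stratification: each parent acts as its own control.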
A long-standing mixed model method has been used by animal scientists to analyze data from extended pedigrees in dairy and beef cattle breeding programs (Henderson 1975; Henderson 1976; Henderson 1984). The strength of the mixed model lies in its incorporation of phenotypic observations from the relatives of an individual into the estimation of that individual's breeding value. The amount of information incorporated depends on the heritability of the trait and on the genetic relationships among individuals (traditionally defined by pedigree information). This method has since been extended to quantify the effect of a single gene while accounting for pedigree relationships (Kennedy et al. 1992), and it is applicable to association mapping with family-based association populations. Within this mixed model framework, Yu et al. (2006b) suggested replacing the pedigree-based co-ancestry with a marker-based relative kinship (K) to account for the relatedness among individuals.
This unified mixed model approach has been shown to be the most powerful of the statistics compared, both for family-based association studies and for studies falling between classical and family-based designs. The flexibility and generality of the approach allow association studies to be carried out on any population, without restriction to a specific family structure.
Recently, the field of plant association genetics pioneered the use of a new type of association population, designed to combine the advantages of linkage-based and linkage disequilibrium-based approaches to quantitative trait dissection in a stronger design than the Transmission-Disequilibrium Test (TDT) design. This builds on some of the joint linkage-association approaches used in cattle breeding (Blott et al. 2003; Meuwissen and Goddard 1997). Nested association mapping (NAM) populations are developed through controlled crosses between a diverse selection of unrelated individuals, according to a breeding scheme that aims to shuffle alleles from the diverse samples either across backgrounds or against a reference background, while keeping track of the number and locations of the recombination events that shuffle the parental chromosomes (Yu et al. 2006a). The subsequent generations of progeny from the crosses can then be used as association populations. A population generated according to this scheme not only provides tremendous power for statistical tests of association, but also enables the projection of genotype information from the parents onto the progeny, reducing genotyping cost for large studies. The cross design is expected to reduce many of the effects of admixture and population structure on the association population. For such populations, a two-step procedure for association analysis is suggested.
The two-stage study design of nested association mapping requires deep sequencing or genotyping of the parents to identify SNPs across the genome, followed by lower-density genotyping of the progeny to infer the locations of the recombination breakpoints that arose during the crosses. Once the recombination breakpoints are localized and the recombination blocks are traced back to the contributing parent, the haplotype information from the parents can be projected directly onto the progeny genome, without further genotyping within these blocks.
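The projection step can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Yu et al. (2006a): it treats a single chromosome of one inbred (fully homozygous) progeny line from one biparental family, takes the parental origin at each low-density marker as already known, and leaves any SNP that lies in an interval containing a breakpoint, or outside the outermost markers, unassigned. All names and data are hypothetical.

```python
from bisect import bisect_left

def project(dense_positions, parent_alleles, marker_positions, marker_origin):
    """Project parental SNP alleles onto a progeny line.

    dense_positions  : sorted positions of SNPs scored in the parents
    parent_alleles   : dict mapping parent name to its allele at each dense SNP
    marker_positions : sorted positions of the low-density progeny markers
    marker_origin    : parental origin of the progeny at each marker
    """
    projected = []
    for i, pos in enumerate(dense_positions):
        right = bisect_left(marker_positions, pos)
        left = right - 1
        origin = None
        if right < len(marker_positions) and marker_positions[right] == pos:
            # The SNP coincides with a genotyped marker.
            origin = marker_origin[right]
        elif 0 <= left and right < len(marker_positions):
            # Both flanking markers must agree, otherwise a breakpoint
            # lies somewhere in the interval and the origin is unknown.
            if marker_origin[left] == marker_origin[right]:
                origin = marker_origin[left]
        projected.append(parent_alleles[origin][i] if origin is not None else None)
    return projected

# Invented example: one crossover between positions 25 and 45.
dense_positions = [10, 20, 30, 40, 50, 60]
parent_alleles = {
    "A": ["A", "C", "G", "T", "A", "C"],
    "B": ["G", "T", "A", "C", "G", "T"],
}
marker_positions = [5, 25, 45, 65]
marker_origin = ["A", "A", "B", "B"]

result = project(dense_positions, parent_alleles, marker_positions, marker_origin)
print(result)
```

Denser progeny genotyping narrows the intervals that contain breakpoints and therefore reduces the number of SNPs left unassigned.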
This design scheme enables the researcher to utilize the advantages of both linkage-based and linkage disequilibrium-based genetic mapping approaches. It provides genome-wide coverage with high resolution and is performed on an experimental cross that is robust to genetic heterogeneity, with several alleles per locus represented in a large population.
Because of the balanced design, straightforward multiple regression approaches can be applied for association testing (Yu et al. 2006a). Currently, the availability of such nested association populations is reported for maize (Yu et al. 2006a) and loblolly pine (Baltunis 2005; Ersoz 2006; Kayihan et al. 2005). Further statistical methods that will combine information from both the parent and progeny generations of NAM-type populations are currently under development.
The association population designs described above represent a continuum of LD levels, from low in classic association populations to high in biparental breeding populations. Nested association populations, which are similar to heterogeneous intermated populations (Niebur et al. 2004), fall in the mid-range of this continuum, with moderate levels of LD and linkage.