In the previous section we introduced the squared-correlation parameter r² as a measure of linkage disequilibrium between a pair of SNPs. Before turning to the issue of estimating this parameter from genotypic information, we would like to introduce another measure of linkage disequilibrium: the coefficient of association, denoted by D'. This measure is very popular among population geneticists. It is intended to describe the level of linkage disequilibrium in the current population relative to its level at the formation of the population. The parameter has the form of a ratio between two correlation coefficients. The numerator is the correlation coefficient computed from the given 2 x 2 table of haplotype frequencies in the population. The denominator is the correlation coefficient computed from another 2 x 2 table. The marginal frequencies of this new table are identical to those of the original table, but one of its entries is set equal to zero. The new table represents the distribution of haplotypes at the formation of the population (or at the formation of the younger of the two SNPs), before recombination could reduce the correlation between the two SNPs by breaking up haplotypes. (Fixing the value of a single entry, together with the marginal frequencies, uniquely determines all the entries of the table. Moreover, among tables with given marginal frequencies, the maximal correlation coefficient is obtained by setting the appropriate cell entry equal to zero.) The exact formula for D' can be read off the code of the MATLAB function that computes it.
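The construction described above can be illustrated with a short sketch. It is written in Python rather than MATLAB and is not the MATLAB function referred to in the text; the function names (`correlation`, `extreme_table`, `d_prime`) and the layout of the table (rows for the alleles of the first SNP, columns for the alleles of the second) are assumptions made for illustration. The sketch also assumes that both SNPs are polymorphic, so that no marginal frequency equals 0 or 1.

```python
import math


def correlation(table):
    """Correlation coefficient of a 2 x 2 table of haplotype frequencies."""
    (p11, p12), (p21, p22) = table
    p_row = p11 + p12  # frequency of the first allele at the first SNP
    p_col = p11 + p21  # frequency of the first allele at the second SNP
    d = p11 - p_row * p_col  # deviation from independence
    return d / math.sqrt(p_row * (1 - p_row) * p_col * (1 - p_col))


def extreme_table(table):
    """Table with the same margins, one cell equal to zero, and an
    association of the same sign as in the given table."""
    (p11, p12), (p21, p22) = table
    p_row = p11 + p12
    p_col = p11 + p21
    d = p11 - p_row * p_col
    if d >= 0:
        # Positive association: make the first cell as large as the
        # margins allow, which empties an off-diagonal cell.
        q11 = min(p_row, p_col)
    else:
        # Negative association: make the first cell as small as the
        # margins allow, which empties a diagonal cell.
        q11 = max(0.0, p_row + p_col - 1)
    # The remaining three cells follow from the margins.
    return [[q11, p_row - q11],
            [p_col - q11, 1 - p_row - p_col + q11]]


def d_prime(table):
    """Ratio of the observed correlation to the extreme correlation."""
    return correlation(table) / correlation(extreme_table(table))


# Hypothetical haplotype frequencies, not taken from Table 3.
example = [[0.4, 0.1],
           [0.1, 0.4]]
print(d_prime(example))
```

For the hypothetical table above both marginal frequencies are 0.5, the observed correlation is 0.6, and the extreme table has correlation 1, so the ratio is approximately 0.6. Because the extreme table is chosen to have the same sign of association as the observed one, this sketch always returns a value between 0 and 1; whether the MATLAB function follows the same sign convention is not stated in the text.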

The parameter D' is computed from the haplotype frequencies. For example, the value of the D' coefficient between markers rs737865 and rs165688, based on the data given in Table 3, is 0.8506. This estimate was obtained from a very large sample. Consequently, it is very likely to be a good approximation of the actual value of this parameter of association in the entire population. For our needs, however, we would like to ask how accurately one can expect to estimate the parameter when the sample size is much smaller.

We planned a small simulation that may help in gaining insight into the relation between the sample size, the actual information, and the accuracy of the statistical tool for inference. In this simulation, the haplotype frequencies of Table 3 were taken to be the true distribution in the population. A sample of a given size from this population was simulated. The EM algorithm was applied to that sample in order to produce the estimated distribution of haplotypes, and the estimated value of D' was computed from this estimated distribution. This procedure was repeated 10,000 times in order to obtain a distribution for the estimate of D'. We examined the effect of two factors on the distribution of the estimate. One factor was the sample  