Contributors: Zaid Abdo, Ursel Schuette, Stephen J. Bent, Christopher J. Williams, Larry J. Forney, and Paul Joyce
The analysis of Terminal Restriction Fragment Length Polymorphisms (T-RFLP) of 16S rRNA genes has proven to be a facile means to compare microbial communities and presumptively identify abundant members. The method provides data that can be used to compare different communities based on similarity or distance measures. Once communities have been clustered into groups, clone libraries can be prepared from one or more samples that are representative of each group in order to determine the phylogeny of the numerically abundant populations in a community. We have developed an approach for the statistical analysis of T-RFLP data that includes objective methods for (a) determining a baseline so that "true" peaks in electropherograms can be identified, (b) comparing electropherograms and binning fragments of similar size, (c) clustering communities that are similar to one another, and (d) selecting samples that are representative of a cluster and can be used to construct 16S rRNA gene clone libraries.
Signals typically have much larger areas (and higher peak heights) than background noise and hence add more to the variation in the data. The variance is calculated by assuming the true mean of the background fluorescence is zero. Large peaks are considered to be outliers and are progressively eliminated from the dataset. The calculation is done recursively until there are no large peaks to be removed and the remaining variation represents the background 'noise' alone.
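The recursive elimination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name split_signal_from_noise and the cutoff of three standard deviations (k = 3) are assumptions, since the text does not state how large a peak must be to count as an outlier.

```python
import math


def split_signal_from_noise(areas, k=3.0):
    """Separate "true" peaks from background noise.

    areas: peak areas (or heights) from one electropherogram.
    k: assumed cutoff, in standard deviations, above which a peak
       is treated as an outlier (not specified in the text).
    Returns (signal, noise) as two lists.
    """
    noise = list(areas)
    signal = []
    while noise:
        # Standard deviation computed about an assumed true mean of zero.
        sd = math.sqrt(sum(a * a for a in noise) / len(noise))
        outliers = [a for a in noise if a > k * sd]
        if not outliers:
            break  # nothing left to remove: the remainder is background
        signal.extend(outliers)
        noise = [a for a in noise if a <= k * sd]
    return signal, noise


# Thirty small background peaks plus two large ones.
example = [1.0, 2.0] * 15 + [100.0, 80.0]
signal, noise = split_signal_from_noise(example)
```

Because the variance is recomputed after each round, peaks that were masked by larger ones in an early pass can still be identified in a later pass.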
Binning is done by first pooling all the data (fragment lengths) from all the samples. These fragment lengths are then sorted and duplicate lengths are eliminated. Hierarchical clustering is performed to identify those fragments with lengths close enough to be grouped in the same length category (or bin). Fragment lengths binned together are then represented by their average length. A matrix is created in which the first column holds the new representative fragment lengths and each subsequent column holds the peak areas associated with one of the samples.
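The binning steps can be sketched as below, under stated assumptions. The text does not name the linkage method or the cutoff, so this sketch assumes single-linkage clustering cut at a tolerance tol; for one-dimensional data that is equivalent to splitting the sorted lengths wherever the gap between neighbours exceeds tol. The function name bin_fragments, the input layout, and the choice to sum areas when one sample has several peaks in a bin are also assumptions.

```python
def bin_fragments(samples, tol=1.0):
    """Bin fragment lengths across samples and build the area matrix.

    samples: dict mapping sample name -> {fragment_length: peak_area}.
    tol: assumed maximum gap between neighbouring lengths in one bin.
    Returns (names, table); each table row is
    [representative_length, area_in_sample_1, area_in_sample_2, ...].
    """
    # Pool all lengths, drop duplicates, and sort.
    lengths = sorted({length for peaks in samples.values() for length in peaks})

    # Single-linkage clustering in one dimension: start a new bin
    # whenever the gap to the previous length exceeds tol.
    bins = []
    for length in lengths:
        if bins and length - bins[-1][-1] <= tol:
            bins[-1].append(length)
        else:
            bins.append([length])

    names = sorted(samples)
    table = []
    for members in bins:
        representative = sum(members) / len(members)  # average length
        row = [representative]
        for name in names:
            # Assumption: areas of peaks falling in the same bin are summed.
            row.append(sum(samples[name].get(m, 0.0) for m in members))
        table.append(row)
    return names, table


profiles = {
    "A": {100.2: 50.0, 150.0: 20.0},
    "B": {100.6: 30.0, 200.0: 10.0},
}
names, table = bin_fragments(profiles, tol=1.0)
```

In this example the lengths 100.2 and 100.6 fall into one bin represented by their average, while 150.0 and 200.0 each form their own bin.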
Typically, the number of different kinds of communities (clusters) is subjectively determined, a process that is fraught with problems. We identified three indices to assess the number of clusters in a dataset: the pseudo F, the CCC (Cubic Clustering Criterion), and the pseudo T2. We recommend that the optimal number of clusters be identified by first evaluating both the pseudo F and the CCC indices. If they are in agreement, the pseudo T2 can be used to corroborate the conclusion. If the pseudo F and the CCC are not in agreement, the number of clusters is chosen based on the highest pseudo T2 index value.
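As an illustration of one of these indices, the sketch below computes the pseudo F statistic in its usual form (the Calinski-Harabasz index): the between-cluster sum of squares divided by k - 1, over the within-cluster sum of squares divided by n - k. The function name pseudo_f and the use of Euclidean distance on the rows of the data matrix are assumptions; the text does not say which distance measure was used.

```python
def pseudo_f(points, labels):
    """Pseudo F (Calinski-Harabasz) index for one clustering.

    points: list of equal-length numeric rows (one row per sample).
    labels: cluster label for each row.
    Requires 1 < k < n clusters and non-zero within-cluster variation.
    """
    n = len(points)
    dim = len(points[0])

    groups = {}
    for row, label in zip(points, labels):
        groups.setdefault(label, []).append(row)
    k = len(groups)

    def centroid(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand = centroid(points)
    # Between-cluster sum of squares, weighted by cluster size.
    between = sum(len(rows) * sqdist(centroid(rows), grand)
                  for rows in groups.values())
    # Within-cluster sum of squares about each cluster centroid.
    within = 0.0
    for rows in groups.values():
        centre = centroid(rows)
        within += sum(sqdist(r, centre) for r in rows)

    return (between / (k - 1)) / (within / (n - k))


data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
value = pseudo_f(data, [0, 0, 1, 1])
```

Evaluating the index over a range of candidate cluster counts and looking for a peak is the usual way to read it; larger values indicate tighter, better separated clusters.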
There are four possible methods for selecting samples that are representative of a cluster. Two of these use the coefficient of variation as a decision rule, while the other two use the percent cover. The choice of method depends on how much of the variation in the cluster the researcher wants to explain. The lowest resolution results from using the Systematic Cover method, which focuses on richness alone in choosing a sample. Its advantage is that it selects smaller sample sizes than the other methods. The highest resolution results from choosing a sample using the Maximum Variation method, which aims to explain as much of the variation in the cluster as possible, at the cost of larger sample sizes than the other methods.
The goal is to ensure that all of the common types in the population are sampled with high probability. In general, if Po is the minimum population frequency for a common type and 1-α is the probability that all the common types are sampled, then the general formula for the sample size is given by: n = ln(α*Po)/ln(1-Po)
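The formula above can be evaluated directly, as in the short sketch below. The function name clone_sample_size is hypothetical, and rounding the result up to the next whole number is an assumption made here because a sample size must be an integer.

```python
import math


def clone_sample_size(p0, alpha):
    """Sample size n = ln(alpha * p0) / ln(1 - p0), rounded up.

    p0: minimum population frequency of a common type (0 < p0 < 1).
    alpha: 1 - alpha is the probability that all common types
           are sampled (0 < alpha < 1).
    """
    if not (0.0 < p0 < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("p0 and alpha must lie strictly between 0 and 1")
    n = math.log(alpha * p0) / math.log(1.0 - p0)
    # Assumption: round up so the probability target is still met.
    return math.ceil(n)


n_clones = clone_sample_size(0.1, 0.05)
```

For example, with Po = 0.1 and α = 0.05 the formula gives about 50.3, so 51 would be sampled. Lowering Po or α increases the required sample size.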