Principal components are a set of variables that define a projection capturing the maximum amount of variation in a dataset. Each component is orthogonal to (and therefore uncorrelated with) the preceding principal components of the same dataset.
In statistics, principal component analysis (PCA) is a technique for simplifying a dataset. It is a linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data lies along the first coordinate (called the first principal component), the second greatest variance along the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the "most important" aspects of the data, although this is not necessarily the case and depends on the application.
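The transformation described above can be sketched with NumPy by diagonalizing the sample covariance matrix. This is a minimal illustration, not a reference implementation: the synthetic data, the random seed, and all variable names (`X`, `scores`, `reduced`, `explained`) are assumptions made for the example.

```python
import numpy as np

# Hypothetical data: 200 samples of 3 variables, two of them strongly correlated.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    3.0 * latent + 0.3 * rng.normal(size=(200, 1)),
    1.5 * latent + 0.3 * rng.normal(size=(200, 1)),
    0.5 * rng.normal(size=(200, 1)),
])

# Center the data, then diagonalize the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so that the first component carries the greatest variance.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Coordinates of every sample in the new system (the principal components).
scores = Xc @ eigvecs

# Dimensionality reduction: keep the two lowest-order components only.
reduced = scores[:, :2]

# Fraction of the total variance carried by each component.
explained = eigvals / eigvals.sum()
print(explained)
```

The columns of `eigvecs` are mutually orthogonal, and the variance of each column of `scores` equals the corresponding eigenvalue, which is why dropping the last columns discards the least variance.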
The blue lines represent two consecutive principal components. Note that they are orthogonal (at right angles) to each other.
PCA is also called the (discrete) Karhunen-Loève transform (or KLT, named after Kari Karhunen and Michel Loève) or the Hotelling transform (in honor of Harold Hotelling). PCA has the distinction of being the optimal linear transformation for keeping the subspace that has the largest variance. This advantage, however, comes at the price of greater computational requirements compared, for example, to the discrete cosine transform. Unlike many other linear transforms, PCA does not have a fixed set of basis vectors: its basis vectors depend on the data set.
Principal components and related or derived methods are procedures for simplifying multivariate data with minimum loss of information. Almost all sets of data formed by many populations and many gene frequencies contain some internal redundancy that is measured, for instance, by the correlations existing between genes in a pair of populations. Two identical populations would have a correlation of one; as they differentiate from one another in the course of evolution, the correlation decreases. Thus, correlation between populations can be said to measure the history of their common descent. A distance is the opposite of a correlation; it is zero for identical populations and increases with their differentiation. For certain formulas of distance, one can express the exact interdependence between distance and correlation and, in the simplest case, the relationship between them is d = 1 - r, where r is a correlation and d a distance measure (to be further transformed, however, if one wants to make it proportional to evolutionary time). Thus, the matrix of correlations and that of distances between population pairs are in a one-to-one relationship, but they have opposite meanings.
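As a small illustration of the simplest relationship, d = 1 - r, the following sketch converts a correlation matrix between populations into a distance matrix. The gene frequencies are invented placeholders for three hypothetical populations, and the names `freqs`, `r`, and `d` are assumptions of the example.

```python
import numpy as np

# Hypothetical gene frequencies (%): one row per population, one column per gene.
# Populations 0 and 1 are nearly identical; population 2 has diverged.
freqs = np.array([
    [10.0, 40.0, 25.0, 60.0, 5.0],
    [12.0, 38.0, 27.0, 58.0, 6.0],
    [30.0, 20.0, 50.0, 30.0, 20.0],
])

# np.corrcoef treats each row as a variable, so r is population-by-population.
r = np.corrcoef(freqs)

# Simplest distance: zero for identical populations, growing as correlation falls.
d = 1.0 - r
print(np.round(d, 3))
```

Because `d` is obtained from `r` by a fixed invertible formula, either matrix can be recovered from the other, which is the one-to-one relationship mentioned above.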
Principal components offer the simplest mode of analysis of a set of populations-by-gene frequency data. Here is a very simple example of application, using only five populations and five genes.
|Gene ||Africa ||Asia ||Europe ||America ||Australia|
Gene-Frequency Data (%) on Five Populations and Five Genes
In order to visualize the procedure more clearly, let us at first consider only the first two of the five genes, and plot their frequencies as abscissa and ordinate of a Cartesian diagram. The five populations are represented by five points in the diagram below:
We draw a straight line as close as possible to the five points, using the criterion that the sum of the squared distances from the points to the line (dotted lines) must be a minimum. Incidentally, this criterion differs from the familiar one employed for calculating usual regression lines, where the segments (the sum of squares of which is minimized) are parallel to the abscissa or to the ordinate. Here the segments are orthogonal to the line, which is called the (first) principal axis or component. The original values of the two gene frequencies can now be replaced with the five points P1, P2, P3, P4, and P5 on the principal axis, with an arbitrary scale (usually chosen so that the five points have zero mean and variance equal to 1).
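The difference between the principal axis and an ordinary regression line can be checked numerically. The sketch below takes the criterion in its usual least-squares form (sum of squared perpendicular distances) and uses five invented points in place of the two gene frequencies; the data and the helper `perp_sq_sum` are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

# Hypothetical frequencies (%) of two genes in five populations.
pts = np.array([
    [10.0, 20.0],
    [20.0, 28.0],
    [30.0, 45.0],
    [40.0, 48.0],
    [50.0, 70.0],
])
centered = pts - pts.mean(axis=0)

# First principal axis: the leading right singular vector of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
axis = vt[0]

# Positions P1..P5 along the axis, rescaled to zero mean and unit variance.
t = centered @ axis
pc1 = (t - t.mean()) / t.std()


def perp_sq_sum(direction):
    """Sum of squared perpendicular distances to a line through the centroid."""
    direction = direction / np.linalg.norm(direction)
    along = centered @ direction
    resid = centered - np.outer(along, direction)
    return float((resid ** 2).sum())


# Ordinary regression of y on x minimizes vertical, not perpendicular, segments.
slope = (centered[:, 0] @ centered[:, 1]) / (centered[:, 0] @ centered[:, 0])
ols_dir = np.array([1.0, slope])

print(perp_sq_sum(axis), perp_sq_sum(ols_dir))
```

For any direction through the centroid, `perp_sq_sum(axis)` is never larger than the value obtained for that direction, including the regression line.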
The resulting PC representation is shown at the bottom of the diagram. By replacing the original data with their first PC value, we have lost some information, but the process of minimization has reduced the loss and we have gained in simplicity because we have a straight line instead of a bidimensional display.
Displays of principal-component values like those in the next figure are usually called PC maps. The distances of the populations in the map mimic their genetic differences, but relationships can be more or less distorted, depending on the amount of information lost. Naturally, the first PC in this figure differs from that in the first figure: this one is based on all five genes, whereas the first is based on only two of them. In PC maps the two axes are often scaled to express the relative importance of the principal components: the first is more important than the second, and so on.
Five populations (Africa, Asia, Europe, America, and Australia), with the five gene frequencies listed in the table above, represented by their first two principal components.
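A PC map of this kind can be sketched as follows. The table values are not reproduced in this text, so the frequencies below are invented placeholders and do not correspond to the actual data; the variable names are likewise assumptions. The singular value decomposition is used here as one common way to obtain the components.

```python
import numpy as np

populations = ["Africa", "Asia", "Europe", "America", "Australia"]

# Placeholder gene frequencies (%), NOT the values from the table:
# one row per population, one column per gene.
freqs = np.array([
    [20.0, 35.0, 10.0, 50.0, 5.0],
    [45.0, 30.0, 25.0, 40.0, 15.0],
    [50.0, 25.0, 30.0, 35.0, 20.0],
    [55.0, 40.0, 20.0, 60.0, 10.0],
    [30.0, 45.0, 15.0, 45.0, 25.0],
])

# Center each gene, then decompose; the rows of vt are the principal axes.
centered = freqs - freqs.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of each population on every principal component.
scores = u * s

# The PC map keeps only the first two components.
pc_map = scores[:, :2]

# Relative importance of each component (share of total variance).
explained = s ** 2 / (s ** 2).sum()

for name, (x, y) in zip(populations, pc_map):
    print(f"{name:10s} PC1={x:8.2f} PC2={y:8.2f}")
print("share of variance:", np.round(explained[:2], 3))
```

Leaving the scores unstandardized, as done here, means the spread along each axis already reflects the relative importance of that component; `explained` gives the same information as a fraction of the total variance.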