Exploiting the topology of a torus enhances principal component analysis of protein data

JAN 01, 2018

Topological features of the torus can be used to significantly reduce data projection errors in simple principal component analysis on high-dimensional periodic manifolds.

J. H. Majors

DOI: 10.1063/1.5020876


Finding structure and meaning in large data sets depends on having effective analysis methods, which are becoming harder to develop and tune as data sets continue to grow in size and complexity. A central task is reducing high-dimensional data to a few dimensions that can be visualized while retaining the essential information. Principal component analysis (PCA) is a popular approach because it constructs low-dimensional collective coordinates in a straightforward way, via unitary linear transformations.

To account for periodicity in input data — best described using circular coordinate systems — generalizations of PCA project data on alternative geometries. In the case of protein dynamics, circular motion is commonly treated by nonlinear transformations or a mapping from the high-dimensional torus to Riemannian surfaces.

In The Journal of Chemical Physics, the authors report dimensionality reduction of protein data via PCA on a torus. By taking into account the topology of the coordinate space and how it relates to the structure of the data, they demonstrate that PCA on a high-dimensional torus has a surprisingly simple solution.

Gaps in the periodic data arise here from the typical distributions of the backbone dihedral angles of proteins. Shifting these gaps to line up with the torus's periodic boundary minimizes the cases in which neighboring data points become separated by the projection. This significantly reduces errors that would otherwise arise from linearly projecting the data onto a geodesic describing the shortest path between points.
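The gap-shifting idea can be illustrated with a minimal Python sketch. It assumes the dihedral angles are given in radians as a NumPy array of shape (samples, angles), and it locates the largest empty arc between sorted samples in each dimension, which is one simple way to find a gap and not necessarily the procedure used in the paper. The function names `shift_to_max_gap` and `pca` are hypothetical.

```python
import numpy as np

TWO_PI = 2.0 * np.pi


def shift_to_max_gap(angles):
    """Shift each angular coordinate so its largest empty arc sits at the
    periodic boundary. Illustrative only; assumes angles in radians."""
    angles = np.asarray(angles, dtype=float)
    shifted = np.empty_like(angles)
    cuts = np.empty(angles.shape[1])
    for j in range(angles.shape[1]):
        col = np.mod(angles[:, j], TWO_PI)           # map to [0, 2*pi)
        s = np.sort(col)
        # Arc lengths between consecutive samples, including the wrap-around arc.
        gaps = np.diff(np.append(s, s[0] + TWO_PI))
        k = int(np.argmax(gaps))
        cut = np.mod(s[k] + 0.5 * gaps[k], TWO_PI)   # centre of the largest gap
        cuts[j] = cut
        # Re-wrap so the cut lies at the +/- pi boundary of the new interval.
        shifted[:, j] = np.mod(col - cut, TWO_PI) - np.pi
    return shifted, cuts


def pca(data, n_components=2):
    """Plain linear PCA via eigendecomposition of the covariance matrix."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    return centered @ eigvecs[:, order], eigvals[order]
```

For a cluster of angles that straddles ±π, the raw values span almost the full circle, while the shifted values occupy a narrow interval, so ordinary linear PCA no longer treats neighboring points as distant.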

The solution is useful in applications of PCA beyond protein dynamics. “Any kind of PCA on Riemannian geometry with closed loops, i.e., periodicities in one or more dimensions, will benefit from regarding this projection problem in the analysis,” said co-author Florian Sittel, who also noted that the solution on the torus should work well with other geometries.

Source: “Principal component analysis on a torus: Theory and application to protein dynamics,” by Florian Sittel, Thomas Filk, and Gerhard Stock, The Journal of Chemical Physics (2017). The article can be accessed at https://doi.org/10.1063/1.4998259 .