MD Data Analysis

The YSL Group develops techniques for the unsupervised analysis of molecular dynamics trajectories and other noisy time series.

Molecular dynamics (MD) simulations are an increasingly powerful tool for predicting and analyzing molecular structure and behavior. These simulations generate extremely large quantities of data that may not be readily human-interpretable. Consequently, data science tools such as dimensionality reduction, cluster analysis, and change point detection are invaluable for understanding the output of MD simulations.

A major drawback of conventional cluster analysis protocols is that they treat snapshots from MD simulations as a collection of independent data points. In other words, they discard the temporal information stored in these data, which is crucial for understanding molecular kinetics. Our group has developed an algorithm named CATBOSS (Cluster Analysis of Trajectories Based on Segment Splitting), which relies on change point detection to perform segment-based clustering. That is, data points that are close in time are grouped into the same state unless they are drawn from significantly different distributions. This approach enables CATBOSS to preserve kinetic information and to allow for overlap between states, rather than creating hard, linear boundaries that result in spurious state interconversions. CATBOSS accomplishes this while simultaneously enhancing clustering resolution (in certain cases even allowing for the detection of “hidden” degrees of freedom) and reducing memory complexity relative to state-of-the-art point-based approaches.
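To illustrate the segment-based idea described above, the following Python sketch splits a one-dimensional time series at detected change points and then assigns whole segments, rather than individual frames, to states. It is a toy illustration only and not the published CATBOSS implementation: the sliding-window mean-shift detector, the greedy grouping by segment mean, and the names `detect_change_points`, `cluster_segments`, `window`, `threshold`, and `tol` are all assumptions made for this example.

```python
import numpy as np


def detect_change_points(x, window=20, threshold=1.0):
    """Return indices where the means of adjacent windows differ strongly.

    Hypothetical stand-in for a real change point detector.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    scores = np.zeros(n)
    # Score each index by the absolute difference between the mean of the
    # window that follows it and the mean of the window that precedes it.
    for t in range(window, n - window + 1):
        scores[t] = abs(x[t:t + window].mean() - x[t - window:t].mean())

    change_points = []
    t = window
    while t < n - window + 1:
        if scores[t] > threshold:
            # Take the highest score within the next window as the change
            # point, then skip ahead so one transition is reported once.
            end = min(t + window, n)
            peak = t + int(np.argmax(scores[t:end]))
            change_points.append(peak)
            t = peak + window
        else:
            t += 1
    return change_points


def cluster_segments(x, change_points, tol=1.0):
    """Assign one state label per segment by greedily grouping segment means.

    Hypothetical stand-in for a real segment clustering step.
    """
    x = np.asarray(x, dtype=float)
    bounds = [0, *change_points, len(x)]
    centers = []  # mean of the first segment assigned to each state
    labels = np.empty(len(x), dtype=int)
    for start, stop in zip(bounds[:-1], bounds[1:]):
        segment_mean = x[start:stop].mean()
        for state, center in enumerate(centers):
            if abs(segment_mean - center) < tol:
                break  # segment joins an existing state
        else:
            centers.append(segment_mean)  # segment founds a new state
            state = len(centers) - 1
        # Every frame in the segment inherits the segment's label.
        labels[start:stop] = state
    return labels
```

Because labels are assigned per segment, an isolated noisy frame that strays into the value range of another state keeps the label of the segment it belongs to, whereas a point-based threshold on the same data would register it as a brief, spurious state interconversion.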

One limitation of currently available change point detection algorithms, which our group recognized while developing CATBOSS, is the assumption that all transitions between states take place instantaneously; that is, that a trajectory can be modeled as a mixture of distinct distributions corresponding to metastable states. Real-world trajectories, however, often include gradual transitions, during which data points are not drawn from the distribution of any particular state. To address this issue, we have developed a novel change detection algorithm, dubbed BarT (Barycentric Transitions), in which such transitions are modeled as weighted barycentric interpolations between Laplace distributions corresponding to metastable states. Work to integrate BarT into the CATBOSS protocol is ongoing.
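The notion of a barycentric interpolation between Laplace distributions can be made concrete in one dimension. There, the Wasserstein barycenter of two distributions is obtained by averaging their quantile functions, so for two Laplace distributions it is again a Laplace distribution whose location and scale are the weighted averages of the endpoint parameters. The Python sketch below uses this fact to simulate a gradual transition. It is a minimal illustration under stated assumptions and not the BarT algorithm itself: the one-dimensional setting, the linear weight ramp, and the function names are all hypothetical choices made for this example.

```python
import numpy as np


def laplace_quantile(p, mu, b):
    """Quantile function of a Laplace(mu, b) distribution."""
    p = np.asarray(p, dtype=float)
    return mu - b * np.sign(p - 0.5) * np.log1p(-2.0 * np.abs(p - 0.5))


def interpolate_laplace(mu0, b0, mu1, b1, w):
    """Parameters of the 1D Wasserstein barycenter with weights (1 - w, w).

    Averaging quantile functions of a location-scale family amounts to
    averaging the location and scale parameters.
    """
    return (1 - w) * mu0 + w * mu1, (1 - w) * b0 + w * b1


def simulate_gradual_transition(mu0, b0, mu1, b1,
                                n_before, n_transition, n_after, seed=0):
    """Sample a series that moves gradually from state 0 to state 1.

    The linear ramp of the weight w is an assumption of this sketch.
    """
    w = np.concatenate([
        np.zeros(n_before),                 # pure state 0
        np.linspace(0.0, 1.0, n_transition),  # gradual transition
        np.ones(n_after),                   # pure state 1
    ])
    mu, b = interpolate_laplace(mu0, b0, mu1, b1, w)
    rng = np.random.default_rng(seed)
    # Each frame is drawn from the interpolated distribution at its weight.
    return w, rng.laplace(mu, b)
```

For example, `simulate_gradual_transition(0.0, 1.0, 4.0, 0.5, 50, 20, 50)` returns 120 weights and 120 samples. The 20 transition frames are drawn from distributions that belong to neither endpoint state, which is the situation an instantaneous-transition model cannot represent.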

Related YSL Group Papers

J. Damjanovic, J. M. Murphy, Y.-S. Lin, “CATBOSS: Cluster Analysis of Trajectories Based on Segment Splitting,” J. Chem. Inf. Model. 61, 5066‒5081 (2021).

J. Damjanovic, Y.-S. Lin, J. M. Murphy, “Modeling Changes in Molecular Dynamics Time Series as Wasserstein Barycentric Interpolations,” IEEE Xplore. Accepted.