Open Problem: Incremental Clustering

Introduction

Currently, we are using k-means clustering on a complete MD dataset to generate microstates, which are later merged into metastable states using, e.g., PCCA. In an adaptive MD simulation framework, this approach is problematic because a new or updated Markov model needs to be built whenever new simulations become available. Recomputing the clustering from scratch each time would be computationally inefficient; additionally, the stochastic nature of k-means is likely to introduce instabilities into the framework.

Goal: Find or design an algorithm which incrementally adapts an appropriate microstate clustering when new data becomes available. Such an algorithm should be fast (linear or log-linear in the number of datapoints) and guarantee some basic smoothness properties (no sudden jumps in the results when adding a few datapoints). Ideally, it should guarantee that kinetically separated data are not merged. It may be based on crisp or soft clustering.
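As a baseline for what "incremental" means here, consider MacQueen-style sequential k-means, in which each arriving datapoint only moves its nearest centroid. It fails several of the requirements below (the cost per point is linear in the number of clusters, and the result is order-dependent), but it illustrates the kind of per-point update we are after. A minimal sketch in Python (all names are ours, not an existing implementation):

    import numpy as np

    def sequential_kmeans_update(centroids, counts, x):
        """MacQueen-style online k-means: move only the nearest centroid.

        centroids: (K, d) array of current cluster centers
        counts:    (K,)   number of points assigned to each center so far
        x:         (d,)   newly arrived datapoint
        """
        k = np.argmin(np.linalg.norm(centroids - x, axis=1))  # O(K) nearest-center search
        counts[k] += 1
        centroids[k] += (x - centroids[k]) / counts[k]        # incremental running mean
        return k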

Large Datasets

The algorithm should be suitable for very large datasets, such as those found in folding@home simulations. The currently (spring 2008) largest dataset (8 milliseconds of villin headpiece) contains 160 million structures. In view of future simulations, the algorithm should be able to deal with several billions ($10^9$) of structures. This certainly requires that the algorithm has less than $N^2$ time complexity, where $N$ is the number of structures; at most $N \log N$, or $N \log K$ with $K < N$ being the number of centroids or representatives. This runtime can probably only be achieved with some sort of hierarchical data structure (e.g., a binary tree) which organizes the state space in such a way that closeness or neighborhood can be quickly determined by walking down the hierarchy, avoiding iteration over all datapoints or clusters. This is nontrivial in high dimensions, as it cannot be achieved by some sort of grid.
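As a rough sanity check that sub-quadratic assignment is realistic, one can already organize the $K$ centroids (not the $N$ structures) in a ball tree and assign all points in roughly $N \log K$ time. A sketch using scikit-learn's BallTree; treating the centers as fixed is of course a simplification:

    import numpy as np
    from sklearn.neighbors import BallTree

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 10))        # N structures with d-dimensional features
    centroids = rng.normal(size=(500, 10))    # K representatives

    tree = BallTree(centroids)                # hierarchical structure over the K centers
    dist, ind = tree.query(X, k=1)            # nearest-center lookup, ~O(N log K)
    labels = ind[:, 0]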

Incrementability

When new datapoints become available, the clustering should be updated with an operation that costs $\log N$ or $\log K$ per datapoint.
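One hypothetical way to meet this budget: keep the centroids in a ball tree, assign each arriving point in roughly $\log K$, update the winning centroid's running mean, and rebuild the tree only every R points so the amortized cost stays logarithmic. A sketch (the rebuild policy and its period are assumptions, not a worked-out design):

    import numpy as np
    from sklearn.neighbors import BallTree

    class IncrementalAssigner:
        """Assign arriving points to slowly moving centroids at ~O(log K) amortized cost."""

        def __init__(self, centroids, rebuild_every=1000):
            self.centroids = centroids.copy()
            self.counts = np.ones(len(centroids))
            self.rebuild_every = rebuild_every
            self.seen = 0
            self.tree = BallTree(self.centroids.copy())  # snapshot of the current centers

        def add_point(self, x):
            _, ind = self.tree.query(x[None, :], k=1)    # ~O(log K) lookup
            k = ind[0, 0]
            self.counts[k] += 1
            self.centroids[k] += (x - self.centroids[k]) / self.counts[k]
            self.seen += 1
            if self.seen % self.rebuild_every == 0:      # amortized rebuild over fresh centers
                self.tree = BallTree(self.centroids.copy())
            return k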

Stability

Discontinuities are to be avoided: new datapoints should not cause sudden major changes in the clustering. One idea towards this requirement is fuzzy/soft clustering, such as that proposed in Marcus Weber's PhD thesis. With soft clusters, every datapoint has a membership in every cluster (or a relation to every other datapoint). However, this seems to contradict the requirement of being fast and using a hierarchical structure to partition the data. A way out of this dilemma could be to keep the smoothness or fuzziness very local. Fuzziness here is a tool to average over local discontinuities in the state space caused by poor sampling, but there should be some cutoff beyond which datapoints or clusters are unrelated (see, e.g., the heat kernel used in diffusion maps). A hierarchical structure could be used to determine the possible neighbors.
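To make "local fuzziness" concrete, memberships could follow a heat-kernel weighting that is truncated beyond a cutoff radius, so that each datapoint is softly related only to the few centers returned by a hierarchical neighbor lookup. A sketch (the kernel width sigma and the cutoff are assumptions):

    import numpy as np
    from sklearn.neighbors import BallTree

    def local_soft_memberships(x, centroids, tree, sigma=1.0, cutoff=3.0):
        """Heat-kernel memberships, exactly zero beyond `cutoff` (in units of sigma)."""
        ind = tree.query_radius(x[None, :], r=cutoff * sigma)[0]  # local neighbors only
        if len(ind) == 0:                      # poorly sampled region: fall back to nearest
            _, nearest = tree.query(x[None, :], k=1)
            ind = nearest[0]
        d2 = np.sum((centroids[ind] - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * sigma ** 2))   # heat kernel on the local neighborhood
        memberships = np.zeros(len(centroids))
        memberships[ind] = w / w.sum()         # normalized, so memberships sum to one
        return memberships

Usage would be tree = BallTree(centroids) once, then one call per datapoint.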

Avoid Combining Kinetically Separated Sets

Since this clustering is a step that precedes modeling the kinetics or further coarse-graining via something like PCCA+, it is absolutely necessary that two sets of points are not clustered together when there is a clear indication that they belong to two different metastable sets. Thus, apart from the closeness of datapoints, it may be useful to take the sequence of datapoints into account already in this step. Another method worth looking into is the approach by Bettina Keller (ETH Zürich; Hünfeld presentation/poster, contact her), who improves clustering by looking at local features; this seems to be a promising idea.
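One concrete way to use the sequence information: before merging two microstates that are geometrically close, check the transition counts observed between them along the trajectory at some lag time, and veto the merge if there is essentially no exchange. A sketch (the lag and the threshold are arbitrary choices):

    import numpy as np

    def transition_counts(dtraj, n_states, lag=10):
        """Count observed transitions i -> j at the given lag along one discrete trajectory."""
        C = np.zeros((n_states, n_states))
        for i, j in zip(dtraj[:-lag], dtraj[lag:]):
            C[i, j] += 1
        return C

    def merge_allowed(C, a, b, min_transitions=5):
        """Veto merging microstates a and b unless the data shows actual exchange."""
        return C[a, b] + C[b, a] >= min_transitions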

Unique Solution

In contrast to k-means, the solution (cluster assignment) should ideally be unique and thus reproducible. That means that if data from different trajectories are presented to the algorithm in different orders, the result should always be the same.
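A direct test of this requirement: feed the same set of trajectories to the algorithm in several shuffled orders and check that the resulting assignments agree up to a relabeling of clusters, e.g. via the adjusted Rand index. A sketch, where cluster_incrementally is a hypothetical stand-in for the algorithm under test:

    import numpy as np
    from sklearn.metrics import adjusted_rand_score

    def check_order_invariance(trajectories, cluster_incrementally, n_orders=5, seed=0):
        """Cluster shuffled orderings of the same data; results must agree up to relabeling.

        cluster_incrementally(trajs) is assumed to return one label array per
        trajectory, in the order the trajectories were passed in.
        """
        rng = np.random.default_rng(seed)
        reference = None
        for _ in range(n_orders):
            order = rng.permutation(len(trajectories))
            labels = cluster_incrementally([trajectories[i] for i in order])
            unshuffled = [None] * len(trajectories)
            for pos, traj_idx in enumerate(order):
                unshuffled[traj_idx] = labels[pos]   # undo the shuffle
            flat = np.concatenate(unshuffled)
            if reference is None:
                reference = flat
            elif adjusted_rand_score(reference, flat) < 1.0:
                return False   # assignments differ beyond a pure cluster relabeling
        return True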

Comments

It may be that not all of these requirements can be fulfilled. It is a good idea to try the effect of the different features one after the other, first using toy data and then real trajectory data. A good test is to use a self-made metastable dynamical system with known timescales to generate trajectory data; a good clustering algorithm is one that quickly identifies a clustering which allows the correct timescales to be recovered.
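A minimal instance of such a test system is a discrete Markov chain with a known transition matrix: its exact implied timescales $t_i = -\tau / \ln \lambda_i$ follow from the eigenvalues, and a clustering is good if a Markov model built on its microstates recovers them. A sketch with an arbitrary two-state example (here the "clustering" is trivially the true state, so this only demonstrates the timescale test itself):

    import numpy as np

    # Toy metastable system: a two-state Markov chain with known kinetics.
    T_true = np.array([[0.99, 0.01],
                       [0.02, 0.98]])
    lam = np.sort(np.linalg.eigvals(T_true).real)[::-1]
    t_true = -1.0 / np.log(lam[1])            # exact slowest implied timescale (tau = 1)

    # Generate a trajectory and re-estimate the timescale from transition counts.
    rng = np.random.default_rng(0)
    traj = np.zeros(200_000, dtype=int)
    for t in range(1, len(traj)):
        traj[t] = rng.choice(2, p=T_true[traj[t - 1]])

    C = np.zeros((2, 2))
    for i, j in zip(traj[:-1], traj[1:]):
        C[i, j] += 1
    T_est = C / C.sum(axis=1, keepdims=True)  # row-normalized count matrix
    lam_est = np.sort(np.linalg.eigvals(T_est).real)[::-1]
    t_est = -1.0 / np.log(lam_est[1])

    print(f"true timescale: {t_true:.1f} steps, estimated: {t_est:.1f} steps")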