Visual Analytics and Imaging Laboratory (VAI Lab)
Computer Science Department, Stony Brook University, NY

GPU-Accelerated Incremental Correlation Clustering of Large Data with Visual Feedback

Eric Papenhausen, Bing Wang, Sungsoo Ha, Alla Zelenyuk, Dan Imre, Klaus Mueller

Abstract: Clustering is an important preparation step in big data processing. It can also be used to detect redundant data points as well as outliers. Eliminating redundant data and duplicates is a viable means of data reduction, and it can also aid in sampling. Visual feedback is very valuable here because it gives users confidence in the process. However, big data preprocessing is seldom interactive, which conflicts with users' desire for immediate answers. The best one can do is incremental preprocessing, in which partial and hopefully quite accurate results become available relatively quickly and are then refined over time. We propose a correlation clustering framework that uses MDS for layout and GPU acceleration to accomplish these goals. Our domain application is the correlation clustering of atmospheric mass spectrum data with 8 million data points of 450 dimensions each.
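
Illustrative sketch (not the paper's implementation): the idea of incremental correlation clustering, where partial results become available after each batch and are refined as more data arrives, can be approximated on the CPU with a simple leader-style scheme. Everything named here is an assumption made for illustration: the function names, the Pearson-correlation threshold of 0.9, and the running-mean cluster centers. The MDS layout and the GPU acceleration described in the abstract are not shown.

```python
import numpy as np

def pearson_to_centers(x, centers):
    """Pearson correlation between vector x and each row of centers."""
    xc = x - x.mean()
    cc = centers - centers.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(xc) * np.linalg.norm(cc, axis=1)
    # Constant vectors have zero variance; treat their correlation as 0.
    denom = np.where(denom == 0, np.inf, denom)
    return (cc @ xc) / denom

def incremental_correlation_clustering(batches, threshold=0.9):
    """Hypothetical leader-style clustering, one batch at a time.

    Yields (labels for the batch, current cluster centers) after every
    batch, so a caller can display partial results while later batches
    are still being processed.
    """
    sums = []    # per-cluster running sum of member vectors
    counts = []  # per-cluster member counts
    for batch in batches:
        labels = []
        for x in batch:
            if sums:
                centers = np.array(sums) / np.array(counts)[:, None]
                corr = pearson_to_centers(x, centers)
                best = int(np.argmax(corr))
                if corr[best] >= threshold:
                    # Sufficiently correlated: join the existing cluster.
                    sums[best] = sums[best] + x
                    counts[best] += 1
                    labels.append(best)
                    continue
            # No sufficiently correlated cluster yet: start a new one.
            sums.append(x.astype(float))
            counts.append(1)
            labels.append(len(sums) - 1)
        yield np.array(labels), np.array(sums) / np.array(counts)[:, None]
```

In this sketch a caller would iterate over the generator and redraw the layout after each yielded result. The per-point correlation against all centers is the part that would most plausibly be moved to the GPU, although how the authors actually partition the work is not stated on this page.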

Teaser: As shown below, the relevant clusters already emerge relatively early in the iterative redundancy clustering process. The rotations are due only to the repeated MDS layout process. Eliminating these rotations is the subject of future work.

Teaser Image

Paper: B. Wang, P. Ruchikachorn, K. Mueller, “GPU-Accelerated Incremental Correlation Clustering of Large Data with Visual Feedback,” The First IEEE Workshop on Big Data Visualization, Santa Clara, CA, October 2013. (pdf, ppt)

Funding: NSF grant IIS-1117132