Skip to main content
Skip main navigation
No Access

Iterative statistical kernels on contemporary GPUs

Published Online:pp 58-77https://doi.org/10.1504/IJCSE.2013.052118

We present a study of OpenCL implementations of three important kernels that occur frequently in iterative statistical applications: multi-dimensional scaling (MDS), PageRank and K-means clustering. We evaluated their performance on NVIDIA Tesla and Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. We explored the optimisation of these kernels by four main techniques: 1) caching invariant data in GPU memory across iterations; 2) selectively placing data in different memory levels; 3) rearranging data in memory; 4) dividing the work between the GPU and the CPU. We also implemented a novel algorithm for MDS and a novel data layout scheme for PageRank. Our optimisations resulted in performance improvements of up to 5× to 6×, compared to naïve OpenCL implementations and up to 100× improvement over single-core CPU. We believe that these categories of optimisations are also applicable to other similar kernels.

Keywords

GPUs, OpenCL, multi-dimensional scaling, MDS, PageRank, K-means clustering, iterative statistical applications, cloud GPUs, sparse matrix-vector multiplication, computational science, Amazon EC2 GPU instances

References

  • 1. Bae, S-H. , Choi, J.Y. , Qiu, J. , Fox, G.C. (2010). ‘Dimension reduction and visualization of large high-dimensional data via interpolation’. in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC). 203-214 Google Scholar
  • 2. Barabási, A-L. , Albert, R. ‘Emergence of scaling in random networks’. Science. 1999, 10, 286, 5439, 509-512 Google Scholar
  • 3. Baskaran, M.M. , Ramanujam, J. , Sadayappan, P. (2010). ‘Automatic C-to-CUDA code generation for affine programs’. in Proceedings of the 19th International Conference on Compiler Construction (CC). 244-263 Google Scholar
  • 4. Bell, N. , Garland, M. ‘Efficient sparse matrix-vector multiplication on CUDA’. 2008, 12, NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004 Google Scholar
  • 5. Borg, I. , Groenen, P.J.F. (2005). Modern Multidimensional Scaling: Theory and Applications. 2nd ed., Springer, Statistics Google Scholar
  • 6. Brin, S. , Page, L. ‘The anatomy of a large-scale hypertextual Web search engine’. in Proceedings of the Seventh International World Wide Web Conference, Computer Networks and ISDN Systems. 1998, 04, 30, 107-117 Google Scholar
  • 7. de Leeuw, J. (1988). ‘Convergence of the majorization method for multidimensional scaling’. Journal of Classification. 5, 2, 163-180 Google Scholar
  • 8. Dhanasekaran, B. , Rubin, N. (2011). ‘A new method for GPU based irregular reductions and its application to K-means clustering’. in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4). Google Scholar
  • 9. Ekanayake, J. , Li, H. , Zhang, B. , Gunarathne, T. , Bae, S-H. , Qiu, J. , Fox, G. (2010). ‘Twister: a runtime for iterative MapReduce’. in Proceedings of the 19th ACM International Symposium on High Performance and Distributed Computing (HPDC). 810-818 Google Scholar
  • 10. Gunarathne, T. , Zhang, B. , Wu, T-L. , Qiu, J. ‘Portable parallel programming on cloud and HPC: scientific applications of Twister4Azure’. 2011, 12, in 2011 Fourth IEEE International Conference on Utility and Cloud Computing (UCC), 97-104 Google Scholar
  • 11. Halfhill, T.R. ‘Parallel processing with CUDA’. 2008, 01, Microprocessor Report Google Scholar
  • 12. Hong-tao, B. , Li-li, H. , Dan-tong, O. , Zhan-shan, L. , He, L. (2009). ‘K-means on commodity GPUs with CUDA’. in Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering. Google Scholar
  • 13. Ingram, S. , Munzner, T. , Olano, M. (2009). ‘Glimmer: multilevel MDS on GPU’. IEEE Transactions on Visualization and Computer Graphics. 15, 2, 249-261 Google Scholar
  • 14. Johansson, M. , Winter, O. ‘General purpose computing on graphics processing units using OpenCL’. 2010, 06, Sweden:Chalmers University of Technology, Department of Computer Science and Engineering Göteborg , Masters thesis Google Scholar
  • 15. Khronos OpenCL Working Group (2010). The OpenCL Specification 1.1’. (accessed on January 2012), available at http://www.khrono.org/ opencl/ Google Scholar
  • 16. Kohlhoff, K.J. , Sosnick, M.H. , Hsu, W.T. , Pande, V.S. , and Alteman, R.B. (2011). ‘CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms’. Bioinformatics. 27, 16, 2321-2322 Google Scholar
  • 17. Kruskal, J.B. , Wish, M. (1978). Multidimensional Scaling. Beverly Hills, CA:Sage Publications Inc. Google Scholar
  • 18. Lee, S. , Eigenmann, R. (2010). ‘OpenMPC: Extended OpenMP programming and tuning for GPUs’. in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). Google Scholar
  • 19. Lee, S. , Min, S-J. , Eigenmann, R. (2009). ‘OpenMP to GPGPU: a compiler framework for automatic translation and optimization’. in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). Google Scholar
  • 20. Li, Y. , Zhao, K. , Chu, X. , Liu, J. (2010). ‘Speeding up K-means algorithm by GPUs’. in Proceedings of the 2010 IEEE 10th International Conference on Computer and Information Technology (CIT). 115-122 Google Scholar
  • 21. Lloyd, S.P. ‘Least squares quantization in PCM’. IEEE Transactions on Information Theory. 1982, 03, 28, 2, 129-137 Google Scholar
  • 22. NVIDIA NVIDIA CUDA Compute Unified Device Architecture: Programming Guide. 2007, 11, version 1.1 ed. Google Scholar
  • 23. Pennycook, S.J. , Hammond, S.D. , Jarvis, S.A. , Mudalige, G. (2010). ‘Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark’. in Proceedings of the First International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS). Google Scholar
  • 24. Ryoo, S. , Rodrigues, C.I. , Baghsorkhi, S.S. , Stone, S.S. , Kirk, D.B. , Hwu, W.m.W. (2008). ‘Optimization principles and application performance evaluation of a multithreaded GPU using CUDA’. in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 73-82 Google Scholar
  • 25. Shalom, S.A.A. , Dash, M. , Tue, M. (2008). ‘Efficient K-means clustering using accelerated graphics processors’. in Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery (DaWak), Lecture Notes in Computer Science. 5182, 166-175 Google Scholar
  • 26. (accessed on January 2012) SNAP, Stanford Network Analysis Project, available at http://snap.stanford.edu/ Google Scholar
  • 27. Valiant, L.G. , Reeve, M. (1989). ‘Bulk-synchronous parallel computers’. Parallel Processing and Artificial Intelligence. John Wiley & Sons, 15-22 Google Scholar
  • 28. Wu, R. , Zhang, B. , Hsu, M. (2009). ‘GPU-accelerated large scale analytics’. Hewlett Packard Ltd, Technical Report HPL-2009-38 Google Scholar
  • 29. Wu, T. , Wang, B. , Shan, Y. , Yan, F. , Wang, Y. , Xu, N. (2010). ‘Efficient PageRank and SpMV computation on AMD GPUs’. in Proceedings of the 39th International Conference on Parallel Processing (ICPP). 81-89 Google Scholar
  • 30. Yang, X. , Parthasarathy, S. , Sadayappan, P. (2011). ‘Fast sparse matrix-vector multiplication on GPUs: implications for graph mining’. Proceedings of the VLDB Endowment. 4, 4, 231-242 Google Scholar
  • 31. Zhou, H. , Lange, K. , Suchard, M.A. (2010). ‘Graphics processing units and high-dimensional optimization’. Statistical Science. 25, 3, 311-324 Google Scholar