Abstract
We present a study of OpenCL implementations of three important kernels that occur frequently in iterative statistical applications: multi-dimensional scaling (MDS), PageRank and K-means clustering. We evaluated their performance on NVIDIA Tesla and Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. We explored the optimisation of these kernels by four main techniques: 1) caching invariant data in GPU memory across iterations; 2) selectively placing data in different memory levels; 3) rearranging data in memory; 4) dividing the work between the GPU and the CPU. We also implemented a novel algorithm for MDS and a novel data layout scheme for PageRank. Our optimisations resulted in performance improvements of up to 5× to 6×, compared to naïve OpenCL implementations and up to 100× improvement over single-core CPU. We believe that these categories of optimisations are also applicable to other similar kernels.
Keywords
References
- 1. (2010). ‘Dimension reduction and visualization of large high-dimensional data via interpolation’. in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC). 203-214 Google Scholar
- 2. ‘Emergence of scaling in random networks’. Science. 1999, 10, 286, 5439, 509-512 Google Scholar
- 3. (2010). ‘Automatic C-to-CUDA code generation for affine programs’. in Proceedings of the 19th International Conference on Compiler Construction (CC). 244-263 Google Scholar
- 4. ‘Efficient sparse matrix-vector multiplication on CUDA’.
2008,
12,
NVIDIA Corporation,
NVIDIA Technical Report NVR-2008-004 Google Scholar - 5. (2005). Modern Multidimensional Scaling: Theory and Applications. 2nd ed., Springer, Statistics Google Scholar
- 6. ‘The anatomy of a large-scale hypertextual Web search engine’. in Proceedings of the Seventh International World Wide Web Conference, Computer Networks and ISDN Systems. 1998, 04, 30, 107-117 Google Scholar
- 7. (1988). ‘Convergence of the majorization method for multidimensional scaling’. Journal of Classification. 5, 2, 163-180 Google Scholar
- 8. (2011). ‘A new method for GPU based irregular reductions and its application to K-means clustering’. in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4). Google Scholar
- 9. (2010). ‘Twister: a runtime for iterative MapReduce’. in Proceedings of the 19th ACM International Symposium on High Performance and Distributed Computing (HPDC). 810-818 Google Scholar
- 10. ‘Portable parallel programming on cloud and HPC: scientific applications of Twister4Azure’.
2011,
12,
in 2011 Fourth IEEE International Conference on Utility and Cloud Computing (UCC) , 97-104 Google Scholar - 11. ‘Parallel processing with CUDA’. 2008, 01, Microprocessor Report Google Scholar
- 12. (2009). ‘K-means on commodity GPUs with CUDA’. in Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering. Google Scholar
- 13. (2009). ‘Glimmer: multilevel MDS on GPU’. IEEE Transactions on Visualization and Computer Graphics. 15, 2, 249-261 Google Scholar
- 14. ‘General purpose computing on graphics processing units using OpenCL’. 2010, 06, Sweden:Chalmers University of Technology, Department of Computer Science and Engineering Göteborg , Masters thesis Google Scholar
- 15. (2010).
The OpenCL Specification 1.1’.
(accessed on January 2012) , available at http://www.khrono.org/ opencl/ Google Scholar - 16. (2011). ‘CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms’. Bioinformatics. 27, 16, 2321-2322 Google Scholar
- 17. (1978). Multidimensional Scaling. Beverly Hills, CA:Sage Publications Inc. Google Scholar
- 18. (2010). ‘OpenMPC: Extended OpenMP programming and tuning for GPUs’. in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). Google Scholar
- 19. (2009). ‘OpenMP to GPGPU: a compiler framework for automatic translation and optimization’. in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). Google Scholar
- 20. (2010). ‘Speeding up K-means algorithm by GPUs’. in Proceedings of the 2010 IEEE 10th International Conference on Computer and Information Technology (CIT). 115-122 Google Scholar
- 21. ‘Least squares quantization in PCM’. IEEE Transactions on Information Theory. 1982, 03, 28, 2, 129-137 Google Scholar
- 22. NVIDIA CUDA Compute Unified Device Architecture: Programming Guide. 2007, 11, version 1.1 ed. Google Scholar
- 23. (2010). ‘Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark’. in Proceedings of the First International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS). Google Scholar
- 24. (2008). ‘Optimization principles and application performance evaluation of a multithreaded GPU using CUDA’. in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 73-82 Google Scholar
- 25. (2008). ‘Efficient K-means clustering using accelerated graphics processors’. in Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery (DaWak), Lecture Notes in Computer Science. 5182, 166-175 Google Scholar
- 26.
(accessed on January 2012) SNAP, Stanford Network Analysis Project, available at http://snap.stanford.edu/ Google Scholar - 27. , Reeve, M. (1989). ‘Bulk-synchronous parallel computers’. Parallel Processing and Artificial Intelligence. John Wiley & Sons, 15-22 Google Scholar
- 28. (2009).
‘GPU-accelerated large scale analytics’.
Hewlett Packard Ltd,
Technical Report HPL-2009-38 Google Scholar - 29. (2010). ‘Efficient PageRank and SpMV computation on AMD GPUs’. in Proceedings of the 39th International Conference on Parallel Processing (ICPP). 81-89 Google Scholar
- 30. (2011). ‘Fast sparse matrix-vector multiplication on GPUs: implications for graph mining’. Proceedings of the VLDB Endowment. 4, 4, 231-242 Google Scholar
- 31. (2010). ‘Graphics processing units and high-dimensional optimization’. Statistical Science. 25, 3, 311-324 Google Scholar