Type of Document: Dissertation
Author: Belgin, Mehmet
Author's Email Address: firstname.lastname@example.org
URN: etd-12242010-124006
Title: Structure-based Optimizations for Sparse Matrix-Vector Multiply
Degree: PhD
Department: Computer Science
Advisory Committee:
- Back, Godmar V. (Committee Co-Chair)
- Ribbens, Calvin J. (Committee Co-Chair)
- Cameron, Kirk W. (Committee Member)
- Gugercin, Serkan (Committee Member)
- Sandu, Adrian (Committee Member)
Keywords:
- Code Generators
- Matrix Vector Multiply
- thread pool
- parallel SpMV
Date of Defense: 2010-12-14
Availability: unrestricted
Abstract
This dissertation introduces two novel techniques, OSF and PBR, to improve the performance of Sparse Matrix-Vector Multiply (SMVM) kernels, which dominate the runtime of iterative solvers for systems of linear equations. SMVM computations that use sparse formats typically achieve only a small fraction of peak CPU speeds because they are memory bound due to their low flops:byte ratio, access memory irregularly, and exhibit poor instruction-level parallelism (ILP) due to inefficient pipelining. We particularly focus on improving the flops:byte ratio, which is the main limiter on performance, by exploiting recurring structures or sub-structures in matrices. Our techniques also support micro-architecture level optimizations to further improve performance.
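To make the flops:byte argument concrete, the following is a minimal Python sketch of a baseline CSR kernel together with a rough ratio estimate. It is not taken from the dissertation: the function names, the 8-byte value and 4-byte index sizes, and the optimistic assumption that the input and output vectors stay in cache are all illustrative assumptions.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access x[col_idx[k]] is the irregular memory access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def csr_flops_per_byte(nnz, n_rows, value_bytes=8, index_bytes=4):
    """Rough estimate; assumes x and y stay in cache (an optimistic assumption)."""
    flops = 2 * nnz  # one multiply and one add per nonzero
    bytes_moved = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
    return flops / bytes_moved


# Illustrative 3x3 matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
ratio = csr_flops_per_byte(nnz=5, n_rows=3)
```

Under these assumptions every nonzero costs two flops but at least twelve bytes of traffic, a third of which is index data. That index share is the part the two techniques below aim to reduce.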
Operation Stacking Framework (OSF) stacks problems in large ensemble computations, which run the same sparse kernel using an identical matrix structure, such that they share a single copy of the indexing information to significantly reduce memory bandwidth usage. OSF provides performance improvements of up to 1.94x on an AMD Opteron compared to the CSR method. We validate performance results using hardware event counters, which demonstrate significantly improved cache and pipeline utilization.
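The idea of sharing one copy of the indexing information across an ensemble can be sketched as follows. This is a simplified Python illustration, not the OSF implementation: the interleaved layout (`stacked_values[k][p]` holds nonzero `k` of problem `p`) and the function name are assumptions made for the example.

```python
def stacked_csr_spmv(row_ptr, col_idx, stacked_values, stacked_x):
    """Multiply P matrices that share one sparsity pattern by P vectors.

    stacked_values[k][p] is the k-th nonzero of problem p;
    stacked_x[j][p] is entry j of problem p's input vector.
    Returns y with y[i][p] = entry i of problem p's result.
    """
    n_rows = len(row_ptr) - 1
    n_problems = len(stacked_x[0])
    y = [[0.0] * n_problems for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # index loaded once, reused by every stacked problem
            for p in range(n_problems):
                y[i][p] += stacked_values[k][p] * stacked_x[j][p]
    return y


# Two problems sharing the structure [[a, b], [0, c]].
row_ptr = [0, 2, 3]
col_idx = [0, 1, 1]
stacked_values = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
stacked_x = [[1.0, 1.0], [1.0, 2.0]]
stacked_y = stacked_csr_spmv(row_ptr, col_idx, stacked_values, stacked_x)
```

With P stacked problems, each column index read is amortized over P multiply-add pairs, so the index traffic per flop shrinks roughly by a factor of P compared with running the CSR kernel P times.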
Pattern-based Representation (PBR) exploits recurring block nonzero patterns by generating custom code for each recurring block pattern. In this way, no indexing data for individual nonzero elements are read from memory, reducing the overall size of the indices by up to 98%. Our code generator emits highly tuned code that uses SSE vectorization and software prefetching. PBR accurately identifies a block size that achieves optimal or near-optimal performance using a multiple linear regression performance model. On recent multicore machines, PBR provides performance improvements of up to 3.4x sequentially and 5x in parallel, compared to the CSR method. The PBR library we provide converts matrices at runtime, allowing our method to be used as a drop-in replacement for existing methods. We compare PBR’s overhead relative to its benefits and show that PBR is beneficial for many applications that repeatedly call the SMVM kernel for the same matrix structure.
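The per-pattern code generation idea can be illustrated with a toy Python sketch. It is not the PBR library: the real generator emits tuned, SSE-vectorized code, whereas this example only builds a plain Python function per pattern with `exec`. The names `generate_kernel` and `pbr_spmv`, and the `groups` layout mapping each pattern to its block positions, are hypothetical.

```python
def generate_kernel(pattern):
    """Build a function specialised for one non-empty block nonzero pattern.

    pattern is a tuple of (row, col) offsets inside a block. The offsets are
    hard-coded into the generated source, so no per-element index is read
    from memory at run time.
    """
    lines = ["def kernel(vals, v0, x, x0, y, y0):"]
    for n, (r, c) in enumerate(pattern):
        lines.append(f"    y[y0 + {r}] += vals[v0 + {n}] * x[x0 + {c}]")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["kernel"]


def pbr_spmv(groups, vals, x, n_rows):
    """groups maps pattern -> list of (block_row, block_col, value_offset)."""
    y = [0.0] * n_rows
    for pattern, blocks in groups.items():
        kernel = generate_kernel(pattern)  # generated once per pattern
        for y0, x0, v0 in blocks:
            kernel(vals, v0, x, x0, y, y0)
    return y


# 4x4 matrix [[1,0,5,6],[0,2,0,0],[0,0,3,0],[0,0,0,4]] split into 2x2 blocks.
diagonal = ((0, 0), (1, 1))  # pattern that occurs twice
top_row = ((0, 0), (0, 1))   # pattern that occurs once
groups = {diagonal: [(0, 0, 0), (2, 2, 2)], top_row: [(0, 2, 4)]}
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pbr_y = pbr_spmv(groups, vals, [1.0, 2.0, 3.0, 4.0], n_rows=4)
```

Only one (row, column, value offset) triple is stored per block rather than one index per nonzero, which is the source of the index-size reduction described above. The more often a pattern recurs, the better the cost of generating its kernel is amortized.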
Filename: Belgin_Mehmet_D_2010.pdf
Size: 4.92 Mb
Approximate Download Time (Hours:Minutes:Seconds):
- 28.8 Modem: 00:22:45
- 56K Modem: 00:11:42
- ISDN (64 Kb): 00:10:14
- ISDN (128 Kb): 00:05:07
- Higher-speed Access: 00:00:26