Results April 21, 2006

Sparse Matrix Multiplication Using MPI

The code used in this test solves the system A*x = b when most of the values of matrix A are zero. The parallel standard used was MPI, and the code came from the Kokkos package from Sandia National Laboratories. The results are shown in Table 4, with the most interesting numbers shown in red. The analysis is at the bottom of this page.
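As a rough illustration of why sparsity matters, the sketch below multiplies a sparse matrix by a vector while visiting only the stored nonzero entries. It assumes a compressed sparse row (CSR) layout and a small hypothetical 3x3 matrix; the function name `csr_matvec` and the data are illustrative only and are not taken from the Kokkos package or from the benchmark code.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR-format sparse matrix by a dense vector x.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # Only the stored (nonzero) entries of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# Hypothetical 3x3 tridiagonal matrix (assumption, for illustration):
#   [ 2 -1  0]
#   [-1  2 -1]
#   [ 0 -1  2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]

y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
print(y)  # [1.0, 0.0, 1.0]
```

The work is proportional to the number of nonzeros rather than to NEQ squared, which is why small problems offer very little computation to spread across processes.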

 

MIOPS, by number of processes

NEQ              1         2         4         8        16        32
10         6.31175   0.22568  0.155431  0.150172  0.121329 0.0747725
100         27.362   2.34889   1.65617    1.5803   1.15587   0.60949
1000       29.2011   17.1269    14.154   14.7086   11.6795   5.97869
10000      30.3893   41.5359   62.9246   87.7779   85.0076   53.8697
100000     23.2974   36.6352    70.592   134.248   184.035   114.361
1000000    22.9532   36.8063    73.345   143.571   198.773   109.486

Speedup, by number of processes

NEQ              2         4         8        16        32
10            0.04      0.02      0.02      0.02      0.01
100           0.09      0.06      0.06      0.04      0.02
1000          0.59      0.48      0.50      0.40      0.20
10000         1.37      2.07      2.89      2.80      1.77
100000        1.57      3.03      5.76      7.90      4.91
1000000       1.60      3.20      6.25      8.66      4.77

Efficiency, by number of processes

NEQ              2         4         8        16        32
10            0.02      0.01      0.00      0.00      0.00
100           0.04      0.02      0.01      0.00      0.00
1000          0.29      0.12      0.06      0.02      0.01
10000         0.68      0.52      0.36      0.17      0.06
100000        0.79      0.76      0.72      0.49      0.15
1000000       0.80      0.80      0.78      0.54      0.15

Table 4: Sparse matrix multiplication results using MPI

 

Analysis

For the larger problems (NEQ of 100000 and 1000000), performance scales up to 16 processes and then decreases at 32 processes. The maximum performance is 198.77 MIOPS, obtained with 16 processes and NEQ equal to 1000000. For the smaller problems (NEQ of 1000 or less), every parallel run is slower than the serial run, which indicates that communication overhead dominates the computation. Given the shared memory nature of the machine, these results suggest that MPI libraries are not the best option.

We obtained a maximum speedup of 8.66: with 16 processes and NEQ equal to 1000000, the code runs 8.66 times faster than its serial version.
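The speedup and efficiency figures in Table 4 are consistent with the usual definitions: speedup is the MIOPS rate on p processes divided by the rate on 1 process, and efficiency is that speedup divided by p. The sketch below recomputes the NEQ = 1000000 row from the MIOPS values in the table; the function names are illustrative, and the definitions are an assumption inferred from the numbers rather than stated in the text.

```python
# MIOPS for NEQ = 1000000, keyed by number of processes (from Table 4).
miops = {1: 22.9532, 2: 36.8063, 4: 73.345, 8: 143.571, 16: 198.773, 32: 109.486}


def speedup(miops_by_procs, procs):
    """Rate on `procs` processes relative to the single-process rate."""
    return miops_by_procs[procs] / miops_by_procs[1]


def efficiency(miops_by_procs, procs):
    """Speedup per process; 1.0 would mean perfect scaling."""
    return speedup(miops_by_procs, procs) / procs


for p in (2, 4, 8, 16, 32):
    print(p, round(speedup(miops, p), 2), round(efficiency(miops, p), 2))
```

At 16 processes this gives a speedup of 8.66 and an efficiency of 0.54, matching the table.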

Maximum SWaP was obtained at 16 processes and NEQ equal to 1000000, considering a performance of 198.77 MIOPS, a space of 2 RU (rack units), and a power consumption of 300 watts.
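The page does not state how SWaP is computed. Assuming the commonly cited definition, SWaP = performance / (space x power), the figures above can be combined as in the sketch below. The formula and the function name `swap_metric` are assumptions, so the resulting value is only indicative.

```python
def swap_metric(performance, space_ru, power_watts):
    """SWaP under the assumed definition: performance / (space * power)."""
    return performance / (space_ru * power_watts)


# Figures quoted in the text: 198.77 MIOPS, 2 RU, 300 watts.
value = swap_metric(198.77, 2, 300)
print(round(value, 4))  # MIOPS per (RU * watt), under the assumed formula
```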