Results April 21, 2006

Dense Matrix Multiplication Using MPI

The code used in this test solves the system A*x = b for a dense matrix A, that is, one that is fully populated with values. The code is parallelized with MPI and was developed by Paul Sexton (see people section). The results are shown in Table 2, with the most interesting numbers shown in red. The analysis is at the bottom of this page.

 

MIOPS (columns: number of processes)
 NEQ         1         2         4         8        16        32
  32   49.7247   28.9596   26.0056   94.4531   59.0807   27.1289
  64   53.8673   66.4422   81.6322   270.828   240.072   139.239
 128   59.4152   99.8142   160.423   417.785   555.635    491.84
 256   62.6578   118.913   219.988   466.841   720.063   925.779
 512    63.496   125.016     243.6   493.243   848.346   1157.83
1024   61.7975   122.835   242.735   487.064   854.718   1236.56

Speedup (columns: number of processes)
 NEQ      2      4      8     16     32
  32   0.58   0.52   1.90   1.19   0.55
  64   1.23   1.52   5.03   4.46   2.58
 128   1.68   2.70   7.03   9.35   8.28
 256   1.90   3.51   7.45  11.49  14.78
 512   1.97   3.84   7.77  13.36  18.23
1024   1.99   3.93   7.88  13.83  20.01

Efficiency (columns: number of processes)
 NEQ      2      4      8     16     32
  32   0.29   0.13   0.24   0.07   0.02
  64   0.62   0.38   0.63   0.28   0.08
 128   0.84   0.68   0.88   0.58   0.26
 256   0.95   0.88   0.93   0.72   0.46
 512   0.98   0.96   0.97   0.84   0.57
1024   0.99   0.98   0.99   0.86   0.63

Table 2: Dense matrix multiplication results using MPI

 

Analysis

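Performance scales up to 32 processes for the larger problems: for NEQ of 256 and above, MIOPS increases with every doubling of the process count. For smaller problems it peaks earlier (at 8 processes for NEQ of 32 and 64, and at 16 processes for NEQ of 128) and then declines. This shows that the system behaves like a parallel machine that supports multiple processes. The maximum performance, 1236.56 MIOPS, is obtained with 32 processes and NEQ equal to 1024.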

We obtained a maximum speedup of 20: with 32 processes and NEQ equal to 1024, the code runs 20 times faster than its serial version. Also notable is that, with NEQ fixed at 1024, the speedup grows almost linearly with the number of processes up to 8 (efficiency 0.99), after which efficiency falls to 0.86 at 16 processes and 0.63 at 32.
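The speedup and efficiency figures in Table 2 can be reproduced from the MIOPS rows alone. The following is a minimal sketch, assuming that speedup is the ratio of the MIOPS rate on p processes to the rate on 1 process, and that efficiency is speedup divided by p. The page does not state these formulas, but they match the tabulated values. The names PROCS, MIOPS, speedup and efficiency are illustrative and are not taken from the benchmark code.

```python
# MIOPS values copied from Table 2, keyed by NEQ.
# Columns correspond to 1, 2, 4, 8, 16 and 32 processes.
PROCS = [1, 2, 4, 8, 16, 32]
MIOPS = {
    32:   [49.7247, 28.9596, 26.0056, 94.4531, 59.0807, 27.1289],
    64:   [53.8673, 66.4422, 81.6322, 270.828, 240.072, 139.239],
    128:  [59.4152, 99.8142, 160.423, 417.785, 555.635, 491.84],
    256:  [62.6578, 118.913, 219.988, 466.841, 720.063, 925.779],
    512:  [63.496, 125.016, 243.6, 493.243, 848.346, 1157.83],
    1024: [61.7975, 122.835, 242.735, 487.064, 854.718, 1236.56],
}


def speedup(neq, procs):
    """Assumed definition: rate on `procs` processes / rate on 1 process."""
    row = MIOPS[neq]
    return row[PROCS.index(procs)] / row[0]


def efficiency(neq, procs):
    """Assumed definition: speedup divided by the number of processes."""
    return speedup(neq, procs) / procs


if __name__ == "__main__":
    # Print one line per (NEQ, process count) pair, skipping the serial run.
    for neq in MIOPS:
        for p in PROCS[1:]:
            print(f"NEQ={neq:5d} p={p:2d} "
                  f"speedup={speedup(neq, p):6.2f} "
                  f"efficiency={efficiency(neq, p):5.2f}")
```

Because the inputs are rates rather than run times, the ratio is taken as parallel over serial; with run times it would be serial over parallel.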

Maximum SWaP was obtained at 32 processes and NEQ equal to 1024. This is based on a performance of 1236.56 MIOPS, a space of 2 RU (rack units), and a power consumption of 300 watts.
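The page does not give the formula or the resulting SWaP value. As a sketch only, assuming the commonly cited definition of SWaP as performance divided by the product of space and power, the three quantities quoted above combine as follows. The function name swap_rating is hypothetical.

```python
def swap_rating(performance, space_ru, power_watts):
    """Assumed SWaP definition: performance / (space * power).

    performance: benchmark rate (here MIOPS)
    space_ru:    rack units occupied
    power_watts: power consumption in watts
    """
    if space_ru <= 0 or power_watts <= 0:
        raise ValueError("space and power must be positive")
    return performance / (space_ru * power_watts)


if __name__ == "__main__":
    # Figures quoted in the analysis: 1236.56 MIOPS, 2 RU, 300 W.
    rating = swap_rating(1236.56, 2, 300)
    print(f"SWaP = {rating:.4f} MIOPS per (RU * watt)")
```

Under this assumed definition, space and power are fixed for the machine, so the configuration with the highest MIOPS also has the highest SWaP.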