Conclusions

Dense Matrix Multiplication

- We obtained a maximum performance of about 1200 MIOPS with both MPI and OpenMP; the peak measured was 1253.52 MIOPS. This is the highest performance we were able to obtain from the T2000.
- Performance scales up to 32 parallel tasks with both MPI and OpenMP, which suggests that the underlying hardware runs 32 parallel tasks independently. Although four threads run on a single core and share one integer unit, the pipelining of the unit hides the memory latency efficiently.
- We obtained a maximum speedup of about 20 with both MPI and OpenMP; the best value measured was 20.82.
- We obtained a SWaP of 2 with both MPI and OpenMP. This metric is useful for comparing the T2000 with other machines. We are continuing to work on this topic and will post new results soon.
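
The speedup figure above can be related to the ideal case with a short calculation. The sketch below, in Python, assumes the usual definitions (speedup as the run time with one task divided by the run time with p tasks, efficiency as speedup divided by p). The function names are illustrative and are not taken from the benchmark code.

```python
def speedup(t_serial, t_parallel):
    """Speedup: run time with one task divided by run time with p tasks."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(s, tasks):
    """Parallel efficiency: fraction of the ideal speedup (equal to tasks) achieved."""
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return s / tasks


# Best speedup reported above, measured on 32 parallel tasks.
best = efficiency(20.82, 32)
print(f"efficiency at 32 tasks: {best:.3f}")
```

Under these definitions, a speedup of 20.82 on 32 tasks corresponds to an efficiency of roughly 0.65.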

Sparse Matrix Multiplication

- We obtained a maximum performance of 200 MIOPS with MPI and 400 MIOPS with OpenMP. OpenMP delivers twice the performance of MPI, which reflects the shared-memory nature of the machine.
- With MPI, performance scales up to 16 processes and then decreases; with OpenMP, it scales up to 32 threads and then remains constant.
- We obtained a maximum speedup of 9 with MPI and 16 with OpenMP.
- The maximum SWaP was 0.33 with MPI and 0.66 with OpenMP.
- The results show the impact of communication overhead on performance. Because of the operations involved in sparse matrix multiplication, the processing units spend time synchronizing the computation. This overhead is responsible for the decrease in performance compared with dense matrix multiplication.
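
The SWaP figures for the dense and sparse cases can be cross-checked with a little arithmetic. The sketch below assumes Sun's definition of the metric, SWaP = performance / (space x power). The space and power values behind the measurements are not restated in this section, so the code only derives their product from the dense result (1200 MIOPS at a SWaP of 2) and reuses it for the sparse results. The function and variable names are illustrative.

```python
def swap(performance, space, power):
    """SWaP as assumed here: performance / (space * power)."""
    if space <= 0 or power <= 0:
        raise ValueError("space and power must be positive")
    return performance / (space * power)


# Product of space and power implied by the dense result: 1200 MIOPS at SWaP 2.
# The units are whatever the original measurements used.
space_times_power = 1200 / 2

# Reuse that product for the sparse results. Space is passed as 1 so that
# the power argument carries the whole product.
swap_mpi = swap(200, 1, space_times_power)
swap_openmp = swap(400, 1, space_times_power)
print(f"MPI: {swap_mpi:.2f}, OpenMP: {swap_openmp:.2f}")
```

With that implied product, the sparse results come out at about 0.33 and 0.67, in line with the reported 0.33 and 0.66.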

T2000

- The machine behaves like 32 independent parallel threads, so the T2000 can be regarded as a general-purpose parallel machine.
- The T2000 delivers a maximum of about 1200 MIOPS; the peak we measured was 1253.52 MIOPS.
- We obtained a speedup of 20. With 32 hardware threads, the ideal speedup would be 32, a limit that is extremely difficult to reach on most parallel computers. Even so, the result is impressive considering that four threads run on a single core and share one integer unit.
- The machine performs best with OpenMP, since the T2000 is essentially a shared-memory architecture.
- Startup and software installation were smooth. The documentation provided by Sun is clear, and we could reach good results in a short period of time.
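
The speedup argument above can also be put in per-core terms. The sketch below uses only figures stated in this section (32 hardware threads, four threads per core, best measured speedup of 20.82) and assumes the speedup baseline is a single thread running on one core. The variable names are illustrative.

```python
HARDWARE_THREADS = 32
THREADS_PER_CORE = 4
BEST_SPEEDUP = 20.82  # best speedup reported for dense matrix multiplication

# Number of cores implied by the thread counts.
cores = HARDWARE_THREADS // THREADS_PER_CORE

# Average throughput gain per core relative to the single-thread baseline.
speedup_per_core = BEST_SPEEDUP / cores

print(f"cores: {cores}")
print(f"speedup per core: {speedup_per_core:.2f}")
```

Under that assumption, each core, with its single integer unit, delivers roughly 2.6 times the throughput of one thread.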

Organization of Data

The detailed results, together with an explanation of the concepts used, can be found in the results section.