The table below shows the scaling of the VASP.4 code on the T3D. The test system is l-Fe in a cell containing 64 atoms; only the Gamma point was used, the number of plane waves is 12500, and the number of included bands is 384.
| cpu's | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|
| NPAR | 2 | 4 | 4 | 8 | 8 | 16 |
| POTLOK: | 11.72 | 5.96 | 2.98 | 1.64 | 0.84 | 0.44 |
| SETDIJ: | 4.52 | 2.11 | 1.17 | 0.61 | 0.36 | 0.24 |
| EDDIAG: | 73.51 | 35.45 | 19.04 | 10.75 | 5.84 | 3.63 |
| RMM-DIIS: | 206.09 | 102.80 | 52.32 | 28.43 | 13.87 | 6.93 |
| ORTHCH: | 22.39 | 8.67 | 4.52 | 2.40 | 1.53 | 0.99 |
| DOS: | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| LOOP: | 319.07 | 155.42 | 80.26 | 44.04 | 22.53 | 12.39 |
| efficiency (%) | | 100 | 99 | 90 | 90 | 80 |
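
The last row of the table is consistent with the parallel efficiency of the LOOP timings relative to the 4-cpu run. The following is a minimal sketch for recomputing it, assuming that interpretation; the names `loop_time` and `efficiency` are illustrative and not part of VASP. The recomputed values agree with the row only roughly (for example about 88.5 for 64 cpu's), so the tabulated numbers appear to be rounded.

```python
# LOOP timings copied from the table above, keyed by the number of cpu's.
loop_time = {4: 319.07, 8: 155.42, 16: 80.26, 32: 44.04, 64: 22.53, 128: 12.39}


def efficiency(times, base=4):
    """Parallel efficiency in percent, relative to the run on `base` cpu's.

    Assumption: ideal scaling means time * cpu's stays constant.
    """
    t0 = times[base]
    return {n: 100.0 * (t0 * base) / (t * n) for n, t in times.items()}


eff = efficiency(loop_time)
for n in sorted(eff):
    speedup = loop_time[4] / loop_time[n]
    print(f"{n:4d} cpu's: speedup {speedup:6.2f}, efficiency {eff[n]:5.1f} %")
```

A value slightly above 100 for 8 cpu's simply means that the 8-cpu run is a little more than twice as fast as the 4-cpu reference.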
The main problem with the current algorithm is the subspace
rotation. The subspace rotation requires the diagonalization of
a relatively small matrix (in this case $384 \times 384$, i.e. number of bands times number of bands), and
this step scales badly on a massively parallel
machine. VASP currently uses either ScaLAPACK or a fast
Jacobi matrix diagonalization scheme written by Ian Bush. On 64
nodes the Jacobi scheme requires around 1 sec to diagonalize the matrix,
but increasing the number of nodes does not improve the timing.
ScaLAPACK needs at least 2 sec and already reaches this performance
with 16 nodes.
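
To illustrate why a diagonalization step that does not speed up becomes the bottleneck, here is a rough Amdahl-style sketch. It assumes that the roughly 1 sec Jacobi diagonalization is contained in the LOOP time, that the table timings are in seconds, and that everything else scales ideally beyond 64 nodes. None of these assumptions is confirmed by the measurements above, and the function names are purely illustrative.

```python
T_LOOP_64 = 22.53  # LOOP time on 64 nodes, taken from the table (seconds assumed)
T_DIAG = 1.0       # approximate Jacobi diagonalization time on 64 nodes (from the text)


def projected_loop_time(nodes, t_loop_ref=T_LOOP_64, t_diag=T_DIAG, ref_nodes=64):
    """Hypothetical estimate: t_diag stays fixed, the remainder scales as 1/nodes."""
    scalable = t_loop_ref - t_diag
    return t_diag + scalable * ref_nodes / nodes


def diag_share(nodes):
    """Fraction of the projected LOOP time spent in the fixed diagonalization."""
    return T_DIAG / projected_loop_time(nodes)


for nodes in (64, 128, 256, 512, 1024):
    t = projected_loop_time(nodes)
    print(f"{nodes:5d} nodes: projected LOOP {t:6.2f} s, "
          f"diagonalization share {100 * diag_share(nodes):5.1f} %")
```

Under these assumptions the share of the fixed step grows steadily with the node count, which is the sense in which the subspace rotation limits scaling.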