
3.11 Performance of parallel code on T3D

The table below shows the scaling of the VASP.4 code on the T3D. The system is l-Fe in a cell containing 64 atoms; only the Gamma point was used, the number of plane waves is 12500, and 384 bands were included.

  cpus              4        8       16       32       64      128
  NPAR              2        4        4        8        8       16
  POTLOK:       11.72     5.96     2.98     1.64     0.84     0.44
  SETDIJ:        4.52     2.11     1.17     0.61     0.36     0.24
  EDDIAG:       73.51    35.45    19.04    10.75     5.84     3.63
  RMM-DIIS:    206.09   102.80    52.32    28.43    13.87     6.93
  ORTHCH:       22.39     8.67     4.52     2.40     1.53     0.99
  DOS:           0.00     0.00     0.00     0.00     0.00     0.00
  LOOP:        319.07   155.42    80.26    44.04    22.53    12.39
  efficiency             100 %     99 %     90 %     90 %     80 %
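
The efficiency figures are consistent with taking the 4-node run as the reference point (an assumption; the reference is not stated here). A minimal Python sketch that recomputes them from the LOOP timings:

  # Approximate parallel efficiency, assuming the 4-node LOOP time as reference.
  loop = {4: 319.07, 8: 155.42, 16: 80.26, 32: 44.04, 64: 22.53, 128: 12.39}

  reference = 4 * loop[4]                        # CPU seconds of the 4-node run
  for nodes in sorted(loop):
      eff = reference / (nodes * loop[nodes])    # 1.0 = perfect scaling from 4 nodes
      print(f"{nodes:4d} nodes: LOOP = {loop[nodes]:7.2f} s, efficiency = {100 * eff:5.1f} %")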

The main problem with the current algorithm is the subspace rotation, which requires the diagonalization of a relatively small matrix (in this case 384×384); this step scales badly on a massively parallel machine. VASP currently uses either scaLAPACK or a fast Jacobi matrix diagonalization scheme written by Ian Bush. On 64 nodes the Jacobi scheme requires around 1 second to diagonalize the matrix, but increasing the number of nodes does not improve the timing. scaLAPACK needs at least 2 seconds and already reaches this performance with 16 nodes.
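
For illustration only, here is a serial Python sketch of the classical Jacobi diagonalization idea; this is not the parallel routine by Ian Bush used in VASP, only the underlying plane-rotation scheme, and the function name and tolerances are arbitrary:

  import numpy as np

  def jacobi_diagonalize(a, tol=1.0e-12, max_rot=100000):
      """Classical Jacobi scheme for a real symmetric matrix: repeatedly
      zero the largest off-diagonal element with a plane rotation."""
      a = np.array(a, dtype=float)
      n = a.shape[0]
      v = np.eye(n)
      for _ in range(max_rot):
          off = np.abs(a - np.diag(np.diag(a)))
          p, q = np.unravel_index(np.argmax(off), off.shape)
          if off[p, q] < tol:
              break
          # rotation angle that annihilates a[p, q]
          theta = 0.5 * np.arctan2(2.0 * a[p, q], a[q, q] - a[p, p])
          c, s = np.cos(theta), np.sin(theta)
          g = np.eye(n)
          g[p, p] = g[q, q] = c
          g[p, q], g[q, p] = s, -s
          a = g.T @ a @ g            # similarity transform, stays symmetric
          v = v @ g                  # accumulate eigenvectors
      return np.diag(a), v

  # quick self-check on a small random symmetric matrix
  h = np.random.rand(6, 6)
  h = 0.5 * (h + h.T)
  w, _ = jacobi_diagonalize(h)
  print(np.allclose(np.sort(w), np.linalg.eigvalsh(h)))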

