

Performance of parallel code on various machines

For historic reasons, we show the scaling of the VASP.4 code on the T3D. The system is l-Fe with a cell containing 64 atoms; only the $ \Gamma $ point was used, the number of plane waves was 12500, and the number of included bands was 384. All timings are given in seconds.

cpu's               4        8       16       32       64      128
NPAR                2        4        4        8        8       16
POTLOK:         11.72     5.96     2.98     1.64     0.84     0.44
SETDIJ:          4.52     2.11     1.17     0.61     0.36     0.24
EDDIAG:         73.51    35.45    19.04    10.75     5.84     3.63
RMM-DIIS:      206.09   102.80    52.32    28.43    13.87     6.93
ORTHCH:         22.39     8.67     4.52     2.40     1.53     0.99
DOS:             0.00     0.00     0.00     0.00     0.00     0.00
------------------------------------------------------------------
LOOP:          319.07   155.42    80.26    44.04    22.53    12.39
$t/t_{opt}$              100 %     99 %     90 %     90 %     80 %
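The efficiency row can be reproduced from the LOOP timings. The following minimal sketch (plain Python, purely illustrative and not part of VASP) computes the parallel efficiency relative to ideal scaling from the 4-CPU run; the results agree with the tabulated percentages to within a few per cent (the 8-CPU run is in fact slightly superlinear).

  # Parallel efficiency of the T3D benchmark, relative to the 4-CPU run.
  # The LOOP times (in seconds) are taken from the table above.
  cpus = [4, 8, 16, 32, 64, 128]
  loop = [319.07, 155.42, 80.26, 44.04, 22.53, 12.39]

  ref_cpus, ref_time = cpus[0], loop[0]
  for n, t in zip(cpus, loop):
      efficiency = (ref_cpus * ref_time) / (n * t)
      print(f"{n:4d} CPUs: LOOP = {t:7.2f} s, efficiency = {efficiency:6.1%}")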

Figure 1: Scaling for a 256 Al system.
\includegraphics[width=9cm,clip=true]{origin_new.eps}
The main problem with the current algorithm is the subspace rotation. The subspace rotation requires the diagonalisation of a relatively small matrix (in this case $ 384 \times 384$), and this step scales badly on a massively parallel machine. VASP currently uses either ScaLAPACK or a fast Jacobi matrix diagonalisation scheme written by Ian Bush (T3D, T3E only). On 64 nodes, the Jacobi scheme requires around 1 second to diagonalise the matrix, but increasing the number of nodes does not improve this timing. ScaLAPACK requires at least 2 seconds, and it reaches this performance already with 16 nodes.
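To see why adding nodes helps so little here, note that the matrix is small: on a single modern core the dense diagonalisation of a $ 384 \times 384$ Hermitian matrix takes of the order of tens of milliseconds, so on many nodes the step is dominated by communication and latency rather than by floating-point work. A minimal sketch (NumPy, purely illustrative and not VASP code):

  # Time the diagonalisation of a dense 384x384 Hermitian matrix,
  # the size of the subspace-rotation matrix in the example above.
  import time
  import numpy as np

  n = 384                                   # number of bands
  rng = np.random.default_rng(0)
  a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
  h = (a + a.conj().T) / 2                  # Hermitian test matrix

  t0 = time.perf_counter()
  w, v = np.linalg.eigh(h)                  # eigenvalues and eigenvectors
  print(f"eigh({n}x{n}) took {time.perf_counter() - t0:.3f} s")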

Figure 2: Scaling of bench.PdO on a PC cluster with Gigabit ethernet.
\includegraphics[width=12cm,clip=true]{scalePdO_3.2G.eps}
Fig. 1 shows a more representative result on an SGI Origin 2000 for 256 Al atoms. Up to 32 nodes an efficiency of 0.8 is found. A similar efficiency can be expected on most current architectures with a large communication bandwidth (InfiniBand, Myrinet, SGI, etc.). On a Gigabit-ethernet based cluster (Fig. 2), you can expect an efficiency of up to 75 % on 16-32 cores.

Figure 3: Scaling for a 512 atom GaAs system. The $ \Gamma $ point only version was used and the total number of filled bands is 1024. The default plane wave cutoff of 208 eV was used. Other VASP settings are PREC = A; ISYM = 0; NELMDL = 5; NELM = 8; LREAL = A. The left panel shows the timing for RMM-DIIS (ALGO = V), the right for Davidson (ALGO = N). The time for the 7th SCF step is reported.
\includegraphics[width=16cm,clip=true]{GaAs.eps}
The final figure, Fig. 3, shows the scaling on an in-house state-of-the-art machine built by SGI (narwal). The nodes are linked by a QDR InfiniBand switch, and each node contains 8 cores (two Intel Xeon E5540 CPUs, 2.53 GHz). In this case, the RMM-DIIS algorithm shows a very good parallel efficiency of 65 % when going from 16 to 256 cores. For the Davidson algorithm, the parallel efficiency is only roughly 50 % from 16 to 256 cores.
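Put differently (taking the 16-core run as the reference, as stated above), going from 16 to 256 cores increases the core count by a factor of 16, so the achieved speedup over the 16-core run is roughly $ 0.65 \times (256/16) \approx 10$ for RMM-DIIS and $ 0.50 \times (256/16) = 8$ for Davidson, compared to the ideal factor of 16.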

