


Performance of parallel code on various machines

For historical reasons, we show the scaling of the VASP.4 code on the T3D. The system is l-Fe with a cell containing 64 atoms; only the $ \Gamma $ point was used, the number of plane waves was 12500, and the number of included bands was 384.

CPUs           4       8      16      32      64     128
NPAR           2       4       4       8       8      16
POTLOK:    11.72    5.96    2.98    1.64    0.84    0.44
SETDIJ:     4.52    2.11    1.17    0.61    0.36    0.24
EDDIAG:    73.51   35.45   19.04   10.75    5.84    3.63
RMM-DIIS: 206.09  102.80   52.32   28.43   13.87    6.93
ORTHCH:    22.39    8.67    4.52    2.40    1.53    0.99
DOS:        0.00    0.00    0.00    0.00    0.00    0.00
---------------------------------------------------------
LOOP:     319.07  155.42   80.26   44.04   22.53   12.39
$ t/t_{opt}$   -   100 %    99 %    90 %    90 %    80 %
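
The $ t/t_{opt}$ row is the parallel efficiency of the total LOOP time (in seconds) with respect to ideal scaling from the 4-CPU run. The explicit formula below is our reading of the table rather than something stated in the text, but it reproduces the quoted percentages; for example, for 128 CPUs:

\[
\frac{t}{t_{opt}}(N) = \frac{4\,t_{\rm LOOP}(4)}{N\,t_{\rm LOOP}(N)},
\qquad
\frac{t}{t_{opt}}(128) = \frac{4\times 319.07}{128\times 12.39} \approx 0.80 .
\]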

Figure 1: Scaling for a 256 Al system.
\includegraphics[width=9cm,clip=true]{origin_new.eps}
The main problem with the current algorithm is the subspace rotation. The subspace rotation requires the diagonalisation of a relatively small matrix (in this case $ 384 \times 384$), and this step scales badly on a massively parallel machine. VASP currently uses either scaLAPACK or a fast Jacobi matrix diagonalisation scheme written by Ian Bush (T3D, T3E only). On 64 nodes, the Jacobi scheme requires around 1 sec to diagonalise the matrix, but increasing the number of nodes does not improve the timing. scaLAPACK requires at least 2 seconds, and it reaches this performance already with 16 nodes.
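
Whether scaLAPACK is actually used for this step can normally be controlled from the INCAR file. The fragment below is only an illustrative sketch: the LSCALAPACK tag is an assumption on our part in the sense that its availability depends on the VASP version and on the binary having been built with scaLAPACK support.

! illustrative INCAR fragment -- availability depends on the VASP version/build
LSCALAPACK = .FALSE.   ! bypass scaLAPACK and use the serial (LAPACK) subspace diagonalisation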

Figure 2: Scaling of bench.PdO on a PC cluster with Gigabit ethernet.
\includegraphics[width=12cm,clip=true]{scalePdO_3.2G.eps}
Fig. 1 shows a more representative result, obtained on an SGI Origin 2000 for 256 Al atoms. Up to 32 nodes an efficiency of 0.8 is found. A similar efficiency can be expected on most current architectures with a large communication bandwidth (Infiniband, Myrinet, SGI, etc.). On a Gigabit ethernet based cluster (Fig. 2), an efficiency of up to 75 % can be expected for up to 16-32 cores.
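
On such clusters the parallelisation-related INCAR tags have a large influence on the achievable efficiency. The fragment below is a hedged starting point only; the tags and values follow the general recommendations for Gigabit-type networks, but they are assumptions as far as any specific system is concerned and should be tested.

LPLANE = .TRUE.    ! plane-wise distribution of the real-space FFT grids (reduces communication)
NPAR   = 4         ! bands are distributed over NPAR groups; NPAR approx. sqrt(number of cores) is a common choice
NSIM   = 4         ! number of bands optimised simultaneously in the blocked RMM-DIIS
LSCALU = .FALSE.   ! switch off the parallel LU decomposition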

Figure 3: Scaling for a 512 atom GaAs system. The $ \Gamma $ point only version was used and the total number of filled bands is 1024. The default plane wave cutoff of 208 eV was used. Other VASP settings are PREC = A ; ISYM = 0 ; NELMDL = 5 ; NELM = 8 ; LREAL = A . The left panel shows the timing for RMM-DIIS (ALGO = V), the right for Davidson (ALGO = N). The time for the 7th SCF step is reported.
\includegraphics[width=16cm,clip=true]{GaAs.eps}
The final figure, Fig. 3, shows the scaling on an in-house state-of-the-art machine built by SGI (narwal). The nodes are linked by a QDR Infiniband switch, and each node consists of 8 cores (two Intel(R) Xeon(R) E5540 CPUs, 2.53 GHz). In this case, the RMM-DIIS algorithm shows a very good parallel efficiency of 65 % from 16 to 256 cores. For the Davidson algorithm, the parallel efficiency is only roughly 50 % over the same range.
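
For reference, the settings listed in the caption of Fig. 3 correspond to an INCAR of roughly the following form. This is a sketch only: the comments are our annotations, ENCUT is left at its 208 eV default, and ALGO is switched between V and N for the two panels.

PREC   = A    ! "accurate" precision mode
ISYM   = 0    ! symmetry switched off
NELMDL = 5    ! non-selfconsistent (delay) steps at the start of the SCF
NELM   = 8    ! maximum number of electronic steps
LREAL  = A    ! real-space projection, automatic optimisation of the projectors
ALGO   = V    ! RMM-DIIS (left panel); ALGO = N selects the Davidson algorithm (right panel)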


