VASP currently offers parallelization (and data distribution) over bands and over plane wave coefficients (see also Section 4). To achieve high efficiency on massively parallel systems, it is strongly recommended to use both at the same time. The only algorithm that works with distribution over bands is the RMM-DIIS iterative matrix diagonalization (IALGO=48). The conjugate gradient band-by-band method (IALGO=8) is supported only for parallelization over plane wave coefficients.
NPAR tells VASP to switch on parallelization (and data distribution) over bands. The default is NPAR=1, which means distribution over plane wave coefficients only (IALGO=8 and IALGO=48 both work): all nodes work on each band. We suggest using this default setting only when running on a small number of nodes. For NPAR=(total number of nodes), each band is treated by only one node. This can improve performance on platforms with a small communication bandwidth, but it also increases the memory requirements considerably, because the non-local projector functions must then be stored on each node. In addition, a lot of communication is required to orthogonalize the bands. If NPAR is neither 1 nor equal to the number of nodes, the number of nodes working on one band is given by (total number of nodes)/NPAR.
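The relation between NPAR and the number of nodes sharing one band can be illustrated with a small Python sketch. It assumes that the nodes per band equal (total number of nodes)/NPAR, which matches the two limiting cases described above (NPAR=1 gives all nodes, NPAR=number of nodes gives one node). The function name nodes_per_band is hypothetical and not part of VASP, and the restriction to NPAR values that divide the node count evenly is an assumption of this sketch.

```python
def nodes_per_band(total_nodes: int, npar: int) -> int:
    """Return how many nodes share the work on a single band.

    Assumption: nodes per band = total_nodes / NPAR, consistent with
    NPAR=1 (all nodes on each band) and NPAR=total_nodes (one node per band).
    """
    if npar < 1 or npar > total_nodes:
        raise ValueError("NPAR must lie between 1 and the number of nodes")
    if total_nodes % npar != 0:
        # Assumption of this sketch: only even splits are considered.
        raise ValueError("this sketch requires NPAR to divide the number of nodes")
    return total_nodes // npar


if __name__ == "__main__":
    # Illustrative node count of 16; not taken from the text.
    for npar in (1, 4, 16):
        print(f"NPAR={npar}: {nodes_per_band(16, npar)} node(s) per band")
```

With 16 nodes, NPAR=4 would therefore leave 4 nodes working on each band, with 4 bands treated at the same time.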
The second switch that influences the data distribution is LPLANE. If LPLANE is set to .TRUE. in the INCAR file, the data distribution in real space is done plane-wise. Any combination of NPAR and LPLANE can be used. Generally, LPLANE=.TRUE. reduces the communication bandwidth required during the FFTs, but at the same time it unfortunately worsens the load balancing on massively parallel machines. LPLANE=.TRUE. should only be used if NGZ is at least 3*(number of nodes)/NPAR, and optimal load balancing is achieved if NGZ=n*NPAR, where n is an arbitrary integer. If LPLANE=.TRUE. and the real space projector functions (LREAL=.TRUE. or ON or AUTO) are used, it might be necessary to check the lines following
   real space projector functions
   total allocation   :
   max/ min on nodes  :

The max/min values should not differ too much, otherwise the load balancing might worsen as well.
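The two LPLANE conditions on NGZ stated above can be checked with a few lines of Python before a run. This is only a sketch of the arithmetic: the function name check_lplane and the returned keys are hypothetical, and the example values are illustrative rather than taken from a real calculation.

```python
def check_lplane(ngz: int, total_nodes: int, npar: int) -> dict:
    """Evaluate the two LPLANE=.TRUE. rules of thumb from the text.

    usable:   NGZ >= 3 * (number of nodes) / NPAR
    balanced: NGZ = n * NPAR for some integer n
    """
    if ngz < 1 or total_nodes < 1 or npar < 1:
        raise ValueError("all arguments must be positive integers")
    return {
        # Multiply through by NPAR to stay in integer arithmetic.
        "usable": ngz * npar >= 3 * total_nodes,
        "balanced": ngz % npar == 0,
    }


if __name__ == "__main__":
    # Illustrative values only.
    print(check_lplane(ngz=48, total_nodes=16, npar=4))
    print(check_lplane(ngz=10, total_nodes=16, npar=2))
```

A result with "usable" set to False suggests that LPLANE=.FALSE. is the safer choice for that grid and node count.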
The optimum setting of NPAR and LPLANE depends very much on the type of machine you are running on. Here are a few guidelines:
Usually one runs on a relatively small number of nodes, so that load balancing is not a problem. The communication bandwidth is also reasonably good on SGI Power Challenge machines. Best performance is often achieved with
   LPLANE = .TRUE.
   NPAR   = 1
   NSIM   = 1

Increasing NPAR usually worsens performance. For NPAR=1 we have in fact observed superlinear scaling with respect to the number of nodes in many cases. This is because the cache on the SGI Power Challenge machines is relatively large (4 Mbytes): as the number of nodes increases, the real space projectors (or reciprocal projectors) can be kept in the cache, and cache misses therefore decrease significantly.
   LPLANE = .TRUE.
   NPAR   = 4
   NSIM   = 4

Contrary to the SGI Power Challenge, superlinear scaling could not be observed, obviously because data locality and cache reuse are only of minor importance on the Origin 2000.
In summary, the following setting is recommended:
   LPLANE = .FALSE.
   NPAR   = sqrt(number of nodes)
   NSIM   = 1
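Since NPAR has to be an integer, the rule NPAR = sqrt(number of nodes) needs some rounding whenever the node count is not a perfect square. The following Python sketch shows one possible way to do this. It picks the divisor of the node count closest to the square root, so that every band is shared by the same number of nodes. Restricting the choice to divisors is an assumption of this sketch and is not stated in the text, and the function name suggest_npar is hypothetical.

```python
import math


def suggest_npar(total_nodes: int) -> int:
    """Pick an integer NPAR close to sqrt(total_nodes).

    Assumption: only divisors of total_nodes are considered, so that the
    nodes split evenly over the bands. Ties go to the smaller divisor.
    """
    if total_nodes < 1:
        raise ValueError("the number of nodes must be positive")
    target = math.sqrt(total_nodes)
    divisors = [d for d in range(1, total_nodes + 1) if total_nodes % d == 0]
    return min(divisors, key=lambda d: (abs(d - target), d))


if __name__ == "__main__":
    # Illustrative node counts only.
    for nodes in (16, 64, 12):
        print(f"{nodes} nodes -> NPAR = {suggest_npar(nodes)}")
```

The resulting value is only a starting point. As noted above, the optimum setting depends strongly on the machine, so a short benchmark with neighbouring NPAR values is advisable.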