Once F90 has been introduce it was much easier to do the parallelization of vasp. One structure at the heart of vasp is for instance the grid structure (which is required to describe 3-dimensional grids). Here is a slightly simplified version of the structure found in the mgrid.inc file:
TYPE grid_3d !only GRID INTEGER NGX,NGY,NGZ ! number of grid points in x,y,z INTEGER NPLWV ! total number of grid points INTEGER MPLWV ! allocation in complex words TYPE(layout) :: RC ! reciprocal space layout TYPE(layout) :: IN ! intermediate layout TYPE(layout) :: RL ! real space layout ! mapping for parallel version TYPE(grid_map) :: RC_IN ! recip -> intermeadiate comm. TYPE(grid_map) :: IN_RL ! intermeadiate -> real space comm. TYPE(communic), POINTER :: COMM ! opaque communicatorNGX, NGY, NGZ describes the number of grid points in x, y and z direction, and NPLWV the total number of points (i.e. NGX*NGY*NGZ). Most quantities (like charge densities) are defined on these 3-dimensional grids. In the sequential version NGX, NGY and NGZ were sufficient to perform a three dimensional FFT of quantities defined on these grids. In the parallel version the distribution of data among the processors must also be known. This is achieved with the structures RL and RC, which describe how data are distributed among processors in real and reciprocal space. In vasp data are distributed column wise on the nodes, in reciprocal space the fast index is the first (or x) index and and columns can be indexed by a pair (y,z). In real space the fast index is the z index, columns are indexed by the pair (z,y). In addition the FFT-routine (which performs lots of communication) stores all required setup data in two mapping-structures called RC_IN and IN_RL.
The big advantage of using structures instead of common blocks is that it is trivial to have more than one grid. For instance, vasp uses a coarse grid for the representation of the ultra soft wavefunctions and a second much finer grid for the representation of the augmentation charges. Therefore two grids are defined in vasp one is called GRID (used for the wavefunctions) and other one is called GRIDC (used for the augmentation charges). Actually a third grid exists which has in real space a similar distribution as GRID and in reciprocal space a similar distribution as GRIDC. This third grid (GRID_SOFT) is used to put the soft pseudo charge density onto the finer grid GRIDC.
Vasp currently offers parallelization over bands and parallelization over plane wave coefficients. To get a high efficiency it is strongly recommended to use both at the same time. The only algorithm which works with the over band distribution is the RMM-DIIS matrix diagonalizer (IALGO=48). Conjugate gradient band-by-band method (IALGO=8) is only supported for parallelization over plane wave coefficients.
Parallelization over bands and plane wave coefficients at the same time reduces the communication overhead significantly. To reach this aim a 2 dimensional cartesian communication topology is used in vasp:
node-id's 0 1 2 3 bands 1,5,9,... 4 5 6 7 bands 2,6,10,... 8 9 10 11 etc. 12 13 14 15Bands are distributed among a group of nodes in a round robin fashion, separate communication universe are set up for the communication within one band (in-band communication COMM_INB), and for inter-band communication (COMM_INTER). Communication within one in-band communication group (for instance 0-1-2-3) does not interfere with communication done within another group (i.e. 4-5-6-7). This can be achieved easily with MPI, but we have also implemented the required communication routines with T3D shmem communication.
Overall we have found a very good load balancing and an extremely good scaling in the band-by-band RMM-DIIS algorithm. For the re-orthogonalization and subspace rotation -- which is required from time to time -- the wavefunctions are redistributed from over bands to a over plane wave coefficient distribution. The communication in this part is by the way very small in comparison with the communication required in the FFT's. Nevertheless subspace rotation on massively parallel computer is currently still problematic, mainly because the diagonalization of the NBANDS NBANDS subspace-matrix is extremely slow.
There are some points which should be noted: Parallelization over plane waves means that the non local projection operators must be stored on each in-band-processor group (i.e. nodes 0-1-2-3 must store all real space projection operators). This means relatively high costs in terms of memory, and therefore parallelization over bands should not be done too excessively. Having for instance 64 nodes, we found that it is best to generate a 8 by 8 cartesian communicator. Mind also that the hard augmentation charges are always distributed over ALL nodes, even if parallelization over bands is selected. This was possible using the previously mentioned third grid GRID_SOFT, i.e. this third helper grid allows one to decouple the presentation of the augmentation and ultra soft part.