The benchmark numbers given here were measured with a benchmark designed to mimic the behavior of VASP. It consists of three parts: one measuring matrix-matrix performance (LINCOM-TPP), one measuring matrix-vector performance, and one measuring the performance of the 3d-FFT (FFTTEST). The mixture of the three parts is similar to what one would encounter in a simulation of a rather large system (60-100 transition metal atoms). For the matrix-matrix part DGEMM is used; for the matrix-vector part the result of DGEMV, do-loops or DGEMM is taken, depending on where the machine scores highest. FFTTEST tests the 3d-FFT and uses either an optimized routine supplied by the manufacturer or a routine written and optimized by J. Furthmüller.
It can be seen that no machine matches the IBM performance
for matrix-vector operations, whereas most
machines outperform the IBM on matrix-matrix operations.
In fact the IBM behaves like a small CRAY: all routines execute
almost exactly 4 times slower than on the CRAY.
In addition, IBM supplies highly optimized FFT routines;
as a matter of fact, only on IBM machines do the manufacturer-supplied
routines outperform the libraries optimized by J. Furthmüller.
| ||IBM RS6000||IBM RS6000||IBM RS6000||IBM RS6000||IBM RS6000|
|Lincom-TPP||40.6 s||42.7 s||25.0 s||21.4 s||17.8 s|
|matrix-vec||32.3 s||40.4 s||32.3 s||19.4 s||15.3 s|
|fft||31.4 s||35.0 s||24.0 s||17.3 s||14.4 s|
|TOTAL||103 s||117 s||81.3 s||58.3 s||47.5 s|
| ||Power C.||Origin||ev5/530||ev5/530||C 180||PII 400|
|Lincom-TPP||32.0 s||22.0 s||21.8 s||14.3 s||25.0 s||40 s|
|matrix-vec||90.2 s||31.0 s||42.0 s||48.8 s||40.0 s||44 s|
|fft||41.0 s||17.0 s||26.1 s||17.8 s||25.0 s||34 s|
|TOTAL||163 s||70 s||90 s||81 s||88 s||118 s|
| ||CRAY T3D||CRAY T3E||CRAY||CRAY||VPP|
|Lincom-TPP||99.5 s||25 s||12.0 s||53 s||7.1 s|
|matrix-vec||110.0 s||33 s||8.3 s||74 s||5.0 s|
|fft||174.0 s||42 s||6.9 s||43 s||5.4 s|
|TOTAL||400 s||100 s||27.2 s||170 s||17.5 s|
VASP.4.4, hardware data streaming enabled; bench.Hg runs on
4 nodes, all other data are per node.
System equipped with 2 (first value) or 4 (second value) memory boards.
The second value is for 4 nodes.
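The statement that the IBM runs roughly four times slower than the CRAY can be checked against the tables with a few lines of arithmetic. The sketch below is only an illustration: which IBM and which CRAY model the statement refers to is not stated, so the pairing used here (the first IBM RS6000 column and the CRAY column with a total of 27.2 s) is an assumption, and the helper name `slowdown` is made up for this example.

```python
# Timings in seconds, copied from the tables above.
# Assumption: first IBM RS6000 column vs. the CRAY column whose TOTAL is 27.2 s.
ibm = {"Lincom-TPP": 40.6, "matrix-vec": 32.3, "fft": 31.4}
cray = {"Lincom-TPP": 12.0, "matrix-vec": 8.3, "fft": 6.9}


def slowdown(slow, fast):
    """Return the ratio slow/fast for every test present in both columns."""
    return {name: slow[name] / fast[name] for name in slow if name in fast}


ratios = slowdown(ibm, cray)
for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:4.1f}x")
```

Other column pairs can be compared the same way by replacing the two dictionaries.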
You can test your own machine by compiling ffttest and dgemmtest in the vasp.4.X (X>3) directory and typing

    dgemmtest <lincom.table
    dgemmtest <rpro.table
    ffttest

This will run the tests "LINCOM-TPP", "matrix-vec" and "fft", in this order.
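If the three tests are to be run repeatedly, a small wrapper can be convenient. The following is a minimal sketch, not part of the VASP distribution: it assumes the executables `dgemmtest` and `ffttest` and the input files `lincom.table` and `rpro.table` are in the current directory, and the function names `time_command` and `run_all` are hypothetical. It records wall-clock time only; the timings printed by the test programs themselves are passed through unchanged and may differ from the wall-clock values.

```python
import subprocess
import time


def time_command(argv, stdin_path=None):
    """Run argv, optionally reading stdin from a file; return wall time in seconds."""
    start = time.perf_counter()
    if stdin_path is None:
        subprocess.run(argv, check=True)
    else:
        with open(stdin_path, "rb") as stdin_file:
            subprocess.run(argv, stdin=stdin_file, check=True)
    return time.perf_counter() - start


# The three parts in the order described above.
# Assumption: executables and tables live in the current directory.
TESTS = [
    ("LINCOM-TPP", ["./dgemmtest"], "lincom.table"),
    ("matrix-vec", ["./dgemmtest"], "rpro.table"),
    ("fft", ["./ffttest"], None),
]


def run_all(tests=TESTS):
    """Run every test once and return a dict of wall times plus their sum."""
    results = {name: time_command(argv, stdin) for name, argv, stdin in tests}
    results["TOTAL"] = sum(results.values())
    return results
```

Calling `run_all()` in the directory containing the compiled tests returns a dictionary with one entry per test and a `TOTAL` entry.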
The table also shows the timings achieved with the bench.Hg.tar benchmark, which is located on the VASP server. The timings are those written in the "LOOP+" line of the OUTCAR file (type: grep 'LOOP+' OUTCAR). Recent algorithmic improvements towards more memory locality (NSIM=4) make the matrix-vector part less important. Machines like the DEC Alphas or the SGI Power Challenge, which have a very fast or fast CPU, respectively, but a small memory bandwidth, benefit most from these improvements. In addition, for the bench.Hg benchmark the performance of the matrix-matrix part plays a more significant role than in the synthetic benchmark.
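The grep command above can also be mirrored in a short script when the timings are to be collected from several runs. This is a hedged sketch: the function name `loop_plus_times` is hypothetical, and because the exact column layout of the LOOP+ line is not given here, the code simply extracts every decimal number found on such a line. The sample line is illustrative only and is not taken from a real OUTCAR file.

```python
import re

# Matches plain decimal numbers such as 47.82.
_FLOAT = re.compile(r"\d+\.\d+")


def loop_plus_times(outcar_text):
    """Return one list of floats for each line that contains 'LOOP+'."""
    return [
        [float(value) for value in _FLOAT.findall(line)]
        for line in outcar_text.splitlines()
        if "LOOP+" in line
    ]


# Illustrative line only; the real layout of the line is an assumption.
sample = "     LOOP+:  VPU time   47.82: CPU time   47.91\n"
times = loop_plus_times(sample)
```

For a real run, the text would be read from the file first, for example with `open("OUTCAR").read()`.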