
3.10 Performance of serial code

The benchmark numbers given here were measured with a benchmark designed to mimic the behavior of VASP. It consists of three parts: one measuring matrix × matrix performance (LINCOM-TPP), one measuring matrix × vector performance, and one measuring the performance of 3d-FFT (FFTTEST). The mixture of all parts is similar to what one would encounter in a simulation of a rather large system (60-100 transition metal atoms). For the matrix × matrix part DGEMM is used; for the matrix × vector part, the results for DGEMV, do-loops or DGEMM are taken (depending on where the machine scores highest). FFTTEST tests 3d-FFT and uses either an optimized routine supplied by the manufacturer or a routine written and optimized by J. Furthmüller.
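The Mflops rows in the tables below relate an operation count to a run time. As a rough illustration, the sketch below uses the usual textbook counts for real matrix × matrix and matrix × vector products (2mnk and 2mn operations). The helper names and the example sizes are hypothetical; the text does not state the problem sizes or the counting convention used by the benchmark itself.

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Textbook operation count for C = A*B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def dgemv_flops(m: int, n: int) -> int:
    """Textbook operation count for y = A*x with A (m x n)."""
    return 2 * m * n


def mflops(flops: int, seconds: float) -> float:
    """Convert an operation count and a run time into Mflops."""
    return flops / seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical example: a 1000 x 1000 x 1000 product taking 8 seconds.
    print(mflops(dgemm_flops(1000, 1000, 1000), 8.0))
```

With these made-up numbers the product runs at 250 Mflops, which is in the range reported for the workstations below.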

It can be seen that no machine matches the IBM performance for matrix × vector operations, whereas most other machines outperform the IBM on matrix × matrix operations. In fact the IBM behaves like a small CRAY: all routines execute almost exactly 4 times slower than on the CRAY. In addition IBM supplies highly optimized FFT routines; as a matter of fact, only on IBM machines do the manufacturer-supplied routines outperform the libraries optimized by J. Furthmüller.

                      IBM RS6000  IBM RS6000  IBM RS6000  IBM RS6000  IBM RS6000
                      590         3CT         595         595         397
lincom-TPP (Mflops)   245         237         389         389         580
matrix-vec (Mflops)   110         110         110                     300
lincom-TPP            40.6 s      42.7 s      25.0 s      21.4 s      17.8 s
matrix-vec            32.3 s      40.4 s      32.3 s      19.4 s      15.3 s
fft                   31.4 s      35.0 s      24.0 s      17.3 s      14.4 s
TOTAL                 103 s       117 s       81.3 s      58.3 s      47.5 s
RATING                1           0.9         1.3         1.8         2.2
bench.Hg              1663        1920        1380        1000        809

                      SGI         SGI         DEC-SX      DEC-LX      HP          LINUX-PC
                      Power C.    Origin      ev5/530     ev5/530     C 180       PII 400
lincom-TPP (Mflops)   300         430         439         650         400         240
matrix-vec (Mflops)   38          100/150     74/108      67/100      55/80       70/100
lincom-TPP            32.0 s      22.0 s      21.8 s      14.3 s      25.0 s      40 s
matrix-vec            90.2 s      31.0 s      42.0 s      48.8 s      40.0 s      44 s
fft                   41.0 s      17.0 s      26.1 s      17.8 s      25.0 s      34 s
TOTAL                 163 s       70 s        90 s        81 s        88 s        118 s
RATING                0.64        1.47        1.12        1.3         1.2         0.9
bench.Hg              2200/653 ** 1200/330 ** 1424        1140        -           2250 s

                      CRAY T3D *  CRAY T3E *  CRAY        CRAY        VPP
                      ev4         ev5         C90         J90         500
lincom-TPP (Mflops)   96          400         800         188         1500
matrix-vec (Mflops)   28/42       101         459         50          600
lincom-TPP            99.5 s      25 s        12.0 s      53 s        7.1 s
matrix-vec            110.0 s     33 s        8.3 s       74 s        5.0 s
fft                   174.0 s     42 s        6.9 s       43 s        5.4 s
TOTAL                 400 s       100 s       27.2 s      170 s       17.5 s
RATING                0.25        1.0         4.1         0.6         6.5
bench.Hg                          639+

*  VASP.4.4, hardware data streaming enabled; bench.Hg is running on 4 nodes, all other data per node
   system equipped with 2 (first) or 4 (second) memory boards.
** second value is for 4 nodes
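The text does not define the RATING row. For most columns the published values are close to the TOTAL of the IBM RS6000 590 divided by the TOTAL of the machine in question, so the sketch below assumes that definition; it is an inference from the numbers, and a few published ratings (for example CRAY C90 and VPP 500) deviate from it. The second helper checks the "4 times slower" remark, assuming it compares the IBM RS6000 590 with the CRAY C90, which the text does not state explicitly. All function names are hypothetical.

```python
# TOTAL timings in seconds for the IBM RS6000 columns, copied from the table.
IBM_TOTALS = {
    "590": 103.0,
    "3CT": 117.0,
    "595 (first)": 81.3,
    "595 (second)": 58.3,
    "397": 47.5,
}

# Per-test timings in seconds, copied from the table.
IBM_590 = {"lincom-TPP": 40.6, "matrix-vec": 32.3, "fft": 31.4}
CRAY_C90 = {"lincom-TPP": 12.0, "matrix-vec": 8.3, "fft": 6.9}


def rating(total_seconds, reference_seconds=103.0):
    """Speed relative to a reference machine (assumed: the IBM RS6000 590)."""
    return reference_seconds / total_seconds


def slowdown(machine, reference):
    """Per-test ratio of timings: how many times slower `machine` is."""
    return {test: machine[test] / reference[test] for test in machine}


if __name__ == "__main__":
    for name, total in IBM_TOTALS.items():
        print(f"{name:14s} rating = {rating(total):.2f}")
    for test, ratio in slowdown(IBM_590, CRAY_C90).items():
        print(f"{test:12s} IBM 590 / CRAY C90 = {ratio:.1f}")
```

Under these assumptions the IBM ratings come out as 1.00, 0.88, 1.27, 1.77 and 2.17, against 1, 0.9, 1.3, 1.8 and 2.2 in the table, and the IBM 590 is slower than the CRAY C90 by factors of about 3.4, 3.9 and 4.6 on the three tests.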

You can test your own machine by compiling ffttest and dgemmtest in the vasp.4.X (X>3) directory, and typing

 dgemmtest <lincom.table 
 dgemmtest <rpro.table
 ffttest
This will run the tests "LINCOM-TPP", "matrix-vec" and "fft", in this order.
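The three commands can also be driven from a small script that records the wall-clock time of each run. This is a minimal sketch, assuming the executables and the two input tables sit in the current directory (the `./` prefix is an assumption, not part of the text); `run_timed` and `run_all` are hypothetical helper names. The timing taken here includes program start-up and is not necessarily the figure the test programs report themselves.

```python
import subprocess
import time

# (test name, command, file fed to standard input or None), in the order above.
TESTS = [
    ("LINCOM-TPP", ["./dgemmtest"], "lincom.table"),
    ("matrix-vec", ["./dgemmtest"], "rpro.table"),
    ("fft", ["./ffttest"], None),
]


def run_timed(command, stdin_path=None):
    """Run one command, optionally redirecting stdin, and return wall-clock seconds."""
    start = time.perf_counter()
    if stdin_path is None:
        subprocess.run(command, check=True)
    else:
        with open(stdin_path, "rb") as stdin_file:
            subprocess.run(command, stdin=stdin_file, check=True)
    return time.perf_counter() - start


def run_all(tests=TESTS):
    """Run every test in order and return {test name: seconds}."""
    return {name: run_timed(command, stdin_path)
            for name, command, stdin_path in tests}
```

Calling `run_all()` in the directory that holds the compiled programs returns a dictionary with one wall-clock timing per test; a failing program raises `subprocess.CalledProcessError` because of `check=True`.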

The table also shows the timings achieved with the bench.Hg.tar benchmark, which is located on the VASP server. The timings are those written in the line "LOOP+" in the OUTCAR file (type: grep 'LOOP+' OUTCAR). Recent algorithmic improvements towards more memory locality (NSIM=4) make the matrix-vector part less important. Machines like the DEC Alphas or the SGI Power Challenge, which have a very fast or fast CPU, respectively, and a small memory bandwidth, benefit most from these improvements. In addition, for the bench.Hg benchmark the performance of the matrix-matrix part plays a more significant role than in the synthetic benchmark.
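The grep step can be reproduced in a script when several OUTCAR files have to be compared. The sketch below only assumes that the relevant lines contain the string LOOP+ and one or more decimal numbers; the sample line is made up for illustration, and the exact layout of the line may differ between VASP versions. The function name is hypothetical.

```python
import re

# Matches plain decimal numbers such as 809.12.
FLOAT = re.compile(r"[-+]?\d+\.\d+")


def loop_plus_timings(lines):
    """Return the decimal numbers found on each line that contains 'LOOP+'."""
    timings = []
    for line in lines:
        if "LOOP+" in line:
            timings.append([float(value) for value in FLOAT.findall(line)])
    return timings


if __name__ == "__main__":
    # Hypothetical excerpt; a real OUTCAR line may be laid out differently.
    sample = [
        "      LOOP:  cpu time   40.10: real time   40.25",
        "     LOOP+:  cpu time  809.12: real time  811.40",
    ]
    print(loop_plus_timings(sample))
```

For a real file, pass the open file object instead of the sample list, for example `with open("OUTCAR") as f: print(loop_plus_timings(f))`.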



MASTER USER VASP
Mon Mar 29 10:38:29 MEST 1999