Blue Gene/Q Performance Test

From CCI User Wiki

Function Call Overhead

Discussion

The following tests measure the performance of non-inlined function calls on the BGQ and on a 2009 AMD x86_64 system. Test results show performance on the BGQ to be between 3.7 and 42 times lower than on the x86_64 system. Test 3, which shows the largest performance decrease, executes an integer comparison and increment within each recursive call. The smallest performance decrease is observed in Test 2, where each function call executes a floating point add, divide, and subtract.

Test 4 indicates that the majority of the function call overhead is the result of function returns. Given the same total number of instructions (BGPM event counter PEVT_INST_ALL), the branchReturn.c test, which executes the assembly instruction blr to initiate a function return, is three times slower than the add.c test, which executes only integer add instructions. The slower performance appears to be caused by branching cycle penalties: five cycles for a branch predicted to be taken, and at least 13 cycles for a mispredicted branch. The details of these penalties are described in the A2 Processor Manual:

  • Section D.2.7.2 states, (emphasis added) "Because the IU0 stage speculatively follows the not-taken path, the taken-branch also flushes any following fetches, and redirects the thread to the predicted taken fetch address. This taken-branch redirection requires five cycles from the fetch of the branch to fetch of the predicted target."
  • Section D.2.7.4 states, (emphasis added) "The target and direction of branches is computed in EX1. In most cases, mispredicted branches are detected, and a mispredicted branch flush is generated, in EX2. For a mispredicted branch, there are a minimum of 13 cycles from when the branch is fetched (or issued) until the correct target is fetched (or issued). In the case of a mispredicted return instruction, the mispredicted branch is not detected until the next mis-speculated instruction reaches EX5."
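
These penalties are consistent with the Test 4 counter data reported at the end of this page:

 add.xl:          8,125,002,758 cycles / 4,125,000,100 instructions = ~2.0 cycles per instruction
 branchReturn.xl: 24,750,010,044 cycles / 4,125,000,101 instructions = ~6.0 cycles per instruction

i.e. the branch-and-return sequence takes roughly three times as many cycles for the same instruction count.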

For increased performance, enable the higher-level compiler optimizations that include inter-procedural analysis and automatic function inlining. For the XL compilers, the '-O5 -qstrict -qinline' flags enable these features without altering the semantics of the program. Please refer to the XL 12.1 compiler reference for more details on the optimization levels; similar information is also available via the compiler man pages, e.g. man bgxlc.
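
For example, rebuilding the Test 2 code with these flags might look like the following (a sketch modeled on the Build steps below; the output name is only an example):

 module load xl
 mpicc -O5 -qstrict -qinline foo*.c -o fooDriver.O5xl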

Test 1

The following test measures the time taken to call sprintf 10 million times (COUNT * ITERS).

Code

test.c

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

#define COUNT 1000*1000
#define ITERS 10

int main(int argc, char** argv) {
  int i, n;
  double t0, t1;
  char buff[256];
  MPI_Init(&argc,&argv);
  printf("COUNT %d ITERS %d\n", COUNT, ITERS);
  t0 = MPI_Wtime();
  for (i=0; i < ITERS; ++i) {
    for (n=0; n < COUNT; ++n) {
      sprintf(buff, "What time does the narwhal bacon?: %d %f", n, (double)n);  
    }
  }
  t1 = MPI_Wtime();
  printf("sprintf time %f seconds\n",t1-t0);
  MPI_Finalize();
  return 0;
}

Build

 module load gnu
 mpicc -O3 test.c -o test.O3gnu

Run

runGNU.sh

 #!/bin/bash
 module load gnu
 srun ./test.O3gnu

submit

 sbatch -t 5 -n 1 runGNU.sh

Results

The test runs ~13 times faster on the x86_64 machine remus than on the BGQ.

SCOREC remus, GCC 4.4.5 compiler:

 COUNT 1000000 ITERS 10
 sprintf time 7.561345 seconds

CCI BGQ, XL 12.01.0000.0000 compiler:

 COUNT 1000000 ITERS 10
 sprintf time 99.341142 seconds

CCI BGQ, GNU compiler:

 COUNT 1000000 ITERS 10
 sprintf time 101.557778 seconds

ANL Vesta BGQ, XL 12.01.0000.0004 compiler:

 COUNT 1000000 ITERS 10
 sprintf time 104.619893 seconds

ANL Vesta BGQ, Clang 3.4 compiler:

 COUNT 1000000 ITERS 10
 sprintf time 105.142043 seconds

Test 2

The following test measures the time taken to make 200 million calls to the function foo.

Listed below is an excerpt of the x86_64 assembly produced by GCC 4.4.5 at optimization level 3, as shown by 'objdump -S'. Note that the function 'foo' is not inlined.

...
00000000004008e0 <foo>:
  4008e0:       66 0f 28 c8             movapd %xmm0,%xmm1
  4008e4:       f2 0f 58 c8             addsd  %xmm0,%xmm1
  4008e8:       f2 0f 59 05 c0 01 00    mulsd  0x1c0(%rip),%xmm0        # 400ab0 <_IO_stdin_used+0x8>
  4008ef:       00 
  4008f0:       f2 0f 58 c1             addsd  %xmm1,%xmm0
  4008f4:       c3                      retq
  4008f5:       90                      nop 
  ...
  4008ff:       90                      nop

0000000000400900 <main>:
  ...
  400930:       e8 a3 fe ff ff          callq  4007d8 <MPI_Wtime@plt>
  400935:       f2 0f 10 0d ab 01 00    movsd  0x1ab(%rip),%xmm1        # 400ae8 <_IO_stdin_used+0x40>
  40093c:       00 
  40093d:       f2 0f 11 44 24 18       movsd  %xmm0,0x18(%rsp)
  400943:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400948:       66 0f 28 c1             movapd %xmm1,%xmm0
  40094c:       83 c3 01                add    $0x1,%ebx
  40094f:       e8 8c ff ff ff          callq  4008e0 <foo>
  400954:       81 fb 00 c2 eb 0b       cmp    $0xbebc200,%ebx
  40095a:       66 0f 28 c8             movapd %xmm0,%xmm1
  40095e:       75 e8                   jne    400948 <main+0x48>
  400960:       f2 0f 11 04 24          movsd  %xmm0,(%rsp)
  400965:       e8 6e fe ff ff          callq  4007d8 <MPI_Wtime@plt>
  ...
...
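
The same check can be applied to the BGQ binary to confirm that 'foo' is not inlined there either; for example (a sketch, assuming it is run where a PowerPC-capable objdump is available, e.g. on a front-end node):

 objdump -S fooDriver.O3gnu | grep -A 10 '<foo>:'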


Code

foo.c

double foo(double d) {
  double e = d;
  e += d;
  e -= d/2;
  return e;
}

fooDriver.c

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

#define ITERS 200*1000*1000

double foo(double d);

int main(int argc, char** argv)
{
  int i;
  int n;
  double t0;
  double t1;
  double a = 1;
  MPI_Init(&argc,&argv);
  printf("ITERS %d\n", ITERS);
  t0 = MPI_Wtime();
  for (i=0; i < ITERS; ++i) {
    a = foo(a);
  }
  t1 = MPI_Wtime();
  printf("a %f\n", a);
  printf("call foo time %f seconds\n",t1-t0);
  MPI_Finalize();
  return 0;
}

Build

 module load gnu
 mpicc -O3 foo*.c -o fooDriver.O3gnu

Run

runGNU.sh

 #!/bin/bash
 module load gnu
 srun ./fooDriver.O3gnu

submit

 sbatch -t 5 -n 1 runGNU.sh

Results

The test runs ~3.7 times faster on the x86_64 machine remus than on the BGQ.

SCOREC remus, GCC 4.4.5 compiler:

 ITERS 200000000
 a -nan
 call foo time 0.962403 seconds

CCI BGQ fen, GNU 4.4.7 compiler:

 ITERS 200000000
 a nan
 call foo time 1.315391 seconds

CCI BGQ, XL 12.01.0000.0000 compiler:

 ITERS 200000000
 a nan
 call foo time 4.781255 seconds

CCI BGQ, GNU compiler:

 ITERS 200000000
 a nan
 call foo time 4.875005 seconds

ANL Vesta BGQ, XL 12.01.0000.0004 compiler:

 ITERS 200000000
 a nan
 call foo time 4.781253 seconds

ANL Vesta BGQ, GNU 4.4.6 compiler:

 ITERS 200000000
 a nan
 call foo time 4.875000 seconds

ANL Vesta BGQ, Clang 3.4 compiler:

 ITERS 200000000
 a nan
 call foo time 3.640630 seconds

Kaust BGP, XL 09.00.0000.0016 compiler *:

 ITERS 200000000
 a nan
 call foo time 4.705884 seconds

Kaust BGP, GCC 4.1.2 compiler *:

 ITERS 200000000
 a nan
 call foo time 4.705884 seconds

  • Multiple runs of the Kaust tests resulted in the same execution time.

Test 3

The following test measures the time taken to make a chain of 1000 mutually recursive calls between the functions 'a' and 'b'; the chain is executed 100,000 times.

Listed below is an excerpt from 'objdump -S' on the binary produced by x86_64 GCC 4.4.5 at optimization level 3. Note that the functions 'a' and 'b' are not inlined.

...
0000000000400900 <a>:
  400900:	81 ff e8 03 00 00    	cmp    $0x3e8,%edi
  400906:	74 08                	je     400910 <a+0x10>
  400908:	83 c7 01             	add    $0x1,%edi
  40090b:	e9 80 00 00 00       	jmpq   400990 <b>
  400910:	f3 c3                	repz retq 
  400912:	66 66 66 66 66 2e 0f 	nopw   %cs:0x0(%rax,%rax,1)
  400919:	1f 84 00 00 00 00 00 

...

0000000000400920 <main>:
  ...
  40093a:	e8 a1 fe ff ff       	callq  4007e0 <MPI_Init@plt>
  40093f:	e8 bc fe ff ff       	callq  400800 <MPI_Wtime@plt>
  400944:	f2 0f 11 44 24 08    	movsd  %xmm0,0x8(%rsp)
  40094a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  400950:	31 ff                	xor    %edi,%edi
  400952:	83 c3 01             	add    $0x1,%ebx
  400955:	e8 36 00 00 00       	callq  400990 <b>
  40095a:	81 fb a0 86 01 00    	cmp    $0x186a0,%ebx
  400960:	75 ee                	jne    400950 <main+0x30>
  400962:	e8 99 fe ff ff       	callq  400800 <MPI_Wtime@plt>
  ...
...

0000000000400990 <b>:
  400990:	81 ff e8 03 00 00    	cmp    $0x3e8,%edi
  400996:	74 08                	je     4009a0 <b+0x10>
  400998:	83 c7 01             	add    $0x1,%edi
  40099b:	e9 60 ff ff ff       	jmpq   400900 <a>
  4009a0:	f3 c3                	repz retq 
  4009a2:	90                   	nop
  ...
  4009af:	90                   	nop
...



Code

a.c

#include <mpi.h>
#include <stdio.h>

void b(unsigned i);

void a(unsigned i)
{
  if (i==1000)
    return;
  else
    b(i+1);
}

int main(int argc, char** argv)
{
  int i;
  double t0,t1;
  MPI_Init(&argc,&argv);
  t0 = MPI_Wtime();
  for (i=0; i < 100*1000; ++i)
    b(0);
  t1 = MPI_Wtime();
  printf("time: %f\n",t1-t0);
  MPI_Finalize();
  return 0;
}

b.c

void a(unsigned i);

void b(unsigned i)
{
  if (i==1000)
    return;
  else
    a(i+1);
}


Build

 module load gnu
 mpicc -O3 [ab].c -o test.O3gnu

Run

runGNU.sh

 #!/bin/bash
 module load gnu
 srun ./test.O3gnu

submit

 sbatch -t 5 -n 1 runGNU.sh

Results

The test runs ~42 times faster on the x86_64 machine remus than on the BGQ.

SCOREC remus, GCC 4.4.5 compiler:

 time: 0.094465

CCI BGQ fen, GNU 4.4.7 compiler:

 time: 0.557759

TACC Stampede Phi, Intel 13.1.0 compiler:

 time: 0.736386

Kaust BGP, XL 09.00.0000.0016 compiler:

 time: 1.967278

Kaust BGP, GCC 4.1.2 compiler:

 time: 1.967247

CCI BGQ, XL 12.01.0000.0000 compiler:

 time: 4.386511

CCI BGQ, GNU 4.4.6 compiler:

 time: 4.671912

ANL Vesta BGQ, XL 12.01.0000.0004 compiler:

 time: 4.386181

ANL Vesta BGQ, GNU 4.4.6 compiler:

 time: 4.671456

ANL Vesta BGQ, Clang 3.4 compiler:

 time: 4.017650

Test 4

The following test compares the time taken to execute four integer add assembly instructions with the time taken to execute a branch and a branch-return (blr) instruction. BGPM counters record the number of clock cycles, the number of instructions executed, and the number of XU (integer/load/store) and AXU (floating point) instructions.

Code

add.c

#include <mpi.h>
#include <stdio.h>

#include <bgpm/include/bgpm.h>

int main(int argc, char** argv)
{
  int i;
  double t0,t1;
  MPI_Init(&argc,&argv);

  Bgpm_Init(BGPM_MODE_SWDISTRIB);
  // Create event set and add events needed to calculate CPI.
  int hEvtSet = Bgpm_CreateEventSet();
  unsigned evtList[] =
        { PEVT_CYCLES,
          PEVT_INST_ALL,
          PEVT_LSU_COMMIT_CACHEABLE_LDS,
          PEVT_L1P_BAS_MISS,
          PEVT_INST_XU_ALL,
          PEVT_INST_QFPU_ALL,
          PEVT_INST_QFPU_FPGRP1
        };
  Bgpm_AddEventList(hEvtSet, evtList, sizeof(evtList)/sizeof(unsigned) );
  Bgpm_Apply(hEvtSet);

  __asm__("li r21,1\nli r20,1");
  t0 = MPI_Wtime();
  Bgpm_Start(hEvtSet);
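  /* timed region: four dependent integer adds, r21 = r21 + r20 repeated four times */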
  __asm__("add r21,r21,r20\n"
          "add r21,r21,r20\n"
          "add r21,r21,r20\n"
          "add r21,r21,r20");
  Bgpm_Stop(hEvtSet);
  t1 = MPI_Wtime();
  printf("time: %f\n",t1-t0);

  printf("Counter Results:\n");
  int numEvts = Bgpm_NumEvents(hEvtSet);
  uint64_t cnt;
  for (i=0; i<numEvts; i++) {
     Bgpm_ReadEvent(hEvtSet, i, &cnt);
     printf("    0x%016lx (%ld) <= %s\n", cnt, cnt, Bgpm_GetEventLabel(hEvtSet, i));
  }

  uint64_t numCycles = 0;
  Bgpm_ReadEvent(hEvtSet, 0, &numCycles);
  uint64_t numInst = 0;
  Bgpm_ReadEvent(hEvtSet, 1, &numInst);
  uint64_t numXU = 0;
  Bgpm_ReadEvent(hEvtSet, 4, &numXU);
  uint64_t numFP = 0;
  Bgpm_ReadEvent(hEvtSet, numEvts-1, &numFP);
  printf("inst/cycle %0.3f\n", (double)numInst / (double)numCycles);
  printf("xu/cycle %0.3f\n", (double)numXU / (double)numCycles);
  printf("xu/inst %0.3f\n", (double)numXU / (double)numInst);
  printf("fp/cycle %0.3f\n", (double)numFP / (double)numCycles);
  printf("fp/inst %0.3f\n", (double)numFP / (double)numInst);

  MPI_Finalize();
  return 0;
}

branchReturn.c

#include <mpi.h>
#include <stdio.h>

#include <bgpm/include/bgpm.h>

int main(int argc, char** argv)
{
  int i;
  double t0,t1;
  MPI_Init(&argc,&argv);

  Bgpm_Init(BGPM_MODE_SWDISTRIB);
  // Create event set and add events needed to calculate CPI.
  int hEvtSet = Bgpm_CreateEventSet();
  unsigned evtList[] =
        { PEVT_CYCLES,
          PEVT_INST_ALL,
          PEVT_LSU_COMMIT_CACHEABLE_LDS,
          PEVT_L1P_BAS_MISS,
          PEVT_INST_XU_ALL,
          PEVT_INST_QFPU_ALL,
          PEVT_INST_QFPU_FPGRP1
        };
  Bgpm_AddEventList(hEvtSet, evtList, sizeof(evtList)/sizeof(unsigned) );
  Bgpm_Apply(hEvtSet);

  t0 = MPI_Wtime();
  Bgpm_Start(hEvtSet);
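   /* timed region: bl branches to 'test' and saves the return address in the
      link register, blr returns to the nop following the bl, and b then skips
      over the blr to 'test1' */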
   __asm__("bl test\n"
            "nop\n"
            "b test1\n"
            "test:\n"
            "blr\n"
            "test1:\n");
  Bgpm_Stop(hEvtSet);
  t1 = MPI_Wtime();
  printf("time: %f\n",t1-t0);

  printf("Counter Results:\n");
  int numEvts = Bgpm_NumEvents(hEvtSet);
  uint64_t cnt;
  for (i=0; i<numEvts; i++) {
     Bgpm_ReadEvent(hEvtSet, i, &cnt);
     printf("    0x%016lx (%ld) <= %s\n", cnt, cnt, Bgpm_GetEventLabel(hEvtSet, i));
  }

  uint64_t numCycles = 0;
  Bgpm_ReadEvent(hEvtSet, 0, &numCycles);
  uint64_t numInst = 0;
  Bgpm_ReadEvent(hEvtSet, 1, &numInst);
  uint64_t numXU = 0;
  Bgpm_ReadEvent(hEvtSet, 4, &numXU);
  uint64_t numFP = 0;
  Bgpm_ReadEvent(hEvtSet, numEvts-1, &numFP);
  printf("inst/cycle %0.3f\n", (double)numInst / (double)numCycles);
  printf("xu/cycle %0.3f\n", (double)numXU / (double)numCycles);
  printf("xu/inst %0.3f\n", (double)numXU / (double)numInst);
  printf("fp/cycle %0.3f\n", (double)numFP / (double)numCycles);
  printf("fp/inst %0.3f\n", (double)numFP / (double)numInst);

  MPI_Finalize();
  return 0;
}

Build

 module load xl
 XLFLAGS="-O0 -qstrict -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qreport -qhot "
 BGPMLINK="-L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm "
 mpicc $XLFLAGS add.c -o add.xl $BGPMLINK
 mpicc $XLFLAGS branchReturn.c -o branchReturn.xl $BGPMLINK

Run

run.sh

 #!/bin/bash -ex
 module load xl
 srun ./add.xl
 srun ./branchReturn.xl

submit

 sbatch -t 5 -n 1 run.sh

Results

CCI BGQ, XL 12.01.0000.0000 compiler:

 + module load xl
 ++ /usr/bin/modulecmd bash load xl
 + eval
 + XLFLAGS='-O0 -qstrict -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qreport -qhot '
 + BGPMLINK='-L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm '
 + mpicc -O0 -qstrict -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qreport -qhot add.c -o add.xl -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm
 1506-332 (W) Option -qprefetch=aggressive is ignored because option -O3 -qhot=level=1 is not specified.
 + mpicc -O0 -qstrict -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qreport -qhot branchReturn.c -o branchReturn.xl -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm
 1506-332 (W) Option -qprefetch=aggressive is ignored because option -O3 -qhot=level=1 is not specified.
 + srun ./add.xl
 time: 5.078130
 Counter Results:
   0x00000001e449b406 (8125002758) <= PEVT_CYCLES
   0x00000000f5de81a4 (4125000100) <= PEVT_INST_ALL
   0x000000000000001c (28) <= PEVT_LSU_COMMIT_CACHEABLE_LDS
   0x0000000000000009 (9) <= PEVT_L1P_BAS_MISS
   0x00000000f5de81a4 (4125000100) <= PEVT_INST_XU_ALL
   0x0000000000000000 (0) <= PEVT_INST_QFPU_ALL
   0x0000000000000000 (0) <= PEVT_INST_QFPU_FPGRP1
 inst/cycle 0.508
 xu/cycle 0.508
 xu/inst 1.000
 fp/cycle 0.000
 fp/inst 0.000
 + srun ./branchReturn.xl
 time: 15.468761
 Counter Results:
   0x00000005c3372ebc (24750010044) <= PEVT_CYCLES
   0x00000000f5de81a5 (4125000101) <= PEVT_INST_ALL
   0x000000000000001c (28) <= PEVT_LSU_COMMIT_CACHEABLE_LDS
   0x0000000000000017 (23) <= PEVT_L1P_BAS_MISS
   0x00000000f5de81a5 (4125000101) <= PEVT_INST_XU_ALL
   0x0000000000000000 (0) <= PEVT_INST_QFPU_ALL
   0x0000000000000000 (0) <= PEVT_INST_QFPU_FPGRP1
 inst/cycle 0.167
 xu/cycle 0.167
 xu/inst 1.000
 fp/cycle 0.000
 fp/inst 0.000