The hard way of measuring FLOPS is to modify your program so that it keeps track of the number of floating-point operations performed in each module or function, run it on your target hardware, and finally divide the operation count by the elapsed time. However, this may require extensive modification of the program, and if the counting is done at too granular a level (i.e., in too tight a loop) it can affect the performance of the program.
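
As an illustration of this manual approach, here is a minimal Python sketch. The names `FlopCounter`, `dot` and `measure` are hypothetical and not taken from any real program; the operation count is recorded once per function call rather than inside the loop, precisely to avoid the overhead problem mentioned above.

```python
import time


class FlopCounter:
    """Accumulates operation counts reported by the program itself."""

    def __init__(self):
        self.flops = 0

    def add(self, n):
        self.flops += n


def dot(x, y, counter):
    """Dot product that records its own floating-point operation count."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    # One multiply and one add per element, recorded once per call
    # instead of once per iteration to keep the bookkeeping cheap.
    counter.add(2 * len(x))
    return total


def measure(n=100_000, repeats=20):
    """Run the kernel and return (result, operation count, elapsed seconds)."""
    counter = FlopCounter()
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        result = dot(x, y, counter)
    elapsed = time.perf_counter() - start
    return result, counter.flops, elapsed


if __name__ == "__main__":
    result, flops, elapsed = measure()
    print(f"{flops} operations in {elapsed:.4f} s "
          f"= {flops / elapsed / 1e9:.4f} GFLOPS")
```

Even in this tiny example the count has to be worked out by hand for each kernel, which is what makes the approach laborious for a real program.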

A much easier way of measuring FLOPS for a particular combination of program and hardware is to use the CPU performance counters, which are now conveniently accessible under Linux through the perf tools. In particular, my Intel CPU can count an event called FP_COMP_OPS_EXE, which I guess stands for Floating Point Computation Operations Executed. This event has five umasks:

  • X87: traditional 8087-style 80-bit floating-point operations

  • SSE_FP_PACKED_DOUBLE: SSE double-precision on packed data (128-bit registers, so this is two operations)

  • SSE_FP_SCALAR_SINGLE: one single-precision operation

  • SSE_PACKED_SINGLE: four single-precision operations (32-bit single-precision values packed into a 128-bit register)

  • SSE_SCALAR_DOUBLE: one double-precision operation

These events can be turned into codes to be monitored as explained here, leading in my case to the following output::

 Detected PMU models:
    [18, ix86arch, "Intel X86 architectural PMU"]
    [51, perf, "perf_events generic PMU"]
    [68, snb, "Intel Sandy Bridge"]
 Total events: 2332 available, 166 supported
 Requested Event: FP_COMP_OPS_EXE:X87
 Actual    Event: snb::FP_COMP_OPS_EXE:X87:k=1:u=1:e=0:i=0:c=0:t=0
 PMU            : Intel Sandy Bridge
 IDX            : 142606353
 Codes          : 0x530110
 Actual    Event: snb::FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE:k=1:u=1:e=0:i=0:c=0:t=0
 PMU            : Intel Sandy Bridge
 IDX            : 142606353
 Codes          : 0x531010
 Actual    Event: snb::FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0
 PMU            : Intel Sandy Bridge
 IDX            : 142606353
 Codes          : 0x532010
 Actual    Event: snb::FP_COMP_OPS_EXE:SSE_PACKED_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0
 PMU            : Intel Sandy Bridge
 IDX            : 142606353
 Codes          : 0x534010
 Actual    Event: snb::FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE:k=1:u=1:e=0:i=0:c=0:t=0
 PMU            : Intel Sandy Bridge
 IDX            : 142606353
 Codes          : 0x538010
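
The five codes follow a regular pattern: the low byte is always 0x10, the next byte changes with the umask, and the top byte is always 0x53. The sketch below decodes them, assuming the usual Intel performance event select register layout (event select in bits 0-7, umask in bits 8-15, control flags in bits 16-23, counter mask in bits 24-31). The helper names `decode` and `encode` are mine, and the layout should be checked against the documentation for your own CPU.

```python
# Control-flag bit positions, assumed from the Intel event select register layout.
FLAG_BITS = {
    16: "USR",   # count while in user mode
    17: "OS",    # count while in kernel mode
    18: "EDGE",
    19: "PC",
    20: "INT",
    21: "ANY",
    22: "EN",    # counter enabled
    23: "INV",
}


def decode(code):
    """Split a raw event code into its assumed register fields."""
    return {
        "event": code & 0xFF,
        "umask": (code >> 8) & 0xFF,
        "flags": [name for bit, name in FLAG_BITS.items() if code & (1 << bit)],
        "cmask": (code >> 24) & 0xFF,
    }


def encode(event, umask, flags=0x53):
    """Rebuild a raw code; 0x53 is the flag byte seen in the output above."""
    return (flags << 16) | (umask << 8) | event


# Codes copied from the output above.
CODES = {
    "X87": 0x530110,
    "SSE_FP_PACKED_DOUBLE": 0x531010,
    "SSE_FP_SCALAR_SINGLE": 0x532010,
    "SSE_PACKED_SINGLE": 0x534010,
    "SSE_SCALAR_DOUBLE": 0x538010,
}

for name, code in CODES.items():
    fields = decode(code)
    print(f"{name:22s} event=0x{fields['event']:02x} "
          f"umask=0x{fields['umask']:02x} flags={'|'.join(fields['flags'])}")
```

Under this reading, every code selects event 0x10 with a different single-bit umask, and the USR and OS flags appear to correspond to the `u=1` and `k=1` modifiers in the listing.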

The resulting codes are supplied to the perf stat program, each prefixed with r to mark it as a raw event. The resulting event counts are added up, each weighted by the number of operations it represents, to give the total number of operations, which is then divided by the total time taken. For example, for a carefully tuned program, I measure the following::

 xcorrelators-bench/CPU-correlator$ perf stat -e r530110 -e r531010 -e r532010 -e r534010 -e r538010  ./correlator 
 total maxFlops with 2 threads is: 18.5955
 correlate took 1.09114 s, max Gflops = 18.5955, achieved 17.8384 Gflops, 95.9282 % efficiency
 throughput: 6.9222 GB/s load, 4.6148 MB/s store

  Performance counter stats for './correlator':

             32,693 r530110                                                      [80.01%]
                  0 r531010                                                      [79.99%]
                  0 r532010                                                      [80.01%]
     39,195,349,051 r534010                                                      [80.02%]
                 17 r538010                                                      [80.02%]

        8.015465141 seconds time elapsed

This shows that the program executed close to 40 billion packed single-precision events, i.e., roughly 157 billion single-precision operations in total, during its 8-second run time, leading to a direct estimate of about 20 GFLOPS of practical performance in this case.
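
The arithmetic can be scripted. The following Python sketch parses the counter lines shown above and applies the per-event weights from the umask list; treating each X87 event as a single operation is my assumption, and the function names are hypothetical.

```python
import re

# Operations represented by one counted event, per the umask list above.
OPS_PER_EVENT = {
    "r530110": 1,  # X87 (assumed to be one operation per event)
    "r531010": 2,  # SSE_FP_PACKED_DOUBLE
    "r532010": 1,  # SSE_FP_SCALAR_SINGLE
    "r534010": 4,  # SSE_PACKED_SINGLE
    "r538010": 1,  # SSE_SCALAR_DOUBLE
}

# Counter lines and elapsed time copied from the perf stat output above.
PERF_OUTPUT = """
             32,693 r530110                                                      [80.01%]
                  0 r531010                                                      [79.99%]
                  0 r532010                                                      [80.01%]
     39,195,349,051 r534010                                                      [80.02%]
                 17 r538010                                                      [80.02%]

        8.015465141 seconds time elapsed
"""

COUNT_RE = re.compile(r"^\s*([\d,]+)\s+(r[0-9a-fA-F]+)\b", re.MULTILINE)
TIME_RE = re.compile(r"^\s*([\d.]+)\s+seconds time elapsed", re.MULTILINE)


def parse_perf_stat(text):
    """Return ({event: count}, elapsed_seconds) from perf stat text output."""
    counts = {event: int(value.replace(",", ""))
              for value, event in COUNT_RE.findall(text)}
    elapsed = float(TIME_RE.search(text).group(1))
    return counts, elapsed


def gflops(counts, elapsed):
    """Weighted operation total divided by wall-clock time, in GFLOPS."""
    total_ops = sum(OPS_PER_EVENT[event] * n for event, n in counts.items())
    return total_ops / elapsed / 1e9


if __name__ == "__main__":
    counts, elapsed = parse_perf_stat(PERF_OUTPUT)
    print(f"{gflops(counts, elapsed):.2f} GFLOPS")  # prints 19.56 GFLOPS
```

Note that the elapsed time here covers the whole process, including start-up and any set-up work, so the figure is an average over the full run rather than over the compute kernel alone.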

Note added November 2019: See also the article on measuring FLOPs from Python, which is far more convenient for Python-based numerical applications.