Instructions per cycle

In computer architecture, instructions per cycle (IPC) is one aspect of a processor's performance: the average number of instructions executed for each clock cycle. It is the multiplicative inverse of cycles per instruction.[1]

Explanation

Calculation of IPC

The number of instructions per second and floating point operations per second for a processor can be derived by multiplying the number of instructions per cycle by the clock rate (cycles per second, given in hertz) of the processor in question. The number of instructions per second is an approximate indicator of the likely performance of the processor.
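The multiplication above can be sketched as a one-line calculation; the IPC and clock values below are hypothetical, chosen only to illustrate the units:

```python
# Deriving instructions per second from IPC and clock rate.
# instructions/second = IPC (instructions/cycle) * clock rate (cycles/second)
def instructions_per_second(ipc: float, clock_hz: float) -> float:
    return ipc * clock_hz

# A hypothetical processor averaging 2.0 instructions per cycle at 3 GHz:
ips = instructions_per_second(2.0, 3.0e9)
print(f"{ips:.3e} instructions/second")  # 6.000e+09
```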

The number of instructions executed per clock cycle is not constant for a given processor; it depends on how the particular software being run interacts with the processor, and indeed with the entire machine, particularly the memory hierarchy. However, certain processor features tend to lead to designs with higher-than-average IPC values: the presence of multiple arithmetic logic units (an ALU is a processor subsystem that can perform elementary arithmetic and logical operations), and short pipelines. When comparing different instruction sets, a simpler instruction set may lead to a higher IPC figure than an implementation of a more complex instruction set using the same chip technology; however, the more complex instruction set may be able to accomplish more useful work with fewer instructions.
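The trade-off between IPC and instruction count can be made concrete with the standard execution-time formula (time = instructions ÷ IPC ÷ clock rate). The instruction counts and IPC values below are invented for illustration, not measurements of any real processor:

```python
# Illustrative comparison with made-up numbers: a simpler instruction set with a
# higher IPC can take the same time as a more complex one needing fewer instructions.
def execution_time(instruction_count: int, ipc: float, clock_hz: float) -> float:
    # cycles = instructions / IPC; time = cycles / clock rate
    return instruction_count / ipc / clock_hz

simple = execution_time(1_200_000, ipc=1.5, clock_hz=2.0e9)  # more, simpler instructions
complex_ = execution_time(800_000, ipc=1.0, clock_hz=2.0e9)  # fewer, more complex instructions
print(simple, complex_)  # both 0.0004 s: same useful work despite different IPC
```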

Factors governing IPC

A given level of instructions per second can be achieved with a high IPC and a low clock speed (as in the AMD Athlon and Intel's Core series), or with a low IPC and a high clock speed (as in the Intel Pentium 4 and, to a lesser extent, the AMD Bulldozer). Both are valid processor designs, and the choice between the two is often dictated by history, engineering constraints, or marketing pressures.
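A quick arithmetic check shows how both design points can reach the same instruction throughput; the IPC and clock figures here are hypothetical, loosely modeled on the two styles described above:

```python
# Two hypothetical designs delivering identical instructions/second:
high_ipc_design = 3.0 * 2.0e9   # IPC 3.0 at 2 GHz (high-IPC, low-clock style)
high_clock_design = 1.5 * 4.0e9 # IPC 1.5 at 4 GHz (low-IPC, high-clock style)

assert high_ipc_design == high_clock_design == 6.0e9
print("both designs:", high_ipc_design, "instructions/second")
```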

IPC for microarchitectures

| CPU family | Double precision | Single precision |
|---|---|---|
| Intel Core and Intel Nehalem | 4 DP IPC: 2-wide SSE2 addition + 2-wide SSE2 multiplication | 8 SP IPC: 4-wide SSE addition + 4-wide SSE multiplication |
| Intel Sandy Bridge and Intel Ivy Bridge | 8 DP IPC: 4-wide AVX addition + 4-wide AVX multiplication | 16 SP IPC: 8-wide AVX addition + 8-wide AVX multiplication |
| Intel Haswell, Intel Broadwell and Intel Skylake | 16 DP IPC: two 4-wide FMA instructions | 32 SP IPC: two 8-wide FMA instructions |
| AMD K10 | 4 DP IPC: 2-wide SSE2 addition + 2-wide SSE2 multiplication | 8 SP IPC: 4-wide SSE addition + 4-wide SSE multiplication |
| AMD Bulldozer, AMD Piledriver and AMD Steamroller | 8 DP IPC: 4-wide FMA | 16 SP IPC: 8-wide FMA |
| Intel Atom (Bonnell, Saltwell and Silvermont) | 1.5 DP IPC: scalar SSE2 addition + scalar SSE2 multiplication every other cycle | 6 SP IPC: 4-wide SSE addition + 4-wide SSE multiplication every other cycle |
| AMD Bobcat | 1.5 DP IPC: scalar SSE2 addition + scalar SSE2 multiplication every other cycle | 4 SP IPC: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle |
| AMD Jaguar | 3 DP IPC: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles | 8 SP IPC: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle |
| ARM Cortex-A7 | 1 DP IPC: one VADD.F64 (VFP) every cycle | 2 SP IPC: one VMLA.F32 (VFP) every cycle |
| ARM Cortex-A9 | 1.5 DP IPC: scalar addition + scalar multiplication every other cycle | 4 SP IPC: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle |
| ARM Cortex-A15 | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| ARM Cortex-A32 | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| ARM Cortex-A35 | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| ARM Cortex-A53 | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| ARM Cortex-A57 | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| ARM Cortex-A72 | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| Qualcomm Krait | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| Qualcomm Kryo | 2 DP IPC: scalar FMA or scalar multiply-add | 8 SP IPC: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
| IBM PowerPC A2 (Blue Gene/Q), per core | 8 DP IPC: 4-wide QPX FMA every cycle | SP elements are extended to DP and processed on the same units |
| IBM PowerPC A2 (Blue Gene/Q), per thread | 4 DP IPC: 4-wide QPX FMA every other cycle | SP elements are extended to DP and processed on the same units |
| Intel Xeon Phi (Knights Corner), per core | 16 DP IPC: 8-wide FMA every cycle | 32 SP IPC: 16-wide FMA every cycle |
| Intel Xeon Phi (Knights Corner), per thread (two per core) | 8 DP IPC: 8-wide FMA every other cycle | 16 SP IPC: 16-wide FMA every other cycle |

x86 processors that support FMA also support full AVX, and processors that support AVX also support full SSE. Therefore, to check the IPC of AVX code on an FMA-capable processor, see the "Intel Sandy Bridge and Intel Ivy Bridge" row; to check the IPC of SSE code, see "Intel Core and Intel Nehalem". To assess IPC for operands wider than 64 bits, consult the microarchitecture's register specifications: the width of the registers determines how large a value a processor core can operate on at once. Note also that some instructions can combine two or more registers, so the number of registers matters as well. [2]
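The per-cycle figures in the table above are commonly used to estimate peak floating-point throughput. The sketch below shows the standard calculation (per-core FLOPs per cycle × clock rate × core count); the 3 GHz clock and 4-core count are hypothetical, while the 16 DP FLOPs/cycle figure comes from the Haswell row of the table:

```python
# Peak FLOPS estimate from the table's per-cycle figures.
def peak_flops(flops_per_cycle: float, clock_hz: float, cores: int) -> float:
    # peak = per-core FLOPs/cycle * clock rate (Hz) * number of cores
    return flops_per_cycle * clock_hz * cores

# A hypothetical 4-core Haswell-class chip at 3 GHz, 16 DP FLOPs/cycle per core:
print(peak_flops(16, 3.0e9, 4) / 1e9, "DP GFLOPS")  # 192.0 DP GFLOPS
```

Real sustained throughput falls short of this peak, since it assumes an uninterrupted stream of FMA instructions with no memory stalls.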

Computer speed

The useful work that can be done with any computer depends on many factors besides the processor speed. These factors include the processor architecture, the internal layout of the machine, the speed of the disk storage system, the speed of other attached devices, the efficiency of the operating system, and, most importantly, the high-level design of the application software in use.

For users and purchasers of a computer system, instructions per cycle is not a particularly useful indication of the performance of their system. For an accurate measure of performance relevant to them, application benchmarks are much more useful. Awareness of the IPC metric is useful mainly because it provides an easy-to-grasp example of why clock speed is not the only factor relevant to computer performance.

See also

References

  1. John L. Hennessy; David A. Patterson; Andrea C. Arpaci-Dusseau (2007). Computer Architecture: A Quantitative Approach.
This article is issued from Wikipedia (version of 10/17/2016). The text is available under the Creative Commons Attribution/Share Alike license; additional terms may apply for the media files.