As explained in post 2 - Bitslicing, bitsliced programs cannot use arithmetic instructions. Instead, arithmetic must be reimplemented in software using bitwise logical instructions. While this certainly increases code size compared to non-bitsliced code, it is not obvious how it affects performance: on one hand, a simple arithmetic operation becomes a large (software) circuit; on the other hand, thanks to bitslicing, this circuit computes m instances of the operation on m-bit registers.
In this post, we will reuse the example of the adder (cf. post Bitslicing - Arithmetic operations) to evaluate what performance we can expect from bitsliced arithmetic. Recall that an n-bit ripple-carry adder is a chain of n full adders, each composed of 5 instructions, for a total of n*5 instructions. Running this bitsliced adder on m-bit registers computes m additions in parallel, thus amortizing its cost to n*5/m instructions per addition.
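Concretely, here is what this adder looks like in C (a minimal sketch using the classic 5-instruction full adder; `reg` stands for whatever register type is used, e.g. `uint64_t`, or an SSE/AVX vector through intrinsics):

```c
#include <stdint.h>

typedef uint64_t reg;  /* m = 64: each register holds 1 bit of 64 values */

/* Full adder: 5 bitwise instructions. */
static inline void full_adder(reg a, reg b, reg cin, reg *sum, reg *cout) {
    reg t = a ^ b;                 /* 1 */
    *sum  = t ^ cin;               /* 2 */
    *cout = (a & b) | (t & cin);   /* 3, 4, 5 */
}

/* n-bit ripple-carry adder: a chain of n full adders (n*5 instructions).
   a[i] and b[i] hold the i-th bit of 64 independent n-bit inputs. */
static void ripple_carry_adder(int n, const reg *a, const reg *b, reg *res) {
    reg carry = 0;
    for (int i = 0; i < n; i++)
        full_adder(a[i], b[i], carry, &res[i], &carry);
}
```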
On a high-end CPU, this adder is unlikely to execute in exactly n*5 cycles: the number of registers, the superscalarity, and the cache latency all impact performance. In this post, we benchmark a bitsliced ripple-carry adder against native `add` instructions, and show that bitsliced additions are less efficient than native ones on Intel Skylake CPUs.
At the time of writing, the most recent Intel CPUs are derived from the Skylake microarchitecture. All the information provided in this section applies to the Kaby Lake and Coffee Lake microarchitectures as well as Skylake. Skylake is a deeply pipelined microarchitecture (i.e. it can contain many instructions at the same time, all going through different phases), illustrated by the following diagram:
The pipeline consists of 2 main phases: the front end retrieves and decodes instructions in-order from the L1 instruction cache, while the out-of-order execution engine actually executes the instructions. The instructions are finally removed in-order from the pipeline by the retiring unit.
Note that this is a simplified view of the Skylake microarchitecture, whose purpose is only to explain what matters to us, and to show which parts we will be focusing on when analyzing the performance of our programs.
The L1 instruction cache contains x86 instructions represented as a sequence of bytes. Those instructions are decoded by the Legacy Decode Pipeline (MITE). The MITE operates in the following way:
- up to 16 bytes of instructions are fetched from the L1 instruction cache and pre-decoded into macro-ops.
- up to 6 macro-ops per cycle are delivered to the instruction queue (IQ), which performs macro-fusion: some common patterns are identified and optimized by fusing two macro-ops together. For instance, an increment followed by a conditional jump, often found at the end of a loop, may be fused into a single macro-op.
- the IQ delivers up to 5 macro-ops per cycle to the decoders. The decoders convert each macro-op into one or several μops, which are then sent to the Instruction Decode Queue (IDQ).
The MITE is limited by the fetcher to 16 bytes of instructions per cycle. This translates to 4 or 5 μops per cycle on programs manipulating integer registers, which is enough to maximize the bandwidth of the pre-decoder and decoders. However, SSE, AVX and AVX-512 instructions are often larger. For instance, an addition between two registers is encoded on 2 bytes for integer registers, 4 bytes for SSE and AVX, and 6 bytes for AVX-512. Therefore, on programs using SIMD extensions, the MITE tends to be limited by the fetcher. To overcome this, the Decoded Stream Buffer (DSB), a μop cache, can be used to bypass the whole MITE when dealing with sequences of instructions that have already been decoded. The DSB delivers up to 6 μops per cycle, directly to the IDQ.
The execution engine can be divided into 3 phases. The Allocate/Rename phase retrieves µops from the IDQ and sends them to the Scheduler, which dispatches µops to the execution core. Once µops have been executed, they are retired: all resources allocated for them are freed, and the µops are effectively removed from the pipeline.
While assembly code can only manipulate 16 general purpose registers (e.g. `rax`, `rdx`) and 16 SIMD registers (e.g. `xmm0`, `xmm1`), the CPU has hundreds of registers available. The renamer takes care of renaming architectural registers (i.e. the registers manipulated by the assembly) into micro-architectural registers (the internal registers of the CPU, also known as physical registers). The renamer also determines the possible execution ports for each instruction, and allocates any additional resources it may need (e.g. buffers for `load` and `store` instructions).
Additionally, the renamer recognizes some patterns and executes some instructions without going through the execution core. In particular, it performs:
- move elimination: some inter-register `mov`s can be removed by carefully mapping architectural registers to physical registers.
- zeroing idiom elimination: some idioms commonly used to set a register to zero are recognized and removed by the renamer; for instance, a `xor` or a subtraction whose two operands are the same register.
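As an illustration (a hypothetical example; the exact assembly depends on the compiler), here are two C functions whose compiled code typically contains such patterns:

```c
#include <stdint.h>

/* Returning a zero: compilers usually emit "xor eax, eax" rather than
   "mov eax, 0" (shorter encoding); the renamer recognizes this zeroing
   idiom and handles it without using an execution port. */
uint64_t zero(void) {
    return 0;
}

/* The copy of x into t typically becomes a register-to-register "mov";
   the renamer can map t and x to the same physical register instead of
   actually executing the mov on a port. */
uint64_t diff_and_sum(uint64_t x, uint64_t y) {
    uint64_t t = x;
    x += y;              /* x gets a new physical register */
    return x ^ (t - y);
}
```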
The scheduler then dispatches µops out-of-order to the execution units of the execution core. When an instruction could be executed on several execution units, the scheduler chooses one (the algorithm is not specified).
The execution core consists of several execution units, each able to execute some specific type of µops, and accessed through 8 ports. For instance:
- Arithmetic and bitwise instructions can execute on ports 0, 1 and 5 (and on port 6 for general purpose registers).
- Branches can execute on ports 0 and 6.
- Memory loads can execute on ports 2 and 3.
- Memory stores can execute on port 4.
Up to 4 general purpose arithmetic instructions can therefore be executed in each cycle. In a loop, however, this would not leave any free port for the branch to execute, since branches use ports 0 and 6, which are also used by arithmetic instructions.
The instructions are then removed from the pipeline in-order by the retiring unit. All resources allocated for them are freed, and faults and exceptions are handled at that stage.
We consider 3 different additions for our evaluation:
- a bitsliced ripple-carry adder (presented above). Three variants are used: 8-bit, 16-bit and 32-bit. Since their numbers of instructions are n*5, we could expect the 32-bit adder to run in twice as many cycles as the 16-bit adder, which itself would run in twice as many cycles as the 8-bit one. However, the larger the adder, the higher the register pressure. Our benchmark aims at quantifying this effect.
- an addition done with a single CPU instruction. For completeness, we consider the 8, 16 and 32-bit variants of the addition, all of which should have the same throughput and latency. On general purpose (GP) registers, only one addition can be done with a single instruction. However, SSE instructions allow us to perform 4 32-bit, 8 16-bit, or 16 8-bit additions with a single instruction, which increases throughput without changing latency.
- 3 independent additions done with 3 CPU instructions. Doing a single addition per cycle under-utilizes the superscalar capabilities of modern CPUs. Since up to 3 SSE additions (and 4 on GP registers, since port 6 of the CPU can execute GP arithmetic and bitwise instructions but not SIMD ones) can be done each cycle, this benchmark should show the maximum throughput achievable with native `add` instructions. Once again, we consider the 8, 16 and 32-bit variants, on both SSE and GP registers. A sketch of the SSE versions of the last two benchmarks is given after this list.
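For illustration, the 32-bit SSE variants of the last two benchmarks could look as follows (a sketch rather than the exact benchmarked code; the function names are ours):

```c
#include <emmintrin.h>

/* native_single: one 4x32-bit addition (a single paddd) per iteration.
   The loop-carried dependency on a limits it to 1 addition per cycle. */
__m128i native_single_32(__m128i a, __m128i b, long iterations) {
    for (long i = 0; i < iterations; i++)
        a = _mm_add_epi32(a, b);
    return a;
}

/* native_parallel: three independent 4x32-bit additions per iteration,
   which can all execute in the same cycle on ports 0, 1 and 5. */
void native_parallel_32(__m128i a[3], const __m128i b[3], long iterations) {
    for (long i = 0; i < iterations; i++) {
        a[0] = _mm_add_epi32(a[0], b[0]);
        a[1] = _mm_add_epi32(a[1], b[1]);
        a[2] = _mm_add_epi32(a[2], b[2]);
    }
}
```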
We implemented those additions in C, and put each of them in a loop of 2 billion iterations. We repeated this process 10 times, and report the average of the 10 runs in the next section. The standard deviation being less than 5 percent, we omit it to reduce clutter.
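A minimal sketch of such a measurement harness is shown below (hypothetical code: it uses `__rdtsc()`, which counts reference cycles, so frequency scaling must be disabled or accounted for to get core cycles):

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

#define ITERATIONS 2000000000L
#define RUNS 10

void benchmarked_loop(long iterations);   /* one of the additions above */

int main(void) {
    double cycles_per_iter = 0;
    for (int run = 0; run < RUNS; run++) {
        uint64_t start = __rdtsc();
        benchmarked_loop(ITERATIONS);
        uint64_t stop = __rdtsc();
        cycles_per_iter += (double)(stop - start) / ITERATIONS;
    }
    /* report the average over the 10 runs */
    printf("cycles/iteration: %.2f\n", cycles_per_iter / RUNS);
    return 0;
}
```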
We compiled our C code using Clang 7.0.0. We tried GCC 8.3.0 as well, but it struggles with instruction scheduling and register allocation, especially within loops, and thus generates sub-optimal code. Finally, we ran the benchmarks on an Intel Skylake i5-6500.
In the following table, we report the cycles per iteration and cycles per addition. `bitslice` is the bitsliced addition using a ripple-carry adder, `native_single` is the single native addition using a single instruction, and `native_parallel` is the three independent native additions.
| addition type | cycles/iteration (SSE / GP) | cycles/add (SSE / GP) |
|---|---|---|
| 8-bit bitslice | 14.77 / 16.28 | 0.12 / 0.51 |
| 8-bit native_single | 1.00 / 1.00 | 0.06 / 1.00 |
| 8-bit native_parallel | 1.00 / 1.00 | 0.02 / 0.33 |
| 16-bit bitslice | 33.28 / 34.52 | 0.26 / 1.08 |
| 16-bit native_single | 1.00 / 1.00 | 0.13 / 1.00 |
| 16-bit native_parallel | 1.00 / 1.00 | 0.04 / 0.33 |
| 32-bit bitslice | 74.55 / 73.33 | 0.58 / 2.29 |
| 32-bit native_single | 1.00 / 1.00 | 0.25 / 1.00 |
| 32-bit native_parallel | 1.00 / 1.00 | 0.08 / 0.33 |
### Single native additions (`native_single`)

According to Intel's manual, additions (both SSE and GP ones) should execute in one cycle. There should be no overhead from the loop: looping requires incrementing a counter and performing a conditional jump; those two instructions get macro-fused together and execute on port 0 or 6. The SSE addition can execute on either port 0, 1 or 5, and can thus be done at the same time as the loop increment/jump.
Experimentally, we get a throughput of 1 cycle/iteration, regardless of the size of the addition (8-bit, 16-bit or 32-bit). On GP registers, this means 1 cycle per addition. On SSE registers, however, a single `padd` instruction performs multiple additions, thus reducing the cost per addition: 0.25 cycles for 32-bit, 0.13 cycles for 16-bit and 0.06 cycles for 8-bit.
### Parallel native additions (`native_parallel`)

Since SSE additions can execute on either port 0, 1 or 5, three of them can be done in a single cycle, provided that they don't have dependencies (i.e. none uses the output of another one). The increment and jump of the loop fuse together and execute on port 6, independently of the additions.
Once again, this corresponds to the numbers we observe experimentally: 1 cycle/iteration regardless of the size of the additions. Since 3 `padd` instructions execute each cycle, the cost per addition is a third of `native_single`'s: 0.08 cycles for 32-bit, 0.04 for 16-bit and 0.02 for 8-bit.
### Bitsliced additions (`bitslice`)

Recall that the n-bit ripple-carry adder contains n full adders. In practice, 3 operations can be omitted from the first full adder, since its input carry is always 0. Likewise, 3 more operations can be saved in the last full adder, since its output carry is never used and does not need to be computed. We did not have to implement those optimizations by hand, since Clang is able to find them by itself. The total number of operations of this n-bit ripple-carry adder is therefore n*5-6. Those operations are solely bitwise instructions, which can execute on ports 0, 1 and 5 on SSE registers (and on port 6 as well on GP registers). This adder is therefore bound to run in at least (n*5-6)/3 cycles on SSE registers, and (n*5-6)/4 cycles on GP registers.
In practice, this bound can never be reached, because a full adder contains inner dependencies that prevent it from executing at a rate of 3 instructions per cycle. However, when computing consecutive full adders back to back, a full adder can start executing before the previous one is done, thus bypassing this dependency issue. The limiting factor then becomes the CPU scheduler, which cannot hold arbitrarily many µops, and thus limits out-of-order execution. The larger the adder, the more full adders (and thus µops) it contains, and the more the scheduler saturates.
The following graph shows the minimal theoretical speed ((n*5-6)/3 cycles) and the measured speed on SSE and AVX of ripple-carry adders depending on their sizes (from 4-bit to 64-bit):
Small adders (around 4-bit) almost reach their theoretical minimal speed. However, the larger the adders, the slower they get compared to their theoretical minimum. This is partly because of their inner dependencies, as mentioned above, but also because of spilling. Indeed, a bitsliced n-bit adder has n*2 inputs, n outputs, and n*5 temporary variables. Even though most of the temporary variables are short-lived, the register pressure remains high: only 16 SSE registers and 15 GP registers are available (since one register is always kept for the stack pointer). Unsurprisingly, the larger the adders, the more registers they need, and the more they suffer from spilling.
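For instance, the 32-bit adder manipulates 64 inputs and 32 outputs alone, far more than the 16 available SSE registers: most of these values have to live on the stack, and be loaded (and stored back) around the full adders that use them.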
Especially on small adders, the AVX adders are faster than the SSE ones, because the latter use destructive 2-operand instructions, while the former use non-destructive 3-operand instructions: an SSE bitwise instruction takes two registers and overwrites one of them with the result (e.g. an SSE `xor` computes `x ^= y`), while an AVX instruction writes its result to a third register (e.g. an AVX `xor` computes `x = y ^ z`). With SSE, some variables must therefore be saved (using a `mov`) before being overwritten with a new result. Even though these `mov`s do not necessarily compete with the bitwise instructions for ports 0, 1 and 5 (register-to-register moves can often be eliminated by the renamer), they introduce dependencies, and this is enough to slightly slow down the SSE adders compared to the AVX adders.
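To make this concrete, consider the first operation of the full adder, `t = a ^ b`, where `a` must stay live for the subsequent `a & b`. The assembly in the comments is illustrative (hypothetical register allocation; actual output depends on the compiler):

```c
#include <emmintrin.h>

/* t = a ^ b, where a is still needed afterwards (to compute a & b).
 *
 * SSE (destructive, 2-operand): the compiler must save a first:
 *     movdqa xmm2, xmm0      ; t = a
 *     pxor   xmm2, xmm1      ; t ^= b
 *
 * AVX (non-destructive, 3-operand): a single instruction suffices:
 *     vpxor  xmm2, xmm0, xmm1
 */
__m128i xor_keep_operands(__m128i a, __m128i b) {
    return _mm_xor_si128(a, b);
}
```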
Since the CPU can execute 4 GP bitwise instructions per cycle, but only 3 SSE ones (port 6 only has a general purpose ALU, not a vectorized one), the GP adders should be faster than the SSE ones. However, the SSE adders can use 16 registers, while the GP ones can only use 14: one register is reserved for the stack pointer, and another one holds the loop counter. This causes the GP adders to do more spilling than the SSE ones. Furthermore, the inner dependencies of the adders are the same regardless of the registers used: since the SSE adders already struggle to fully saturate 3 ports, the GP adders have an even harder time saturating 4. Overall, this causes the SSE and GP adders to have similar performance.
On general purpose registers, bitslicing can improve the speed of small additions: an 8-bit bitsliced adder is twice as fast as native `add` instructions. However, when SIMD extensions are available, bitslicing becomes 2 to 6 times slower than native instructions. Whether to use bitslicing on code containing additions should therefore depend on both the size of the additions and the target architecture.
Furthermore, the other parts of the ciphers should impact the decision: a cipher mixing additions and permutations might be a good target for bitslicing, while a cipher containing solely additions is unlikely to be a good candidate.
On the other hand, it would be pointless to consider bitslicing a multiplication: a multiplier requires at least n² instructions (whereas an adder only requires n*5).
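For a 32-bit multiplication, that is at least 32² = 1024 bitwise instructions, against 32*5 = 160 for the adder.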