mslicing

Published on 2020-01-16 by Darius Mercadier

As shown in the previous post, bitslicing can produce very high-throughput cipher implementations through data-parallelism, but suffers from a few limitations:

it requires a lot of independent inputs to be efficient, since high throughput is achieved by encrypting different inputs in parallel.
it puts a lot of pressure on the registers, which causes spilling (moving data back-and-forth between registers and memory), thus reducing performances.
it cannot efficiently implement ciphers which rely heavily on arithmetic operations, like Chacha20 [1] and Threefish [2].

To overcome the first two issues, Kasper & Schwabe [3] proposed a “byte-sliced” implementation of AES. Typical bitslicling would represent the 128-bit block as 1 bit in 128 registers. Instead, they represent the 128-bit block as 16 bits in 8 registers. Using only 8 registers greatly reduces register pressure and allows AES to be implemented without any spilling. Furthermore, this representation requires less inputs to fill the registers and thus maximize throughput: on AES, only 8 parallel inputs are required to fill up 128-bit SSE registers (versus 128 with bitslicing).

We define m-slicing as a generalization of bitslicing, where a n-bit input is split into k bits in m registers (such that k * m = n). The special case m = 1 corresponds to bitslicing. When k (the number of bits in each register) is greater than one, there are two possible ways to arrange the bits within each register: they can either be stored contiguously, or they can be splitted. The first option, which we call vslicing, will enable the use of packed instructions (e.g. 16-bit addition), while the second option, which we call hslicing, will enable the use of permutation instructions (e.g. pshufb). This post will explain in detail how each technique works, and how they can be used to improve on bitslicing.

Vertical slicing

Rather than considering the bit as the atomic unit of computation, one may use m-bit words as such basis (m being a word size supported by the SIMD architecture). On Intel (SSE/AVX/AVX512), m could for example be 8, 16, 32, or 64. We can then exploit vertical m-bit SIMD instructions to perform logic as well as arithmetic in parallel. If we visually represent registers as aggregations of m-bit elements vertically stacked, vertical operations consist in element-wise computations along this vertical direction. Such operations include additions, multiplications, shifts, rotations, as well as traditional bitwise operations. For instance, the instruction paddb takes two SSE registers, each seen as 16 8-bit stacked elements, and computes the 16 8-bit additions between each one of their elements:

Similarly, pslld takes an SSE register, seen as 4 32-bit stacked elements, and an integer n, and left-shifts each element of the SSE register by n:

Note that two m-bit elements of the same register never interact together: there is no possbility to add them or to xor them for instance.

This technique, called vertical slicing (or vslicing for short), is similar to the notion of vectorization in the compilation world. Compared to bitslicing, it puts less pressure on registers, and requires less parallel data to fill the registers. Furthermore, it can be used to implement arithmetic-based ciphers (like Chacha20), which cannot be done efficiently in bitslicing.

However, vslicing is weaker than bitslicing when it comes to permutations. Applying an arbitrary bit-permutation to the whole block needs to be done manually using shifts and masks to extracts bits from the registers and recombine them. By comparison, this could be done at compile time with bitslicing.

Vsliced Rectangle

The Rectangle cipher [4] (presented earlier) encrypts a 64-bit input. We can split it into 4 16-bit elements, and store them in the first 16-bit packed elements of 4 SSE registers:

Then, applying the same principle as bitslicing, we can fill the remaining empty elements of those SSE registers with independent inputs. The second input would go in the second packed elements of each register:

And so on until the registers are full (in the case of Rectangle on 128-bit SSE registers, this is achieved with 8 inputs).

Vslice Rectangle can be efficiently computed in parallel:

The Sbox is a sequence of bitwise and, xor, not and or between each of those SSE registers.
The permutation can be written as three left-rotations, each one operating on a different SSE registers. While the SSE extension does not offer a rotation instruction, it can be emulated using two shifts and an or. This is slightly more expensive than the bitsliced implementation of Rectangle, which can perform this permutation at compile time. However, the reduced register pressure more than compensate for this loss.

Horizontal slicing

Rather than considering an m-bit atom as a single packed element, we may also dispatch its m bits into m distinct packed elements, assuming that m is less or equal to the number of packed elements of the architecture. Using this representation, we lose the ability to perform arithmetic operations but gain the ability to perform arbitrary shuffles of our m-bit word in a single instruction (for instance using pshufb on SSE registers). This style relies on horizontal SIMD operations: we can perform element-wise computations within a single register, i.e. along an horizontal direction.

We call this technique horizontal slicing, or hslicing for short. Like vslicing, it lowers register pressure compared to bitslicing, and requires few inputs to fill the registers. While vslicing shines where arithmetic operations are needed, hslicing is especially useful to implement ciphers that rely on intra-register permutations, since those can be performed in a single shuffle instruction. This includes Rectangle’s permutations, which can be seen as left-rotations, or Piccolo’s permutation which operates on 8 bits [5].

Inter-register permutations however, will be very expensive, as they will require manually extracting bits from different registers and recombining them.

Hsliced Rectangle

On Rectangle, the 64-bit input is again seen as 4 times 16 bits, but this time, each bit goes to a distinct packed element:

Once again, there are unused bits in the registers, which can be filled with subsequent inputs. The second input would go into the second bits of each packed element:

Like for vslicing, this is repeated until the registers are full, which in the case of Rectangle on SSE requires a total of 8 inputs. Numbering the bits from 0 (most significant) to 63 (least significant) can help visualize hslicing’s data layout:

Here, the 8-bit element labelled 0 contains the first bit of each of the 8 inputs.

Rectangle Sbox, composed solely of bitwise instructions, can be computed using the same instruction as the vsliced implementation. The permutation however, can be done even more efficiently than in vslicing, using shuffle instructions. For instance, the first left-rotation can be done with the following shuffle:

Bitslicing vs vslicing vs hslicing

Terminology. Whenever the slicing direction (i.e. vertical or horizontal) is unimportant, we talk about m-slicing (assuming m > 1) and we call slicing the technique encompassing both bitslicing and m-slicing.

The following picture summarizes all slicing forms:

All those slicing forms share the same basic properties: they are constant-time, they rely on data-parallelism to increase throughput, and they use some sort of tranposition to convert the data into a layout suitable to their model.

However, each slicing form has its own strenghts and weaknesses:

transposing data to hslicing or vslicing usually has a cost almost negligible compared to a full cipher, while a bitslice transposition is much more expensive.
bitslicing introduces a lot of spilling, thus reducing its potential performances, unlike hslicing and vslicing.
only vslicing is a viable option on ciphers using arithmetic operations, since it can use CPU SIMD arithmetic instructions. As shown in the previous post, trying implement those instructions in bitslicing (and hslicing) does not yield good performances.
bitslicing is much better at doing permutations than both hslicing and vslicing, since it can do any permutation at compile-time. Intra-register permutations can still be done efficiently (but at run time) in hslicing using SIMD shuffle instructions. Vslicing will only be a reasonable option in the absence of permutations, or when permutations can be easily expressed using rotations or shifts (like Rectangle’s permutation for instance, which can be written as 3 rotations).
bitslicing requires much more data to provide a high-throughput than hslicing and vslicing. On Rectangle for instance, 8 independent inputs are required to maximize the throughput on SSE registers using mslicing, while 128 inputs are required when using bitslicing.
both vslicing and hslicing rely on vector instructions, and are thus only available on CPUs with SIMD extensions, while bitslicing does not require any hardware-specific.

The choice between bitslicing, vslicing and hslicing thus depends on both the cipher and the target architecture. For instance, on CPUs with SIMD extensions, the fastest implementations of DES are bitsliced, the fastest implementations of Chacha20 are vsliced, while the fastest AES implementations are hsliced. Even for a given cipher, some slicings might be faster depending on the architecture: on X86-64, the fastest implementation of Rectangle is bitsliced, while on AVX, vslicing is ahead of both bitslicing and hslicing. If we exclude the cost of transposing the data, then hsliced and vsliced implementations of Rectangle have the same performances.

Manually estimating which slicing will be the most efficient for a given cipher can often be complicated, especially on complex CPU architectures. Instead, Usuba allows to write codes which are polymorphic on the slicing types (when possible), making it easy to switch from a slicing form to another. The developper can thus run actual sliced implementations rather than guess which one would be faster.

References

[1] D. J. Bernstein, Chacha, a variant of Salsa20, 2008.

[2] N. Ferguson et al., The Skein Hash Function Family, 2010.

[3] E. Käsper, P. Schwabe, Faster and Timing-Attack Resistant AES-GCM, CHES, 2009.

[4] W. Zhang et al, RECTANGLE: A Bit-slice Lightweight Block Cipher Suitable for Multiple Platforms, 2014.

[5] K. Shibutani et al., Piccolo: An Ultra-Lightweight Blockcipher, CHES, 2011.