CPUs leak information through timing [49], cache accesses [52,53,54], power consumption [48,49,42], electromagnetic radiation [45,46,47] and other compromising emanations [50,51]. One way to recover this leaked information is to mount a Simple Power Analysis (SPA) or a Differential Power Analysis (DPA) attack [23], which analyzes power or EM traces from a cipher's execution to recover the key and plaintext.
Several approaches to mitigate SPA and DPA have been proposed: inserting dummy instructions, computing on both data and their complements, using dedicated hardware modules to randomize power consumption [40], or designing hardware gates whose power consumption is independent of their inputs [32,41]. The most popular countermeasure against SPA and DPA, called masking, is however based on secret sharing [24,25]. Unlike some of the countermeasures mentioned above, masking does not require custom hardware and offers provable security: given an attacker model, one can prove that a masked code does not leak any secret data.
The first masking schemes were proposed by Chari et al. [38] and Goubin and Patarin [39]. The principle of masking is to split each intermediate value b of a cipher into k random values r1, ..., rk (called shares) such that b = r1 ∗ r2 ∗ ... ∗ rk for some group operation ∗. The most commonly used operation is the exclusive or (Boolean masking), but modular addition or modular multiplication can also be used (arithmetic masking) [28,26].
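To make this concrete, here is a minimal C sketch of Boolean masking with the exclusive or (our own illustration, not code from any particular library), assuming a get_random() source of uniform 32-bit random words: the first k-1 shares are drawn at random, and the last one is chosen so that the xor of all shares equals the secret.

#include <stdint.h>

#define NB_SHARES 4                    /* k shares, i.e. masking order 3 */

extern uint32_t get_random(void);      /* assumed source of uniform randomness */

/* Split a secret b into NB_SHARES shares whose xor equals b. */
void mask(uint32_t b, uint32_t shares[NB_SHARES]) {
    uint32_t acc = 0;
    for (int i = 0; i < NB_SHARES - 1; i++) {
        shares[i] = get_random();      /* r1 ... r(k-1) are fresh random words */
        acc ^= shares[i];
    }
    shares[NB_SHARES - 1] = b ^ acc;   /* rk is chosen so that r1 ^ ... ^ rk == b */
}

/* Recombine the shares; only ever done once the computation is over. */
uint32_t unmask(const uint32_t shares[NB_SHARES]) {
    uint32_t b = 0;
    for (int i = 0; i < NB_SHARES; i++)
        b ^= shares[i];
    return b;
}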
A masked implementation using 2 shares is said to be first-order masked. Early masked implementations were only first-order masked [28,26,37,27], which is sufficient to protect against first-order DPA. However, higher-order DPA (HO-DPA) attacks can combine multiple sources of leakage to reconstruct the secrets, and can thus break first-order masked implementations [23,42,55,56]. Several masked implementations using more than two shares (higher-order masking) were thus proposed to resist HO-DPA [29,63,65,66,68]. The term masking order refers to the number of shares minus one. Similarly, an attack is said to be of order t if it combines t sources of leakage.
The probing model is widely used to analyze the security of masked software implementations against side-channel attacks. This model was introduced by Ishai, Sahai and Wagner in [29] to construct circuits resistant to hardware probing attacks. It was later shown [63,68,64] that this model and the underlying constructions were instrumental to the design of efficient, secure masked cryptographic implementations. A masking scheme secure against a t-probing adversary, i.e., an adversary who can probe t arbitrary variables during the computation, is indeed secure by design against the class of side-channel attacks of order t [91].
Most masking schemes consider the cipher as a Boolean or arithmetic circuit. Individual gates of the circuit are replaced by gadgets that process masked variables. One of the important contributions of [29] was to propose a multiplication gadget secure against t-probing attacks for any t, based on a Boolean masking of order n = 2t+1. This was reduced to the tight order n = t+1 in [66] by constraining the two input sharings to be independent, which can be ensured by applying a mask-refreshing gadget when necessary. The design of secure refresh gadgets and, more generally, the secure composition of gadgets were subsequently the subject of many works [64,87,92]. Of particular interest, the notions of Non-Interference (NI) and Strong Non-Interference (SNI) introduced in [87] provide a practical framework for the secure composition of gadgets that yields tight probing-secure implementations. In a nutshell, such implementations are composed of ISW multiplication and refresh gadgets (named after their inventors Ishai, Sahai, and Wagner [29]) achieving the SNI property, and of sharewise addition gadgets. The main difficulty consists in identifying the minimal number of refresh gadgets and their optimal placement in the implementation to obtain provable t-probing security.
Terminology. In the context of a Boolean circuit, multiplications correspond to and and or gates, while additions correspond to xor gates. Multiplications (and therefore and and or) are said to be non-linear, while additions (and therefore xor) are said to be linear.
Belaïd et al. [72] developed a tool named tightPROVE, which is able to verify the t-probing security of a masked implementation at any order t. This verification is done in the bit-probing model, meaning that each probe retrieves a single bit of data; it is thus limited to programs manipulating 1-bit variables. Raphaël Wintersdorff, Sonia Belaïd, and Matthieu Rivain recently developed tightPROVE+, an extension of tightPROVE to the register-probing model, where variables can hold an arbitrary number of bits, all of which leak simultaneously. Furthermore, tightPROVE+ automatically inserts refreshes when needed, whereas tightPROVE was only able to detect that some refreshes were missing. We worked with them to integrate tightPROVE+ with Usuba [90]. The resulting tool is called Tornado. Tornado generates provably t-probing secure masked implementations in both the bit-probing and the register-probing model.
Given a high-level description of a cryptographic primitive, Tornado synthesizes a masked implementation using ISW-based multiplication and refresh gadgets [1], and sharewise addition gadgets. The key role of Usuba is to automate the generation of a sliced implementation, upon which tightPROVE+ can then verify bit-probing or register-probing security, or suggest the necessary refreshes. By integrating both tools, we derive a probing-secure masked implementation from a high-level description of a cipher.
The overall architecture of Tornado is shown below:
After normalization and optimization, the Usuba0 program is sent to tightPROVE+, which adds refreshes if necessary. The output of tightPROVE+ is translated back to Usuba0, which Usubac then masks: variables are replaced with arrays of shares, linear operators and non-linear operators involving constants are applied share-wise, and other non-linear operators and refreshes are left as is to be replaced by masked gadgets during C code generation. Before generating C code, a pass of loop fusion merges share-wise applications of linear operators when possible.
Tornado targets low-end embedded devices with drastic memory constraints. It is therefore essential to produce C code that uses loops and functions in order to reduce code size, especially when masking at high orders. This is one of the main motivations for keeping loops and nodes in the Usuba pipeline, despite their semantics being defined through macro-expansion to straightforward equations.
However, tightPROVE+ only supports straight-line programs, with no notion of loops or functions. Usubac thus fully unrolls and inlines the Usuba0 code before interacting with tightPROVE+. The refreshes inserted by tightPROVE+ must then be propagated back into the Usuba0 program, which features loops, nodes and node calls.
To do so, when inlining and unrolling, we keep track of the origin of each instruction: which node it comes from, and which instruction in that node. For each refresh inserted by tightPROVE+, we therefore know where (i.e., which node and which loop, if any) the refreshed variable comes from, and we can insert the refresh directly in the right node.
To help tightPROVE+ find where to insert refreshes, a first pass sends each node (before inlining) to tightPROVE+ for verification. Those nodes are smaller than the fully inlined program, so tightPROVE+ detects the required refreshes faster. Only then are all nodes inlined and the whole program sent to tightPROVE+. This is useful for instance on Ace, which contains the following node:
node f(x:u32) returns (y:u32)
let
    y = ((x <<< 5) & x) ^ (x <<< 1)
tel
This node is obviously vulnerable to probing attacks, and tightPROVE+ is able to insert a refresh on one of the x to make it probing-secure. Once the secure version of f is inlined in Ace, no other refreshes are needed.
Since tightPROVE+ often takes hours or even days to verify a whole cipher [3], Usubac uses a cache to avoid calling tightPROVE+ multiple times on the same circuit.
To mask an Usuba0 program, Usubac replaces each variable with an array of shares, and each operator with a masked gadget. The gadget masking a linear operator (xor) is simply a loop applying the operator on each share, written directly in Usuba. To mask a negation, we only negate the first share (since ~(r0 ^ r1 ^ ... ^ rk) == ~r0 ^ r1 ^ ... ^ rk), still in Usuba. To mask non-linear operators (and and or), however, we introduce the operators m& and m| in Usuba, which are transformed into calls to masked C gadgets when generating C code. In particular, for the multiplication (and), we use the so-called ISW gadget introduced in [1]:
static void isw_mult(uint32_t *res,
                     const uint32_t *op1,
                     const uint32_t *op2) {
    // Initialize the output sharing.
    for (int i = 0; i <= MASKING_ORDER; i++)
        res[i] = 0;

    for (int i = 0; i <= MASKING_ORDER; i++) {
        res[i] ^= op1[i] & op2[i];
        for (int j = i+1; j <= MASKING_ORDER; j++) {
            // A fresh random word masks each pair of cross products
            // op1[i] & op2[j] and op1[j] & op2[i].
            uint32_t rnd = get_random();
            res[i] ^= rnd;
            res[j] ^= (rnd ^ (op1[i] & op2[j])) ^ (op1[j] & op2[i]);
        }
    }
}
Note that this algorithm has a quadratic complexity in the number of shares. or gates are transformed into not and and, since a | b == ~(~a & ~b). Similarly, refreshes, whether inserted manually by the developer or automatically by tightPROVE+, are compiled into calls to the ISW refresh routine when generating C code:
static void isw_refresh(uint32_t *res,
                        const uint32_t *in) {
    // Copy the input sharing.
    for (int i = 0; i <= MASKING_ORDER; i++)
        res[i] = in[i];

    // Re-randomize: each random word is xored into two shares,
    // so the encoded value is unchanged.
    for (int i = 0; i <= MASKING_ORDER; i++) {
        for (int j = i+1; j <= MASKING_ORDER; j++) {
            uint32_t rnd = get_random();
            res[i] ^= rnd;
            res[j] ^= rnd;
        }
    }
}
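As mentioned above, or is reduced to not and and (a | b == ~(~a & ~b)), and negating a Boolean sharing only requires negating its first share. For illustration, the resulting masked or could be written as the following sketch built on isw_mult (the name masked_or and the in-place handling of the inputs are ours, not necessarily what Tornado generates):

static void masked_or(uint32_t *res, uint32_t *op1, uint32_t *op2) {
    op1[0] = ~op1[0];            /* ~a: negate the first share only */
    op2[0] = ~op2[0];            /* ~b                              */
    isw_mult(res, op1, op2);     /* ~a & ~b, using the ISW gadget   */
    res[0] = ~res[0];            /* ~(~a & ~b) == a | b             */
    op1[0] = ~op1[0];            /* restore the input sharings      */
    op2[0] = ~op2[0];
}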
Constants are not secret values and thus do not need to be shared. Furthermore, when multiplying a constant with a secret value, there is no need to use the ISW multiplication gadget: we can simply multiply each share of the secret value with the constant. The cost of masking a multiplication by a constant is thus linear in the number of shares, rather than quadratic. Our benchmarks (see the section “Evaluation” below) confirm that the more masked multiplications a cipher contains, the slower it is; avoiding unnecessary masked multiplications is thus essential.
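A sharewise multiplication by a public constant could thus look like the following sketch (the name const_mult is ours); no randomness is needed, and the cost is a single and per share:

static void const_mult(uint32_t *res,
                       const uint32_t *shares,   /* sharing of the secret value */
                       uint32_t constant) {      /* public, unshared constant   */
    /* and distributes over xor, so the constant needs no masking */
    for (int i = 0; i <= MASKING_ORDER; i++)
        res[i] = shares[i] & constant;
}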
For instance, Pyjamask’s linear layer contains a matrix multiplication, implemented by calling 4 times the following Usuba node, which multiplies two 32-bit vectors a and b:
node col_mult (a:b32,b:b32) returns (r:b1)
vars acc:b32
let
    acc[0] = a[0] & b[0];
    forall i in [1,31] {
        acc[i] = acc[i-1] ^ (a[i] & b[i])
    }
    r = acc[31]
tel
When this node is called in Pyjamask, b is always constant, and the multiplication (a[i] & b[i]) therefore does not need to be masked. Usubac uses a simple inference algorithm to track which variables are constant, and is thus able to identify that b does not need to be shared and that the multiplication a[i] & b[i] does not need to be masked. This optimization both reduces stack usage (since constant variables are kept as a single share rather than an array of shares) and improves performance (since multiplying by a constant becomes linear rather than quadratic).
Since each linear operation is replaced with a loop applying the operation on each share, the masked code contains a lot of loops, whose overhead is detrimental to performance. Consider for instance the following Usuba0 snippet:
x = a ^ b;
y = c ^ d;
z = x ^ y;
After masking, it becomes:
forall i in [0, MASKING_ORDER-1] {
    x[i] = a[i] ^ b[i];
}
forall i in [0, MASKING_ORDER-1] {
    y[i] = c[i] ^ d[i];
}
forall i in [0, MASKING_ORDER-1] {
    z[i] = x[i] ^ y[i];
}
In order to reduce the overhead of looping over each share for linear operations, we aggressively perform loop fusion (also called loop jamming) in Usubac, which consists in replacing multiple loops with a single one. The above snippet is thus optimized to:
forall i in [0, MASKING_ORDER-1] {
    x[i] = a[i] ^ b[i];
    y[i] = c[i] ^ d[i];
    z[i] = x[i] ^ y[i];
}
Loop fusion also reduces stack usage by allowing Usubac to replace temporary arrays with temporary variables. In the example above, Usubac would thus convert x and y into scalars rather than arrays and produce the following code:
forall i in [0, MASKING_ORDER-1] {
    x = a[i] ^ b[i];
    y = c[i] ^ d[i];
    z[i] = x ^ y;
}
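In the generated C code, this fused loop would look roughly like the following sketch (variable names are illustrative and reuse the uint32_t convention of the gadgets above):

for (int i = 0; i < MASKING_ORDER; i++) {    /* one iteration per share */
    uint32_t x = a[i] ^ b[i];                /* x and y now live in registers */
    uint32_t y = c[i] ^ d[i];                /* rather than in stack-allocated arrays */
    z[i] = x ^ y;
}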
On the embedded devices targeted by Tornado, this reduction in stack usage is especially beneficial, since the amount of stack available is very limited (e.g., a few dozen kilobytes).
C compilers also perform loop fusion but, experimentally, they are less aggressive than Usubac, and we obtain better performance by fusing loops directly in Usubac. On the 11 ciphers of the NIST competition we implemented in Usuba and compiled with Tornado, performing loop fusion in Usubac reduces the stack usage of our bitsliced implementations by 11kB on average, and saves on average 3kB of stack for our msliced implementations (note that our platform offers a measly 96kB of SRAM). It also positively impacts performance, with a 16% average speedup for bitslicing and a 21% average speedup for n-slicing.
We used Tornado to compare 11 cryptographic primitives from the second round of the ongoing NIST lightweight cryptography competition (LWC). The choice of cryptographic primitives was made on the basis that they were self-identified as being amenable to masking. We stress that we do not focus on the full authenticated encryption, message authentication, or hash protocols but on the underlying primitives, mostly block ciphers and permutations. The complete list of primitives follows:
| submissions | primitive | msliceable | slice size |
|---|---|---|---|
| block ciphers | | | |
| Gift-Cofb [7], Hyena [17], Sundae-Gift [18] | Gift-128 | ✔ | 32 |
| Pyjamask [10] | Pyjamask-128 | ✔ | 32 |
| Skinny [11], Romulus [19] | Skinny-128-256 | ✖ | - |
| Spook [6] | Clyde-128 | ✔ | 32 |
| permutations | | | |
| Ace [4] | Ace | ✔ | 32 |
| Ascon [5] | p12 | ✔ | 64 |
| Elephant [13,12] | Spongent-π[160] | ✖ | - |
| Gimli [8] | Gimli-36 | ✔ | 32 |
| Orange [20], Photon-Beetle [9] | Photon-256 | ✖ | - |
| Xoodyak [15,16] | Xoodoo | ✔ | 32 |
| others | | | |
| Subterranean [14] | blank(8) | ✖ | - |
Whenever possible, we generate both a bitsliced and an n-sliced implementation of each primitive, which allows us to exercise both the bit-probing and the register-probing models of tightPROVE+. However, 4 primitives do not admit a straightforward n-sliced implementation. The Subterranean permutation involves a significant amount of bit-twiddling across its 257-bit state, which makes it a resolutely bitsliced primitive (as confirmed by its reference implementation). Photon, Skinny and Spongent rely on lookup tables that would be too expensive to emulate in n-sliced mode. In bitslicing, these tables are simply implemented by their Boolean circuit, either provided by the authors (Photon, Skinny) or generated through SAT [2] with the objective of minimizing multiplicative complexity (Spongent, with 4 ANDs and 28 XORs).
Note that the n-sliced implementations, when they exist, are either 32-sliced or 64-sliced. This means in particular that, unlike bitslicing, which allows processing multiple blocks in parallel, these implementations process a single block at a time on our 32-bit Cortex-M4. Note also that in practice, all n-sliced implementations are vsliced, since the Cortex-M4 we consider does not provide the SIMD extensions required for hslicing.
Running tightPROVE+ on our Usuba implementations showed that all of them were t-probing secure in the bit-probing model, but 3 were not secure in the register-probing model: Ace required 384 additional refreshes, Clyde-128 required 6, and Gimli required 120. In the case of Ace and Clyde, the number of refreshes inserted by tightPROVE+ is known to be minimal, while for Gimli it is only an upper bound [3].
In the following, we benchmark our implementations of the NIST submissions (written in Usuba and compiled by Tornado) against the reference implementations provided by the contestants. This establishes a performance baseline (without masking), providing a common frame of reference for the performance of these primitives based on the implementations synthesized from Usuba. In doing so, we must bear in mind that the reference implementations provided by the NIST contestants are of varying quality: some appear to have been finely tuned for performance while others focus on simplicity, acting as an executable specification.
Several candidates of the NIST LWC do not provide implementations for ARM devices. However, since they all provide a reference (generic) C implementation, we first benchmarked our implementations against the reference implementations on an Intel i5-6500 @ 3.20GHz, running Linux 4.15.0-54. The implementations were compiled with Clang 7.0.0 with flags -O3 -fno-slp-vectorize -fno-vectorize. These flags prevent Clang from trying to produce vectorized code, which would artificially advantage some implementations at the expense of others because of brittle, hard-to-predict vectorization heuristics. Besides, vectorized instructions remain an exception in the setting of embedded devices (e.g., Cortex-M). With the exception of Subterranean (which is bitsliced), the reference implementations follow an msliced implementation pattern, representing the state of the primitive through a matrix of 32-bit values (64-bit in the case of Ascon). To evaluate bitsliced implementations, we simulate a 32-bit architecture, meaning that the throughput we report corresponds to the parallel encryption of 32 independent blocks. The cost of transposing data into a bitsliced format (around 9 cycles/byte to transpose a 32x32 matrix) is excluded. The results are shown in the following table:
Performance in cycles/byte (lower is better):

| Primitive | Usuba n-slice | Usuba bitslice | reference |
|---|---|---|---|
| Ace | 34.25 | 55.89 | 273.53 |
| Ascon | 9.84 | 4.94 | 5.18 |
| Clyde | 33.72 | 21.99 | 37.69 |
| Gimli | 15.77 | 5.80 | 44.35 |
| Gift | 565.30 | 45.51 | 517.27 |
| Photon | - | 44.88 | 214.47 |
| Pyjamask | 246.72 | 131.33 | 267.35 |
| Skinny | - | 46.87 | 207.82 |
| Spongent | - | 146.93 | 4824.97 |
| Subterranean | - | 17.64 | 355.38 |
| Xoodoo | 14.93 | 6.47 | 10.14 |
We notice that Usuba often delivers performance on par with or better than the reference implementations. This does not come at the expense of readability: our Usuba implementations are written in a high-level language, which is amenable to formal reasoning thanks to its straightforward semantic model. The reference implementations of Skinny and Photon use lookup tables, which do not admit a straightforward constant-time implementation. As a result, we are unable to implement a constant-time msliced version in Usuba, and thus cannot mask such an implementation.
We now turn our attention specifically to a few implementations that exhibit interesting performance.
The reference implementation of Subterranean is an order of magnitude slower than Usuba’s because it is bit-oriented (each bit is stored in a distinct 8-bit variable) yet encrypts only a single block at a time. Switching to 32-bit variables and encrypting 32 blocks in parallel, as Usuba does, significantly improves throughput.
The reference implementation of Spongent is slowed down by a prohibitively expensive 160-bit bit-permutation, spread across 20 8-bit variables. Thanks to bitslicing, Usuba turns this permutation into a static renaming of variables, which occurs at compile time.
On Ascon, our n-sliced implementation is twice as slow as the reference implementation. Unlike the reference implementation, we have refrained from performing aggressive function inlining and loop unrolling to keep code size in check, since we target embedded systems. However, if we instruct the Usuba compiler to perform these optimizations, the performance of our n-sliced implementation is on par with the reference one.
Ace’s reference implementation suffers from significant performance issues, relying on an excessive number of temporary variables to store intermediate results.
Gimli offers two reference implementations: a high-performance SSE implementation, and an executable specification using general-purpose registers. We chose the general-purpose one here (which has not been subjected to the same level of optimization), because our target architecture (Cortex-M) does not provide a vectorized instruction set.
Finally, Gift’s n-sliced implementation suffers from severe performance issues because of its expensive linear layer. Using the recent fixslicing technique [21] brings the performance of Usuba’s n-sliced Gift implementation down to 42 cycles/byte.
We ran our benchmarks on a Nucleo STM32F401RE offering an Arm Cortex-M4 with 512 Kbytes of Flash memory and 96 Kbytes of SRAM. We used the GNU C compiler arm-none-eabi-gcc version 9.2.0 at optimization level -O3. We considered two modes regarding the Random Number Generator (RNG):
Pooling mode: The RNG generates random numbers at a rate of 32 bits every 64 clock cycles. Fetching a random number can thus take up to 65 clock cycles.
Fast mode: The RNG has a production rate faster than the consumption rate of the program. The RNG routine thus can simply read a register containing this 32-bit random word without checking for its availability.
Those two modes were chosen because they are the ones used in the Pyjamask submission, which is the only submission addressing the issue of getting random numbers for a masked implementation.
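For illustration, get_random() could be implemented along the following lines. This is a sketch only: the function names are ours, and the register and flag names (RNG->SR, RNG->DR, RNG_SR_DRDY) follow the STM32 CMSIS convention, assuming the device header is included.

/* Pooling mode: wait for the data-ready flag before reading a fresh
   32-bit word, which can take up to ~65 cycles. */
uint32_t get_random_pooling(void) {
    while (!(RNG->SR & RNG_SR_DRDY))
        ;                          /* busy-wait on the TRNG */
    return RNG->DR;
}

/* Fast mode: the TRNG outpaces the program, so the data register
   is read directly, without checking for availability. */
uint32_t get_random_fast(void) {
    return RNG->DR;
}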
The table below gives the performance of the n-sliced implementations produced by Tornado in cycles per block. Note that this does not reflect performance in cycles/byte, since these ciphers have different block sizes. For instance, encrypting a single block with Xoodoo is more than twice as slow as with Clyde (at any given order), but since Xoodoo’s block is 384 bits while Clyde’s is only 128 bits, Xoodoo’s performance in cycles/byte is better than Clyde’s: at order 31 with the fast RNG, Xoodoo takes 1.89m cycles for a 48-byte block (about 39k cycles/byte) against 670.08k cycles for a 16-byte block for Clyde (about 42k cycles/byte). Nevertheless, we focus here on performance per block, and analyze performance per byte in the next section, when considering bitsliced implementations.
Performance in cycles (lower is better), at masking order d:

| primitive | mults | RNG mode | d = 0 | d = 3 | d = 7 | d = 15 | d = 31 | d = 63 | d = 127 |
|---|---|---|---|---|---|---|---|---|---|
| Clyde | 48 | fast | 1.47k | 15.38k | 56.48k | 189.44k | 670.08k | 2.58m | 10.45m |
| | | pooling | 1.47k | 30.08k | 121.08k | 502.88k | 1.94m | 7.73m | 29.92m |
| Pyjamask | 56 | fast | 15.90k | 79.52k | 205.44k | 614.40k | 1.73m | 4.75m | 15.20m |
| | | pooling | 15.90k | 94.88k | 274.56k | 954.56k | 3.10m | 10.34m | 36.32m |
| Ascon | 60 | fast | 1.96k | 42.00k | 123.20k | 464.40k | 1.70m | 6.52m | 25.60m |
| | | pooling | 1.96k | 53.60k | 182.80k | 821.60k | 3.17m | 12.96m | 52.00m |
| Xoodoo | 144 | fast | 3.02k | 42.67k | 156.48k | 520.32k | 1.89m | 6.86m | 26.64m |
| | | pooling | 3.02k | 82.08k | 334.08k | 1.40m | 5.42m | 21.50m | 83.04m |
| Gift | 160 | fast | 17.92k | 200.48k | 516.32k | 1.24m | 6.45m | 13.10m | 42.24m |
| | | pooling | 17.92k | 244.32k | 714.88k | 2.21m | 8.51m | 29.12m | 102.40m |
| Gimli | 288 | fast | 2.69k | 84.96k | 342.72k | 1.19m | 4.57m | 17.09m | 67.20m |
| | | pooling | 2.69k | 190.56k | 832.80k | 3.52m | 14.06m | 56.16m | 218.88m |
| Ace | 384 | fast | 3.68k | 155.20k | 531.60k | 1.60m | 7.60m | 29.84m | 113.60m |
| | | pooling | 3.68k | 302.00k | 1.32m | 4.56m | 19.80m | 78.40m | 310.80m |
Since masking a multiplication has a quadratic cost in the number of shares, we expect performance at high orders to be mostly proportional to the number of multiplications used by the primitives. We thus report the number of multiplications in our implementations of the primitives. This is confirmed by our results with 128 shares (on the Cortex-M4). The effect is less pronounced at small orders, where execution time remains dominated by linear operations. Using the pooling RNG increases the cost of multiplications compared to the fast RNG, so performance becomes proportional to the number of multiplications at smaller orders than with the fast RNG.
Pyjamask illustrates the influence of the number of multiplications on scaling. Because of its dense binary matrix multiplications, it involves a significant number of linear operations for only a few masked multiplications. As a result, it is slower than Xoodoo at order 3, despite the fact that Xoodoo uses 2.5× more multiplications. With the fast RNG, the inflection point is reached at order 31, beyond which Pyjamask’s relative performance keeps improving. Similarly, compared to Clyde, Pyjamask goes from 5× slower at order 3 to 50% slower at order 127 with the fast RNG, and 20% slower at order 127 with the pooling RNG. The same analysis applies to Gift and Gimli, where Gift’s linear overhead is only dominated at order 7 with the pooling RNG and at order 63 with the fast RNG.
One notable exception is Ascon with the fast RNG, compared in particular to Clyde and Pyjamask. Although Ascon uses only slightly more multiplications, its implementation is 64-sliced, unlike its counterparts, which are 32-sliced. Running it on our 32-bit Cortex-M4 requires GCC to generate 64-bit emulation code, which induces a significant overhead and prevents further optimization by the compiler. Regardless of the RNG used, Ascon is thus significantly slower than Clyde (at order 3 and beyond) and Pyjamask (at order 63 and beyond).
The key limiting factor for executing bitsliced code on an embedded device is the amount of available memory. Bitsliced programs tend to be large and to consume a significant amount of stack. Masking such implementations at high orders quickly becomes impractical because of the quadratic growth of stack usage.
To reduce stack usage and allow us to explore high masking orders, our bitsliced programs manipulate 8-bit variables, meaning that 8 independent blocks are processed in parallel. This trades performance for memory: using 32-bit variables would have improved throughput by a factor of 4, but would have put an unbearable amount of pressure on the stack, preventing us from considering masking orders beyond 7. Besides, it is not clear whether there is a use-case for such a massively parallel (32 independent blocks) encryption primitive in a lightweight setting. As a result of this compilation strategy, we have been able to mask all primitives with up to 16 shares and, additionally, to reach 32 shares for Photon, Skinny, Spongent and Subterranean.
Bitslice performance with the fast RNG, in cycles/byte (lower is better):

| primitive | mults/bits | d = 0 | d = 3 | d = 7 | d = 15 | d = 31 |
|---|---|---|---|---|---|---|
| Subterranean | 8 | 94 | 2.15k | 7.18k | 27.23k | 95.19k |
| Ascon | 12 | 101 | 3.07k | 11.45k | 42.39k | |
| Xoodoo | 12 | 112 | 3.12k | 10.49k | 38.35k | |
| Clyde | 12 | 177 | 3.44k | 13.57k | 45.34k | |
| Photon | 12 | 193 | 7.66k | 14.28k | 44.99k | 154k |
| Pyjamask | 14 | 1.59k | 16.52k | 31.76k | 97.88k | |
| Gimli | 24 | 127 | 5.51k | 19.15k | 76.91k | |
| Ace | 38 | 336 | 8.22k | 35.29k | 123k | |
| Gift | 40 | 358 | 11.08k | 36.79k | 136k | |
| Skinny | 48 | 441 | 18.19k | 61.75k | 200k | 664k |
| Spongent | 80 | 624 | 19.45k | 64.78k | 259k | 948k |
As for the n-sliced implementations, we observe a close match between the asymptotic performance of the primitives and their number of multiplications per bit, which becomes even more apparent as the order increases and the overhead of linear operations becomes comparatively smaller. Pyjamask remains a good example of this phenomenon, the inflection point with respect to Ace (which uses 3× more multiplications) being reached at order 15.
While Ascon with the fast RNG is slowed down by its suboptimal use of 64-bit registers in mslicing, its bitsliced performance is similar to Xoodoo’s, which exhibits the same number of multiplications per bit.
Bitslice performance with the pooling RNG, in cycles/byte (lower is better):

| primitive | mults/bits | d = 0 | d = 3 | d = 7 | d = 15 | d = 31 |
|---|---|---|---|---|---|---|
| Subterranean | 8 | 94 | 4.46k | 19.13k | 79.63k | 312k |
| Ascon | 12 | 101 | 7.33k | 30.33k | 125k | |
| Xoodoo | 12 | 112 | 6.69k | 28.79k | 120k | |
| Clyde | 12 | 177 | 7.88k | 31.04k | 127k | |
| Photon | 12 | 193 | 10.47k | 31.77k | 126k | 476k |
| Pyjamask | 14 | 1.59k | 20.33k | 52.81k | 193k | |
| Gimli | 24 | 127 | 12.14k | 53.64k | 236k | |
| Ace | 38 | 336 | 19.94k | 89.12k | 395k | |
| Gift | 40 | 358 | 21.38k | 93.92k | 405k | |
| Skinny | 48 | 441 | 34.28k | 131k | 525k | 1.97m |
| Spongent | 80 | 624 | 44.04k | 188k | 816k | 3.15m |
Finally, we observe that with the pooling RNG, already at order 15, the performance of our implementations is in line with their relative number of multiplications per bit. In bitslicing (even more than in n-slicing), the number of multiplications is performance-critical, even at relatively low masking orders.
Of these 11 NIST submissions, only Pyjamask provides a masked implementation. Our implementation is consistently (at every order, and with both the pooling and fast RNGs) 1.8 times slower than their masked implementation. The main reason is that they manually implemented the two most used routines in assembly.
The first one is a matrix multiplication. The Usuba code for this matrix multiplication is:
node mat_mult(mat_col:u32, vec:u32) returns (res:u32)
vars
    mask:u32
let
    res = 0;
    forall i in [0, 31] {
        mask := bitmask(vec,i);
        res := res ^ (mask & mat_col);
        mat_col := mat_col >>> 1;
    }
tel
When compiled with arm-none-eabi-gcc 4.9.3, the body of the loop compiles to 6 instructions. The reference implementation, on the other hand, unrolls the loop and uses ARM’s barrel shifter, so each iteration takes only 3 instructions. Using this hand-tuned matrix multiplication rather than Usuba’s offers a speedup of 15% to 52%, depending on the masking order, as shown in the table below:
| masking order | 3 | 7 | 15 | 31 | 63 | 127 |
|---|---|---|---|---|---|---|
| speedup | 52% | 49% | 43% | 34% | 24% | 15% |
The speedup is smaller at high masking orders because this matrix multiplication has a linear cost in the masking order, while the masked multiplication has a quadratic cost (it must compute all cross products between two shared values) and thus comes to dominate the running time.
The second hand-optimized routine is the masked multiplication. The authors of the reference implementation use the same ISW multiplication gadget as we do in Usuba (shown in the Masking section above), but manually optimized it in assembly. They unrolled the main loop twice, fused the two resulting inner loops, and carefully assigned 6 variables to registers. Writing m for the masking order, this optimization requires m fewer jumps (m/2 for the inner loops and m/2 for the outer loop), saves m/2 × 10 memory accesses thanks to the manual assignment of array values to registers, and saves a further m/2 × 3 memory accesses thanks to the merged inner loops. gcc fails to perform this optimization automatically.
Unlike the hand-optimized matrix multiplication, whose benefits diminish as the masking order increases, the optimized masked multiplication is more beneficial at higher masking orders, as can be seen from the table below, which compares it with the C implementation provided earlier in this post:
| masking order | 3 | 7 | 15 | 31 | 63 | 127 |
|---|---|---|---|---|---|---|
| speedup | 14% | 21% | 33% | 37% | 39% | 41% |
While the Usuba implementations could be sped up by using the hand-tuned version of the ISW masked multiplication, these two examples (matrix multiplication and masked multiplication) hint at the limits of achieving high performance by going through C. Indeed, while the superscalar nature of high-end CPUs can make up for the occasionally sub-optimal code produced by C compilers, embedded devices are less forgiving. Similarly, Schwabe and Stoffelen [22], who designed high-performance implementations of AES on Cortex-M3 and M4 microprocessors, preferred to write their own register allocator and instruction scheduler (in less than 250 lines of Python), suggesting that gcc’s register allocator and instruction scheduler may be far from optimal for cryptography on embedded devices.
We leave for future work directly targeting ARM assembly in Usuba, and implementing architecture-specific optimizations for embedded microprocessors. The resulting compiler backend is described in the figure below:
As of now, code generation is the very last pass of Usubac and does not go through its own intermediate representation (IR): the C code is generated directly from Usuba0. Optimizing down to assembly would require a “lowering” pass converting Usuba0 into an assembly-like IR with an unlimited number of registers. Several optimization passes would then be needed. Among the most important, a peephole optimizer would recognize and optimize common idioms; for instance, a shift followed by an instruction using its result could be turned into a single instruction using the barrel shifter. Finally, register allocation and instruction scheduling (performed either separately or together) would produce an AST representing real machine code, which could then be printed to produce an assembly file. Each of those passes would be architecture-specific.
Several approaches have been proposed to automatically secure implementations against power-based side-channel attacks. We review the main ones in chronological order.
Bayrak et al. [74] designed one of the first automated tools to protect cryptographic code at the software level. Their approach is purely empirical: they identify sensitive operations by running the cipher while recording power traces. The operations that are found to leak information are then protected using random precharging [73], which consists in adding random instructions before and after the offending instructions so as to lower the signal-to-noise ratio.
Agosta et al. [78] developed a tool to automatically protect ciphers against DPA using code morphing, which consists in dynamically (at runtime) generating and randomly choosing the code to be executed. To do so, the cipher is broken into fragments of a few instructions. For each fragment, several alternative (but semantically equivalent) sequences of instructions are generated at compile time. At runtime, a polymorphic engine selects one sequence from each of the fragments. The resulting code can leak secret data, but an attacker would not know where in the original cipher the leak is, since the code executing is dynamically randomized.
Moss et al. [76] proposed the first compiler to automate first-order Boolean masking. Based on CAO [77], it lets the developer annotate the inputs of the program as secret or public, and the compiler infers for each computation whether the result is secret (one of its operands is secret) or public (all operands are public). The compiler then searches for operations on secret values that break the masking (for instance a xor where the masks of the operands would cancel out), and fixes them by changing the masks of their operands.
Agosta et al. [89] designed a dataflow analysis to automatically mask ciphers at higher orders. Their analysis tracks the secret dependencies (plaintext or key) of each intermediate variable, and is thus able to detect intermediate results that would leak secret information and need to be masked. They implemented their algorithm in LLVM [80], allowing them to mask ciphers written in C (although some annotations are required in the C code to identify the plaintext and key).
Sleuth [79] is a tool to automatically verify whether an implementation leaks information that is statistically dependent on any secret input and would thus be vulnerable to SCA. Sleuth reduces this verification to a Boolean satisfiability problem, which is solved using a SAT solver. Sleuth is implemented as a pass in the LLVM backend, and thus ensures that the compiler does not break masking by reordering instructions.
Eldib et al. [82] developed a tool similar to Sleuth called SC Sniffer. However, whereas Sleuth verifies that all leakage is statistically independent of the secret inputs, SC Sniffer ensures that a cipher is perfectly masked, which is a stronger property: a masked algorithm is perfectly masked at order d if the distribution of any d intermediate results is statistically independent of any secret input [27]. SC Sniffer is also implemented as an LLVM pass and relies on the Yices SMT solver [84] to conduct its analysis. Eldib and Wang [83] then built MC Masker to synthesize masking countermeasures (SC Sniffer only performs analysis). MC Masker generates the masked program using inductive synthesis: it iteratively adds countermeasures until the SMT solver is able to prove that the program is perfectly masked.
G. Barthe et al. [85] proposed to verify masking using program equivalence. More specifically, their tool (later dubbed maskverif), inspired by [86], tries to construct a bijection between the distribution of the intermediate values of the masked program of interest and a distribution that is independent of the secret inputs. maskverif scales better than previous work (e.g., MC Masker) at higher masking orders, even though it cannot handle a full second-order masked AES. Furthermore, maskverif is not limited to analyzing 1-bit variables, unlike Sleuth and SC Sniffer.
G. Barthe et al. [87] later introduced the notion of strong non-interference. Intuitively, a strongly t-non-interfering code is t-probing secure (or t-non-interfering) and composable with any other strongly t-non-interfering code. This is a stronger property than t-probing security, since combining t-probing secure gadgets does not always produce a t-probing secure implementation [64]. They developed a tool (later dubbed maskComp) to generate masked C implementations from unmasked ones. Unlike previous work which could only verify the security of an implementation at a given order, maskComp is able to prove the security of an implementation at any order.
Coron [88] designed a tool called CheckMasks to automatically verify the security of higher-order masked implementations, similarly to maskverif [85]. For large implementations, CheckMasks can only verify the security at small orders. However, CheckMasks is also able to verify the t-SNI (or t-NI) property of gadgets at any order.
[1] Y. Ishai et al., Private Circuits: Securing Hardware against Probing Attacks, CRYPTO, 2003.
[2] K. Stoffelen, Efficient cryptography on the RISC-V architecture, LATINCRYPT, 2019.
[3] S. Belaïd et al., Tornado: Automatic Generation of Probing-Secure Masked Bitsliced Implementations, EUROCRYPT, 2020.
[4] M. Aagaard et al., ACE: An Authenticated Encryption and Hash Algorithm, 2019.
[5] C. Dobraunig et al., ASCON, 2019.
[6] D. Bellizia et al., Spook: Sponge-Based Leakage-Resilient Authenticated Encryption with a Masked Tweakable Block Cipher, 2019.
[7] S. Banik et al., GIFT-COFB, 2019.
[8] D. J. Bernstein et al., Gimli : A cross-platform permutation, CHES, 2017.
[9] Z. Bao et al., PHOTON-Beetle Authenticated Encryption and Hash Family, 2019.
[10] D. Goudarzi et al., Pyjamask, 2019.
[11] C. Beierle et al., SKINNY-AEAD and SKINNY-Hash, 2019.
[12] A. Bogdanov et al., SPONGENT: A lightweight hash function, CHES, 2011.
[13] T. Beyne et al., Elephant v1, 2019.
[14] J. Daemen et al., The Subterranean 2.0 cipher suite, 2019.
[15] J. Daemen et al., Xoodoo cookbook, 2018.
[16] J. Daemen et al., Xoodyak, a lightweight cryptographic scheme, 2019.
[17] A. Chakraborti et al., Security Analysis of HYENA Authenticated Encryption Mode, 2019.
[18] S. Banik et al., SUNDAE-GIFT, 2019.
[19] T. Iwata et al., Romulus, 2019.
[20] B. Chakraborty, M. Nandi, ORANGE, 2019.
[21] A. Adomnicai et al., Fixslicing: A New GIFT Representation, TCHES, 2020.
[22] P. Schwabe, K. Stoffelen, All the AES You Need on Cortex-M3 and M4, SAC, 2016.
[23] P. Kocher et al., Differential Power Analysis, CRYPTO, 1999.
[24] G. R. Blakley, Safeguarding cryptographic keys, 1979.
[25] A. Shamir, How to Share a Secret, 1979.
[26] M. L. Akkar, C. Giraud, An Implementation of DES and AES, Secure against Some Attacks, CHES, 2001.
[27] J. Blömer et al., Provably Secure Masking of AES, SAC, 2004.
[28] T. S. Messerges, Securing the AES finalists against power analysis attacks, FSE, 2000.
[29] Y. Ishai et al., Private Circuits: Securing Hardware against Probing Attacks, CRYPTO, 2003.
[32] K. Tiri et al., A dynamic and differential CMOS logic with signal independent power consumption to withstand differential power analysis on smart cards, 2002.
[37] J. D. Golic, C. Tymen, Multiplicative Masking and Power Analysis of AES, CHES, 2002.
[38] S. Chari et al., Towards sound approaches to counteract power-analysis attacks, CRYPTO, 1999.
[39] L. Goubin, J. Patarin, DES and Differential Power Analysis, the duplication method, CHES, 1999.
[40] J. Daemen, V. Rijmen, Resistance Against Implementation Attacks A Comparative Study of the AES Proposals, 1999.
[41] T. Popp, S. Mangard, Masked dual-rail pre-charge logic: DPA-resistance without routing constraints, CHES, 2005.
[42] T. Messerges, Using second-order power analysis to attack DPA resistant software, CHES, 2000.
[45] J.-J. Quisquater, D. Samyde, ElectroMagnetic Analysis (EMA): Measures and CounterMeasures for Smart Cards, Esmart, 2002.
[46] K. Gandolfi et al., Electromagnetic Analysis: Concrete Results, CHES, 2001.
[47] J. R. Rao, P. Rohatgi, EMpowering Side-Channel Attacks, 2001.
[48] E. Biham, A. Shamir, Power Analysis of the Key Scheduling of the AES Candidates, 1999.
[49] P. C. Kocher, Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems, CRYPTO, 1996.
[50] W. van Eck, Electromagnetic Radiation from Video Display Units: An Eavesdropping Risk, 1985.
[51] P. Wright, Spycatcher – The Candid Autobiography of a Senior Intelligence Officer, 1987.
[52] J. Kelsey et al., Side Channel Cryptanalysis of Product Ciphers, ESORICS, 1998.
[53] D. Page, Theoretical use of cache memory as a cryptanalytic side-channel, 2002.
[54] D. J. Bernstein, Cache-timing attacks on AES, 2005.
[55] J. Waddle, D. Wagner, Towards Efficient Second-Order Power Analysis, CHES, 2004.
[56] M. Joye et al., On Second-Order Differential Power Analysis, CHES, 2005.
[63] M. Rivain, E. Prouff, Provably Secure Higher-Order Masking of AES, CHES, 2010.
[64] J.-S. Coron et al., Higher-Order Side Channel Security and Mask Refreshing, FSE, 2013.
[65] K. Schramm, C. Paar, Higher Order Masking of the AES, CT-RSA, 2006.
[66] M. Rivain et al., Block Ciphers Implementations Provably Secure Against Second Order Side Channel Analysis, FSE, 2008.
[68] C. Carlet et al., Higher-Order Masking Schemes for S-Boxes, FSE, 2011.
[72] S. Belaïd et al., Tight Private Circuits: Achieving Probing Security with the Least Refreshing, ASIACRYPT, 2018.
[73] S. Tillich, J. Großschädl, Power Analysis Resistant AES Implementation with Instruction Set Extensions, CHES, 2007.
[74] A. G. Bayrak et al., A First Step Towards Automatic Application of Power Analysis Countermeasures, DAC, 2011.
[75] A. G. Bayrak et al., An Architecture-Independent Instruction Shuffler to Protect against Side-Channel Attacks, TACO, 2012.
[76] A. Moss et al., Compiler assisted masking, CHES, 2012.
[77] M. Barbosa et al., A domain-specific type system for cryptographic components, FSEN, 2011.
[78] G. Agosta et al., A Code Morphing Methodology to Automate Power Analysis Countermeasures, DAC, 2012.
[79] A. G. Bayrak et al., Sleuth: Automated verification of software power analysis countermeasures, CHES, 2013.
[80] C. Lattner, V. Adve, LLVM: A compilation framework for lifelong program analysis & transformation, CGO, 2004.
[81] C. Cadar et al., KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, OSDI, 2008.
[82] H. Eldib et al., SMT-based verification of software countermeasures against side-channel attacks, TACAS, 2014.
[83] H. Eldib, C. Wang, Synthesis of Masking Countermeasures against Side Channel Attacks, CAV, 2014.
[84] B. Dutertre, L. De Moura, A fast linear-arithmetic solver for DPLL(T), CAV, 2006.
[85] G. Barthe et al., Verified Proofs of Higher-Order Masking, EUROCRYPT, 2015.
[86] G. Barthe et al., Formal Certification of Code-Based Cryptographic Proofs, POPL, 2009.
[87] G. Barthe et al., Strong Non-Interference and Type-Directed Higher-Order Masking, CCS, 2016.
[88] J.-S. Coron, Formal Verification of Side-channel Countermeasures via Elementary Circuit Transformations, ACNS, 2018.
[89] G. Agosta et al., Compiler-based Side Channel Vulnerability Analysis and Optimized Countermeasures Application, DAC, 2013.
[90] S. Belaïd et al., Tornado: Automatic Generation of Probing-Secure Masked Bitsliced Implementations, EUROCRYPT, 2020.
[91] J.-S. Coron et al., Side Channel Cryptanalysis of a Higher Order Masking Scheme, CHES, 2007.
[92] A. Battistello et al., Horizontal side-channel attacks and countermeasures on the ISW masking scheme, CHES, 2016.