The C code generated by Usubac is compiled to assembly by a C compiler. While C programs can use a virtually unlimited number of variables, CPUs only offer a few registers to work with (between 8 and 32 on commonly used CPUs). C variables are thus mapped to registers by a register allocation algorithm. When more variables are live than there are available registers, some variables are spilled: the stack is used to temporarily store them.
R. Sethi [1] showed that determining whether a program can be computed using k registers is NP-complete. Compilers therefore rely on heuristics to map variables to registers. The most classical ones are based on graph coloring [2,3,4,5,17], and are bound to be approximate since graph coloring is itself NP-complete. Another commonly used approach is an algorithm called linear scan [6], which is favored by just-in-time compilers for its speed, at the expense of the quality of the allocation [7,8,9,10].
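To make the graph-coloring approach concrete, here is a minimal Python sketch of Chaitin-style allocation. The interference graph and register count below are illustrative (not taken from any real compiler): variables that are simultaneously live interfere, nodes of degree < k are repeatedly removed, and colors (registers) are assigned in reverse removal order.

```python
# Sketch of Chaitin-style graph-coloring register allocation.
# graph maps each variable to the set of variables it interferes with.

def color(graph, k):
    """Assign one of k registers (colors 0..k-1) to each variable, or spill it."""
    graph = {v: set(ns) for v, ns in graph.items()}
    work = {v: set(ns) for v, ns in graph.items()}
    stack, spilled = [], set()
    while work:
        # Simplify: pick a node with fewer than k remaining neighbours...
        node = next((v for v in work if len(work[v]) < k), None)
        if node is None:
            # ...otherwise pick a spill candidate (here: highest degree).
            node = max(work, key=lambda v: len(work[v]))
            spilled.add(node)
        for n in work.pop(node):
            if n in work:
                work[n].discard(node)
        stack.append(node)
    # Select: color in reverse removal order.
    colors = {}
    for node in reversed(stack):
        if node in spilled:
            continue
        used = {colors[n] for n in graph[node] if n in colors}
        colors[node] = min(c for c in range(k) if c not in used)
    return colors, spilled

# a interferes with b, and b with c: two registers suffice.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
colors, spilled = color(graph, k=2)
```

Real allocators (e.g. Briggs's optimistic coloring [5]) refine the spilling decision, but the simplify/select structure is the same.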
Register allocation is tightly coupled with instruction scheduling, which consists in ordering instructions so as to best exploit a CPU’s superscalar microarchitecture [11,12]. To illustrate the importance of instruction scheduling, consider the following sequence of instructions:
u = a + b;
v = u + 2;
w = c + d;
x = e + f;
Consider a CPU that can compute two additions per cycle and performs no reordering. Such a CPU would execute `u = a + b` in its first cycle, but would be unable to compute `v = u + 2` in the same cycle since `u` has not been computed yet. In a second cycle, it would compute `v = u + 2` as well as `w = c + d`, and, finally, in a third cycle, it would compute `x = e + f`. On the other hand, if the code had been scheduled as:
u = a + b;
w = c + d;
v = u + 2;
x = e + f;
it could have been executed in only 2 cycles. While out-of-order CPUs reduce the impact of such data hazards at run time, this phenomenon still occurs and needs to be taken into account when statically scheduling instructions.
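As a sanity check, the cycle counts above can be reproduced with a toy model of a dual-issue in-order CPU (a deliberate simplification: unit latency, two issue slots per cycle, and an instruction only issues once its operands were computed in an earlier cycle):

```python
# Toy model of a 2-wide in-order CPU. Instructions are
# (destination, [operands produced by earlier instructions in the list]).

def cycles(instrs, width=2):
    done = {}                 # variable -> cycle in which it is computed
    cycle, slots = 1, width
    for dest, deps in instrs:
        # Earliest cycle where all operands are available.
        ready = max((done[d] + 1 for d in deps if d in done), default=1)
        if ready > cycle or slots == 0:
            # In-order: stall, wasting any free slots in the current cycle.
            cycle = max(cycle + 1, ready)
            slots = width
        done[dest] = cycle
        slots -= 1
    return cycle

original  = [("u", []), ("v", ["u"]), ("w", []), ("x", [])]
scheduled = [("u", []), ("w", []), ("v", ["u"]), ("x", [])]
```

Running `cycles` on both orderings yields 3 cycles for the original sequence and 2 for the rescheduled one.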
The interaction between register allocation and instruction scheduling is due to spilling, which introduces memory operations. The cost of these additional memory operations can sometimes be reduced by an efficient instruction scheduling. Better still, combining register allocation and instruction scheduling can produce more efficient code than performing them separately [13,14,15,16].
In practice, however, C compilers strive to keep compilation times small, sometimes at the expense of the quality of the generated code. Both GCC and LLVM thus perform instruction scheduling and register allocation separately: instruction scheduling first, followed by register allocation, followed by some additional scheduling.
While they reuse traditional graph coloring algorithms, GCC’s and LLVM’s instruction schedulers and register allocators are highly tuned to produce efficient code, and are the product of decades of tweaking. GCC’s register allocator [18] and instruction scheduler thus amount to more than 50,000 lines of C code, and LLVM’s to more than 30,000 lines.
We capitalize on both of these efforts by providing two pre-schedulers, designed to produce C code that C compilers’ schedulers and register allocators will optimize better. The first scheduler reduces register pressure in bitsliced ciphers, while the second increases instruction-level parallelism (ILP) in msliced ciphers.
In bitsliced code, the major bottleneck is register pressure: a significant portion of the execution time is spent spilling registers to and from the stack. Given that hundreds of variables can be simultaneously alive in a bitsliced cipher, it is not surprising that C compilers have a hard time keeping the register pressure down. For instance, it is common for bitsliced ciphers to have around 40% of their instructions be loads and stores dedicated to spilling, as shown in the following table (Usuba-generated ciphers compiled with Clang 7.0.0):
| Cipher | % of instructions related to spilling |
|---|---|
| DES | 41% |
| Ascon | 43% |
| Gift | 38% |
| Present | 25% |
| Rectangle | 36% |
| Serpent | 39% |
We designed a scheduling algorithm that aims at reducing register pressure (and thus improving performance) in bitsliced code.
Let’s take the example of Rectangle to illustrate why bitsliced code suffers from so much spilling, and how this can be improved. Rectangle’s S-box can be written in Usuba as:
node sbox (a0:u32, a1:u32, a2:u32, a3:u32)
returns (b0:u32, b1:u32, b2:u32, b3:u32)
vars
t1:u32, t2:u32, t3:u32, t4:u32, t5:u32, t6:u32,
t7:u32, t8:u32, t9:u32, t10:u32, t11:u32, t12:u32
let
t1 = ~a1;
t2 = a0 & t1;
t3 = a2 ^ a3;
b0 = t2 ^ t3;
t5 = a3 | t1;
t6 = a0 ^ t5;
b1 = a2 ^ t6;
t8 = a1 ^ a2;
t9 = t3 & t6;
b3 = t8 ^ t9;
t11 = b0 | t8;
b2 = t6 ^ t11
tel
Usubac’s naive automatic bitslicing would replace all `u32` with `b32`, which would cause the variables to become vectors of 32 boolean elements, thus causing the operators to be unfolded as follows:
node sbox (a0:b32, a1:b32, a2:b32, a3:b32)
returns (b0:b32, b1:b32, b2:b32, b3:b32)
vars ...
let
t1[0] = ~a1[0];
t1[1] = ~a1[1];
t1[2] = ~a1[2];
...
t1[31] = ~a1[31];
t2[0] = a0[0] & t1[0];
t2[1] = a0[1] & t1[1];
...
t2[31] = a0[31] & t1[31];
...
While the initial S-box contained fewer than 10 simultaneously live variables, the bitsliced S-box contains 32 times more variables alive at any point. A first optimization to reduce register pressure thus happens during automatic bitslicing: Usubac detects nodes computing only bitwise instructions (and calls to nodes with the same property). Those nodes are specialized to take `b1` variables as input, and their call sites are replaced with several calls instead of one. In the case of Rectangle, the short S-box is thus called 32 times rather than calling the large S-box once. If the initial pre-bitslicing S-box call was:
(x0, x1, x2, x3) = sbox(y0, y1, y2, y3)
with `x0`, `x1`, `x2`, `x3`, `y0`, `y1`, `y2`, `y3` of type `u32`, then the naive bitsliced call would remain the same, except that the variables would be typed `b32`, and `sbox` would be the modified version that takes `b32` as inputs and repeats each operation 32 times. The optimized bitsliced version, however, is:
(x0[0], x1[0], x2[0], x3[0]) = sbox(y0[0], y1[0], y2[0], y3[0]);
(x0[1], x1[1], x2[1], x3[1]) = sbox(y0[1], y1[1], y2[1], y3[1]);
...
(x0[31], x1[31], x2[31], x3[31]) = sbox(y0[31], y1[31], y2[31], y3[31]);
where `sbox` calls the initial `sbox` specialized with input type `b1`.
Concretely, this optimization reduces register pressure, at the expense of more function calls. The benefits heavily depend on the cipher and the C compiler used. On AVX registers, using Clang as the C compiler, this yields a 34% speedup on Ascon, 12% on Gift and 6% on Clyde.
The same optimization cannot be applied to linear layers, however, as they almost always contain rotations, shifts or permutations. Consider for instance Rectangle’s linear layer:
node ShiftRows (input:u16x4) returns (out:u16x4)
let
out[0] = input[0];
out[1] = input[1] <<< 1;
out[2] = input[2] <<< 12;
out[3] = input[3] <<< 13
tel
Using a `u1x4` input instead of a `u16x4` is not possible due to the left rotations, which require all 16 bits of each input. Instead, this function will be bitsliced and become:
node ShiftRows (input:b1[4][16]) returns (out:b1[4][16])
let
out[0][0] = input[0][0]; -- out[0] = input[0]
out[0][1] = input[0][1];
...
out[0][15] = input[0][15];
out[1][0] = input[1][1]; -- out[1] = input[1] <<< 1
out[1][1] = input[1][2];
...
out[1][15] = input[1][0];
out[2][0] = input[2][12]; -- out[2] = input[2] <<< 12
out[2][1] = input[2][13];
...
out[2][15] = input[2][11];
out[3][0] = input[3][13]; -- out[3] = input[3] <<< 13
out[3][1] = input[3][14];
...
out[3][15] = input[3][12];
tel
In the case of Rectangle, the `ShiftRows` function can be inlined and entirely removed using copy propagation, since it only contains assignments. However, for other ciphers like Ascon, Serpent, Clyde or Skinny, the linear layers contain `xor`s, which cannot be optimized away. Similarly, the `AddRoundKey` step of most ciphers introduces a lot of `xor`s.
Overall, after bitslicing, the main function of Rectangle thus looks like:
state := AddRoundKey(state, key[0]); -- <--- lots of spilling in this node
state[0..3] := Sbox(state[0..3]);
state[4..7] := Sbox(state[4..7]);
...
state[60..63] := Sbox(state[60..63]);
state := LinearLayer(state); -- <--- lots of spilling in this node
state := AddRoundKey(state, key[1]); -- <--- lots of spilling in this node
state[0..3] := Sbox(state[0..3]);
state[4..7] := Sbox(state[4..7]);
...
state[60..63] := Sbox(state[60..63]);
...
We designed a scheduling algorithm that aims at interleaving the linear layers (and key additions) with S-box calls in order to reduce the live ranges of some variables and thus reduce the need for spilling.
This algorithm requires a first inlining step, which heuristically identifies good candidates for inlining. We experimentally determined three criteria for this first inlining pass that yield good results:
- if a node has more than 31 inputs and outputs, it is very unlikely to be an S-box and is therefore inlined. The largest S-boxes we know of take 8 inputs and return 8 outputs, as in AES or Camellia.
- if more than 65% of a node’s instructions are assignments, it is quite likely to be doing some kind of permutation, and is therefore inlined. Since most of the instructions in S-boxes are bitwise instructions, a node containing 65% of assignments is unlikely to be an S-box.
- if a node contains fewer than 10 instructions, it is unlikely to be an S-box and is therefore inlined. The shortest S-boxes we know of are Gift’s, with 11 instructions, and Rectangle’s, with 12 instructions. Nodes with few instructions often perform constant addition or key addition, or are used to factor out parts of the S-box (as in Skinny for instance).
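These three criteria can be summarized by a small predicate. The thresholds below are the ones quoted above; the node representation (a few precomputed counts) is a hypothetical stand-in for a real Usubac AST node:

```python
# Heuristic inlining criteria from the text. A node is described by
# hypothetical precomputed counts rather than a real Usubac node.

def should_inline(n_inputs_outputs, n_instructions, assignment_ratio):
    if n_inputs_outputs > 31:     # too wide to be an S-box (largest known: 8 in / 8 out)
        return True
    if assignment_ratio > 0.65:   # mostly assignments: probably a permutation
        return True
    if n_instructions < 10:       # too short to be an S-box (shortest known: 11)
        return True
    return False
```

For instance, a Rectangle-like S-box (8 inputs and outputs, 12 instructions, no assignments) is kept as a node, while a permutation-heavy linear layer or a tiny constant-addition node is inlined.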
After this inlining step, Rectangle’s code would thus be:
state[0] := state[0] ^ key[0][0]; -- AddRoundKey 1
state[1] := state[1] ^ key[0][1];
state[2] := state[2] ^ key[0][2];
state[3] := state[3] ^ key[0][3];
state[4] := state[4] ^ key[0][4];
state[5] := state[5] ^ key[0][5];
state[6] := state[6] ^ key[0][6];
state[7] := state[7] ^ key[0][7];
state[8] := state[8] ^ key[0][8];
state[9] := state[9] ^ key[0][9];
...
state[63] := state[63] ^ key[0][63];
state[0..3] := Sbox(state[0..3]); -- S-boxes 1
state[4..7] := Sbox(state[4..7]);
state[8..11] := Sbox(state[8..11]);
...
state[60..63] := Sbox(state[60..63]);
state[0] := state[0]; -- Linear layer 1
state[1] := state[1];
state[2] := state[2];
state[3] := state[3];
state[4] := state[4];
state[5] := state[5];
state[6] := state[6];
...
state[63] := state[60];
state[0] := state[0] ^ key[1][0]; -- AddRoundKey 2
state[1] := state[1] ^ key[1][1];
state[2] := state[2] ^ key[1][2];
state[3] := state[3] ^ key[1][3];
state[4] := state[4] ^ key[1][4];
state[5] := state[5] ^ key[1][5];
...
state[63] := state[63] ^ key[1][63];
state[0..3] := Sbox(state[0..3]); -- S-boxes 2
state[4..7] := Sbox(state[4..7]);
...
The inlined linear layer and key addition introduce a lot of spilling, since they use many variables a single time each. After our scheduling algorithm, Rectangle’s code becomes:
state[0] := state[0] ^ key[0][0]; -- Part of AddRoundKey 1
state[1] := state[1] ^ key[0][1];
state[2] := state[2] ^ key[0][2];
state[3] := state[3] ^ key[0][3];
state[0..3] := Sbox(state[0..3]); -- S-box 1
state[4] := state[4] ^ key[0][4]; -- Part of AddRoundKey 1
state[5] := state[5] ^ key[0][5];
state[6] := state[6] ^ key[0][6];
state[7] := state[7] ^ key[0][7];
state[4..7] := Sbox(state[4..7]); -- S-box 1
...
...
state[0] := state[0]; -- Part of Linear layer 1
state[1] := state[1];
state[2] := state[2];
state[3] := state[3];
state[0] := state[0] ^ key[1][0]; -- Part of AddRoundKey 2
state[1] := state[1] ^ key[1][1];
state[2] := state[2] ^ key[1][2];
state[3] := state[3] ^ key[1][3];
state[0..3] := Sbox(state[0..3]); -- S-box 2
state[4] := state[4]; -- Part of Linear layer 1
state[5] := state[5];
state[6] := state[6];
state[7] := state[7];
state[4] := state[4] ^ key[1][4]; -- Part of AddRoundKey 2
state[5] := state[5] ^ key[1][5];
state[6] := state[6] ^ key[1][6];
state[7] := state[7] ^ key[1][7];
state[4..7] := Sbox(state[4..7]); -- S-box 2
...
...
Our scheduling algorithm aims at reducing the lifespan of variables computed in the linear layers by interleaving the linear layers with S-box calls: the arguments of each S-box call are computed right before the call, thus removing the need to spill them. More formally, the pseudo-code of this algorithm is:
procedure Schedule(prog):
for each function call funCall of prog do
for each variable v in funCall's arguments do
if v's definition is not scheduled yet then
schedule v's definition (and dependencies) next
schedule funCall next
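A minimal Python rendition of this pseudo-code follows. The instruction representation is an assumption made for illustration: each instruction records whether it is a function call, the variables it defines, and the variables it uses.

```python
# Sketch of the Schedule procedure: for every function call, first place the
# definitions of its arguments (and their dependencies), then the call itself.
# Instructions are (is_call, defined_vars, used_vars, text).

def schedule(prog):
    order, placed = [], set()

    def reaching_def(v, i):
        """Index of the last instruction before i defining v, or None."""
        for j in range(i - 1, -1, -1):
            if v in prog[j][1]:
                return j
        return None

    def place(i):
        if i in placed:
            return
        placed.add(i)
        for v in prog[i][2]:                 # schedule dependencies first
            j = reaching_def(v, i)
            if j is not None:
                place(j)
        order.append(prog[i])

    for i, (is_call, _, _, _) in enumerate(prog):
        if is_call:
            place(i)
    for i in range(len(prog)):               # leftovers keep their original order
        place(i)
    return order

# Rectangle-like fragment: 8 key-addition xors, then two 4-bit S-box calls.
prog = (
    [(False, [f"s{i}"], [f"k{i}"], f"s{i} = s{i} ^ k{i}") for i in range(8)]
    + [(True, ["s0", "s1", "s2", "s3"], ["s0", "s1", "s2", "s3"], "sbox(s0..s3)"),
       (True, ["s4", "s5", "s6", "s7"], ["s4", "s5", "s6", "s7"], "sbox(s4..s7)")]
)
out = [instr[3] for instr in schedule(prog)]
```

On this fragment, the first four xors are scheduled, then the first `sbox` call, then the next four xors, then the second call, reproducing the interleaving shown above.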
The optimization of the automatic bitslicing presented earlier is crucial for this scheduling algorithm to work: without it, the S-box is a single large function rather than several calls to small S-boxes, and no interleaving can happen. Similarly, the inlining pass that happens before the scheduling itself is essential to ensure that the instructions of the linear layer are inlined and ready to be interleaved with calls to the S-boxes.
We evaluated this algorithm on 11 ciphers that have a distinct S-box and linear layer, where the S-box performs only bitwise instructions. We compiled the generated C code with GCC 8.3 and Clang, and ran the benchmarks on an Intel i5-6500. The results are shown in the following table:
| Cipher | gcc -Os (x86) | gcc -Os (AVX2) | gcc -O3 (x86) | gcc -O3 (AVX2) | clang -O3 (x86) | clang -O3 (AVX2) |
|---|---|---|---|---|---|---|
| ACE* | 1.00 | 1.00 | 1.02 | 1.01 | 1.00 | 0.99 |
| AES | 1.04 | 0.99 | 1.06 | 0.98 | 1.08 | 1.04 |
| Ascon | 1.35 | 1.32 | 1.25 | 1.27 | 1.04 | 1.19 |
| Clyde | 1.06 | 1.06 | 1.06 | 1.04 | 1.01 | 0.99 |
| DES | 1.16 | 1.19 | 1.16 | 1.23 | 1.01 | 1.02 |
| Gimli* | 1.00 | 0.99 | 1.01 | 1.00 | 1.00 | 1.01 |
| Gift | 1.16 | 1.18 | 1.12 | 1.16 | 1.04 | 1.09 |
| Photon | 1.05 | 1.14 | 0.97 | 0.93 | 0.96 | 0.97 |
| Present | 1.30 | 1.10 | 1.16 | 1.16 | 1.00 | 1.00 |
| Pyjamask | 1.19 | 1.35 | 1.04 | 1.04 | 0.99 | 1.00 |
| Rectangle | 1.28 | 1.20 | 1.15 | 1.15 | 1.00 | 0.99 |
| Serpent | 1.18 | 1.20 | 1.20 | 1.20 | 1.04 | 1.00 |
| Skinny | 1.14 | 1.16 | 1.18 | 1.18 | 1.03 | 1.14 |
| Spongent* | 1.00 | 1.00 | 1.00 | 1.02 | 1.00 | 0.99 |
| Subterranean* | 1.00 | 0.99 | 1.00 | 1.01 | 1.00 | 1.00 |
| Xoodoo* | 1.00 | 1.01 | 1.00 | 1.00 | 0.98 | 1.00 |
Our scheduling algorithm is clearly beneficial, yielding speedups of up to x1.35. The ciphers marked with a * are not expected to benefit from this scheduling algorithm, since they do not contain a small S-box composed of bitwise instructions alternating with a large linear layer. We still report their performance to show that our algorithm is not detrimental to them: on average, it does not change their performance. Note that all the speedups are statistically significant (even a x1.01 speedup) according to Wilcoxon’s rank-sum test [19].
The exact benefits of this scheduling algorithm, however, vary greatly from one cipher, architecture, and compiler to another. For instance, while on general purpose registers our algorithm improves AES’s performance by a factor x1.04, it actually degrades it (x0.99) on AVX2 registers. On Present, our algorithm is clearly more efficient on general purpose registers, while on Pyjamask, it is clearly better on AVX2 registers.
We do not provide a detailed analysis of the performance because of the heuristic nature of the results: our algorithm relies on a heuristic inlining, and is only meant to help GCC’s and Clang’s schedulers, which are themselves heuristic. In any case, Usubac’s autotuner will disable this optimization when it is detrimental to performance.
Unlike bitsliced code, msliced programs exhibit much lower register pressure. When targeting deeply-pipelined superscalar architectures (like high-end Intel CPUs), spilling is less of an issue, the latency of the few resulting load and store operations being hidden by the CPU pipeline. Instead, the challenge consists in saturating the CPU execution units. To do so, one must increase ILP while taking data hazards into account. Consider for instance Clyde’s linear layer:
node lbox(x,y:u32) returns (xr,yr:u32)
vars
a, b, c, d: u32
let
a = x ^ (x >>> 12);
b = y ^ (y >>> 12);
a := a ^ (a >>> 3);
b := b ^ (b >>> 3);
a := a ^ (x >>> 17);
b := b ^ (y >>> 17);
c = a ^ (a >>> 31);
d = b ^ (b >>> 31);
a := a ^ (d >>> 26);
b := b ^ (c >>> 25);
a := a ^ (c >>> 15);
b := b ^ (d >>> 15);
(xr, yr) = (a,b)
tel
Only 2 instructions per cycle can ever be computed, because of data dependencies: `x >>> 12` and `y >>> 12` during the first cycle, then `x ^ (x >>> 12)` and `y ^ (y >>> 12)` in the second cycle, `a >>> 3` and `b >>> 3` in the third one, etc. However, after inlining, it may be possible to partially interleave this function with other parts of the cipher (for instance the S-box or the key addition).
One may hope for the out-of-order nature of modern CPUs to alleviate this issue, allowing the processor to execute independent instructions further ahead in the execution stream. However, due to the bulky nature of sliced code, we observe that the reservation station is quickly saturated, preventing actual out-of-order execution. Unfortunately, C compilers have their own heuristics to schedule instructions, and they often fail to generate code that is optimal with regard to data hazards.
We statically increase ILP in Usubac by maintaining a look-behind window of the 10 previously scheduled instructions. To schedule the next instruction, we pick one that has no data hazard with the instructions in the look-behind window and that would execute on a different execution unit than they do. If no such instruction can be found, we reduce the size of the look-behind window and, ultimately, just schedule any instruction that is ready. While C compilers are free to completely undo this optimization when they perform their own scheduling, we observe significant performance improvements by explicitly performing this first scheduling pass ourselves.
The speedups gained by using this scheduling algorithm are summarized in the following table (using Clang 7.0.0):
| Cipher | x86 | AVX2 |
|---|---|---|
ACE | 1.35 | 1.39 |
AES | - | 1.04 |
Ascon | 1.10 | 1.04 |
Chacha20 | 1.05 | 1.10 |
Clyde | 1.02 | 1.04 |
Gift | 1.01 | 1.00 |
Gimli | 0.98 | 1.01 |
Rectangle (V) | 0.97 | 1.00 |
Rectangle (H) | - | 1.02 |
Serpent | 0.97 | 1.00 |
Xoodoo | 1.03 | 1.00 |
This algorithm is particularly useful for ciphers that rely on functions with heavy data dependencies. For instance, consider ACE’s `f` function:
node f(x:u32) returns (y:u32)
let
y = ((x <<< 5) & x) ^ (x <<< 1)
tel
It contains only 4 instructions, but cannot execute in less than 3 cycles due to data dependencies. However, looking at the bigger picture of ACE, this function is wrapped inside the so-called `simeck_box`:
node simeck_box(input:u32x2, rc:u32) returns (output:u32x2)
vars round:u32x2[9]
let
round[0] = input;
forall i in [0, 7] {
round[i+1] = ( f(round[i][0]) ^ round[i][1] ^
0xfffffffe ^ ((rc >> i) & 1),
round[i][0] ) ;
}
output = round[8]
tel
This function is also bottlenecked by data dependencies, but it is called 3 times consecutively each round, with independent inputs:
node ACE_step(A,B,C,D,E:u32x2,RC,SC:u32[3]) returns (Ar,Br,Cr,Dr,Er:u32x2)
let
A := simeck_box(A,RC[0]);
C := simeck_box(C,RC[1]);
E := simeck_box(E,RC[2]);
...
After inlining and unrolling `simeck_box`, the effect of our scheduling algorithm is to interleave those 3 `simeck_box` computations, thus removing any pipeline stalls due to data hazards. We observe that this scheduling algorithm brings the number of instructions executed per cycle (IPC) up from 1.93 to 2.67 on AVX2 registers, translating into a x1.35 speedup.
Similarly, Chacha20’s round is composed of 4 functions called quarter rounds:
node QuarterRound (a:u<V>32, b:u<V>32, c:u<V>32, d:u<V>32)
returns (aR:u<V>32, bR:u<V>32, cR:u<V>32, dR:u<V>32)
let
a := a + b;
d := (d ^ a) <<< 16;
c := c + d;
b := (b ^ c) <<< 12;
aR = a + b;
dR = (d ^ aR) <<< 8;
cR = c + dR;
bR = (b ^ cR) <<< 7;
tel
Despite containing only 12 instructions, this function cannot be executed in fewer than 12 cycles: none of its instructions can be computed in the same cycle as another, since each of them depends on the previous one. However, this function is called 4 times consecutively on independent inputs:
node DoubleRound (state:u<V>32x16) returns (stateR:u<V>32x16)
let
state[0,4,8,12] := QR(state[0,4,8,12]);
state[1,5,9,13] := QR(state[1,5,9,13]);
state[2,6,10,14] := QR(state[2,6,10,14]);
state[3,7,11,15] := QR(state[3,7,11,15]);
...
Our scheduling algorithm is thus able to interleave those calls, allowing them to execute simultaneously and removing stalls due to data dependencies. We observe that on general purpose registers, using this algorithm on Chacha20 brings the IPC up from 2.79 to 3.44, which translates directly into a x1.05 speedup.
For other ciphers like AES, Ascon and Clyde, the speedup is smaller but still significant, at x1.04. AES’s `ShiftRows` contains 8 shuffle instructions and nothing else, and thus cannot be executed in fewer than 8 cycles. Our algorithm allows this function to be run simultaneously with either the S-box (`SubBytes`) or the `MixColumns` step. Both Clyde’s and Ascon’s linear layers are bottlenecked by data dependencies. Our algorithm allows those linear layers to be interleaved with the S-boxes, thus reducing data hazards.
Look-behind window size
The look-behind window needs to contain at least n-1 instructions on an architecture that can execute n bitwise/arithmetic instructions per cycle. Experimentally, however, increasing the size of the look-behind window beyond this number leads to better performance.
To illustrate the impact of the look-behind window size, let’s assume that we want to run the C code below on a CPU that can compute two xors each cycle:
a1 = x1 ^ x2;
a2 = x3 ^ x4;
b1 = a1 ^ x5;
b2 = a2 ^ x6;
c1 = x7 ^ x8;
c2 = c1 ^ x9;
As is, this code would be executed in 4 cycles: during the first one, `a1` and `a2` would be computed, and during the second one, `b1` and `b2`. However, `c2` depends on `c1` and thus cannot be computed during the same cycle.
If we were to re-schedule this code using our algorithm with a look-behind window of 1 instruction, the final scheduling would not change (assuming that ties are broken using the initial order of the source, i.e., when 2 instructions could be scheduled without dependencies with the previous ones, the one that came first in the initial code is selected).
Using a look-behind window of 2 instructions however would produce the following scheduling:
a1 = x1 ^ x2;
a2 = x3 ^ x4;
c1 = x7 ^ x8;
b1 = a1 ^ x5;
b2 = a2 ^ x6;
c2 = c1 ^ x9;
This code can now run in 3 cycles rather than 4. This shows why a look-behind window of only 2 instructions, the minimum for SSE/AVX architectures, is not always optimal. In practice, the complexity of the algorithm depends on the size of the look-behind window, and using arbitrarily large look-behind windows would be prohibitively expensive. Experimentally, there is very little to gain from using more than 10 instructions in the look-behind window, and this size keeps the algorithm fairly fast.
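This behavior can be reproduced with a small model of the look-behind pass (a sketch only: it keeps the data-hazard criterion and the source-order tie-breaking, but omits the execution-unit criterion and the window-shrinking fallback described earlier):

```python
# Sketch of look-behind scheduling on the xor example above.
# Instructions are (destination, [dependencies on earlier destinations]).

def lookbehind_schedule(instrs, window):
    deps = dict(instrs)
    remaining = [d for d, _ in instrs]
    scheduled = []
    while remaining:
        # Ready instructions: all dependencies already scheduled.
        ready = [d for d in remaining if all(x in scheduled for x in deps[d])]
        recent = set(scheduled[-window:]) if window else set()
        # Prefer an instruction with no data hazard in the look-behind window,
        # breaking ties by source order; otherwise take any ready instruction.
        pick = next((d for d in ready if not set(deps[d]) & recent), ready[0])
        scheduled.append(pick)
        remaining.remove(pick)
    return scheduled

instrs = [("a1", []), ("a2", []), ("b1", ["a1"]), ("b2", ["a2"]),
          ("c1", []), ("c2", ["c1"])]
```

With a window of 1, the schedule is unchanged; with a window of 2, `c1` is hoisted above `b1`, yielding the 3-cycle schedule shown above.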
[1] R. Sethi, Complete Register Allocation Problems, STOC, 1973.
[2] G. J. Chaitin et al., Register allocation via coloring, 1981.
[3] G. J. Chaitin, Register allocation & spilling via graph coloring, CC, 1982.
[4] P. Briggs et al., Coloring heuristics for register allocation, PLDI, 1989.
[5] P. Briggs et al., Improvements to graph coloring register allocation, 1994.
[6] M. Poletto, V. Sarkar, Linear scan register allocation, 1999.
[7] B. Alpern et al., The Jalapeño virtual machine, 2000.
[8] T. Kotzmann et al., Design of the Java HotSpot client compiler for Java 6, 2008.
[9] C. Wimmer, M. Franz, Linear scan register allocation on SSA form, CGO, 2010.
[10] H. Mössenböck, M. Pfeiffer, Linear Scan Register Allocation in the Context of SSA Form and Register Constraints, CC, 2002.
[11] P. B. Gibbons, S. S. Muchnick, Efficient instruction scheduling for a pipelined architecture, CC, 1986.
[12] D. Bernstein, M. Rodeh, Global instruction scheduling for superscalar machines, PLDI, 1991.
[13] R. F. Touzeau, A Fortran compiler for the FPS-164 scientific computer, CC, 1984.
[14] J. R. Goodman, W.-C. Hsu, Code scheduling and register allocation in large basic blocks, ICS, 1988.
[15] S. S. Pinter, Register allocation with instruction scheduling: a new approach, PLDI, 1993.
[16] R. Motwani et al., Combining register allocation and instruction scheduling, 1995.
[17] M. D. Smith, G. Holloway, Graph-Coloring Register Allocation for Irregular Architectures, 2001.
[18] V. N. Makarov, The Integrated Register Allocator for GCC, 2007.
[19] F. Wilcoxon, Individual comparisons by ranking methods, 1992.