Scheduling

Published on 2020-05-09 by Darius Mercadier

Introduction

The C code generated by Usubac is compiled to assembly by a C compiler. While C allows a (virtually) unlimited number of variables, CPUs only offer a few registers to work with (between 8 and 32 for commonly used CPUs). C variables are thus mapped to assembly registers by a register allocation algorithm. When more variables are alive than there are available registers, some variables are spilled: the stack is used to temporarily store them.

R. Sethi [1] showed that determining whether a program can be computed using k registers is NP-complete. Heuristic methods are thus used in compilers to map variables to registers. The most classical ones are based on graph coloring [2,3,4,17], and bound to be approximate since graph coloring itself is NP-complete. Another commonly used approach is an algorithm called linear scan [6], which is favored by just-in-time compilers for its speed, at the expense of the quality of the allocation [7,8,9,10].
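
To give an intuition of the coloring approach, here is a deliberately naive greedy allocator sketched in Python (the node ordering heuristic and graph encoding are illustrative; production allocators such as GCC's IRA [18] add coalescing, live-range splitting and much more):

```python
# Sketch: greedy graph-coloring register allocation.
# Variables are nodes, edges connect variables alive at the same time,
# colors are registers; uncolorable nodes are spilled.

def allocate(interference, k):
    """Map each variable to one of k registers, spilling when no
    register is free among its already-colored neighbours."""
    regs, spilled = {}, []
    # color highest-degree nodes first (a common heuristic)
    for v in sorted(interference, key=lambda v: -len(interference[v])):
        taken = {regs[n] for n in interference[v] if n in regs}
        free = [r for r in range(k) if r not in taken]
        if free:
            regs[v] = free[0]
        else:
            spilled.append(v)
    return regs, spilled

# 4 mutually-interfering variables but only 3 registers: one spill.
g = {v: [w for w in "abcd" if w != v] for v in "abcd"}
regs, spilled = allocate(g, 3)
print(len(spilled))  # 1
```

This already shows the core phenomenon discussed below: as soon as more variables interfere than there are registers, spills are unavoidable.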

Register allocation is tightly coupled with instruction scheduling, which consists in ordering the instructions in the best way to take advantage of a CPU’s superscalar microarchitecture [11,12]. To illustrate the importance of instruction scheduling, consider the following sequence of instructions:

u = a + b;
v = u + 2;
w = c + d;
x = e + f;

Consider a CPU that can compute two additions per cycle and performs no reordering. Such a CPU would execute u = a + b in its first cycle, and would be unable to compute v = u + 2 in the same cycle since u has not been computed yet. In a second cycle, it would compute v = u + 2 as well as w = c + d, and, finally, in a third cycle, it would compute x = e + f. On the other hand, if the code had been scheduled as:

u = a + b;
w = c + d;
v = u + 2;
x = e + f;

it could have been executed in only 2 cycles. While out-of-order CPUs reduce the impact of such data hazards at run-time, this phenomenon still occurs and needs to be taken into account when statically scheduling instructions.
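
To make these cycle counts concrete, here is a tiny Python model of such a dual-issue in-order CPU (a sketch, not a real pipeline model; instructions are encoded as hypothetical (destination, operands) pairs):

```python
# Sketch: count cycles on an in-order CPU that issues up to `width`
# independent additions per cycle.

def count_cycles(schedule, width=2):
    """Issue instructions strictly in order; an instruction stalls
    until every operand it reads has been computed."""
    ready_at = {}                # variable -> cycle in which it is computed
    cycle, issued = 1, 0
    for dest, operands in schedule:
        # earliest cycle at which all operands are available
        earliest = max([ready_at.get(op, 0) + 1 for op in operands] + [cycle])
        if earliest > cycle or issued == width:
            cycle, issued = max(earliest, cycle + 1), 0
        ready_at[dest] = cycle
        issued += 1
    return cycle

original  = [("u", ["a", "b"]), ("v", ["u"]), ("w", ["c", "d"]), ("x", ["e", "f"])]
reordered = [("u", ["a", "b"]), ("w", ["c", "d"]), ("v", ["u"]), ("x", ["e", "f"])]
print(count_cycles(original), count_cycles(reordered))  # 3 2
```

The model reproduces the counts above: 3 cycles for the original order, 2 for the reordered one.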

The interaction between register allocation and instruction scheduling is due to spilling, which introduces memory operations. The cost of those additional memory operations can sometimes be reduced by an efficient instruction scheduling. Better still, combining register allocation and instruction scheduling can produce more efficient code than performing them separately [13,14,15,16].

In practice, however, C compilers strive to keep compilation times small, sometimes at the expense of the quality of the generated code. Both GCC and LLVM thus perform instruction scheduling and register allocation separately: instruction scheduling first, followed by register allocation, followed by some additional scheduling.

While built on traditional graph coloring algorithms, GCC's and LLVM's instruction schedulers and register allocators are highly tuned to produce efficient code, and are the product of decades of tweaking. GCC's register allocator and instruction scheduler thus amount to more than 50,000 lines of C code, and LLVM's to more than 30,000 lines of code.

We capitalize on both of these efforts by providing two pre-schedulers, in order to produce C code that will be better optimized by the C compilers' schedulers and register allocators. The first scheduler reduces register pressure in bitsliced ciphers, while the second increases instruction-level parallelism in msliced ciphers.

Bitslicing

In bitsliced code, the major bottleneck is register pressure: a significant portion of the execution time is spent spilling registers to and from the stack. Given that hundreds of variables can be alive at the same time in a bitsliced cipher, it is not surprising that C compilers have a hard time keeping the register pressure down. For instance, it is common for bitsliced ciphers to have around 40% of their instructions be loads and stores used for spilling, as shown in the following table (Usuba-generated ciphers compiled with Clang 7.0.0):

Cipher      % of instructions related to spilling
DES         41%
Ascon       43%
Gift        38%
Present     25%
Rectangle   36%

We designed a scheduling algorithm that aims at reducing register pressure (and thus improving performance) in bitsliced code.

Let’s take the example of Rectangle to illustrate why bitsliced codes have so much spilling and how this can be improved. Rectangle’s S-box can be written in Usuba as:

node sbox (a0:u32, a1:u32, a2:u32, a3:u32)
     returns  (b0:u32, b1:u32, b2:u32, b3:u32)
vars
    t1:u32, t2:u32, t3:u32, t4:u32, t5:u32, t6:u32,
    t7:u32, t8:u32, t9:u32, t10:u32, t11:u32, t12:u32
let
    t1 = ~a1;
    t2 = a0 & t1;
    t3 = a2 ^ a3;
    b0 = t2 ^ t3;
    t5 = a3 | t1;
    t6 = a0 ^ t5;
    b1 = a2 ^ t6;
    t8 = a1 ^ a2;
    t9 = t3 & t6;
    b3 = t8 ^ t9;
    t11 = b0 | t8;
    b2 = t6 ^ t11
tel

Usubac's naive automatic bitslicing would replace all u32 with b32, turning each variable into a vector of 32 boolean values, and thus causing the operators to be unfolded as follows:

node sbox (a0:b32, a1:b32, a2:b32, a3:b32)
     returns  (b0:b32, b1:b32, b2:b32, b3:b32)
vars ...
let
    t1[0] = ~a1[0];
    t1[1] = ~a1[1];
    t1[2] = ~a1[2];
    ...
    t1[31] = ~a1[31];
    t2[0] = a0[0] & t1[0];
    t2[1] = a0[1] & t1[1];
    ...
    t2[31] = a0[31] & t1[31];
    ...

Optimizing the automatic bitslicing

While fewer than 10 variables are simultaneously alive in the initial S-box, the bitsliced S-box contains 32 times more variables alive at any point. A first optimization to reduce register pressure thus happens during automatic bitslicing: Usubac detects nodes that compute only bitwise instructions and calls to nodes with the same property. Those nodes are specialized to take b1 variables as input, and their call sites are replaced with several calls instead of one. In the case of Rectangle, the short S-box is thus called 32 times rather than calling the large S-box once. If the initial pre-bitslicing S-box call was:

(x0, x1, x2, x3) = sbox(y0, y1, y2, y3)

with x0, x1, x2, x3, y0, y1, y2, y3 of type u32, then the naive bitsliced call would remain the same, except that the variables would have type b32, and sbox would be the modified version that takes b32 inputs and repeats each operation 32 times. The optimized bitsliced version, however, is:

(x0[0], x1[0], x2[0], x3[0]) = sbox(y0[0], y1[0], y2[0], y3[0]);
(x0[1], x1[1], x2[1], x3[1]) = sbox(y0[1], y1[1], y2[1], y3[1]);
...
(x0[31], x1[31], x2[31], x3[31]) = sbox(y0[31], y1[31], y2[31], y3[31]);

where sbox now refers to the initial sbox specialized with input types b1.

Concretely, this optimization reduces register pressure, at the expense of more function calls. The benefits depend heavily on the cipher and C compiler used. On AVX registers, using Clang as the C compiler, this yields a 34% speedup on Ascon, 12% on Gift and 6% on Clyde.

The same optimization cannot be applied to linear layers, however, as they almost always contain rotations, shifts or permutations. Consider for instance Rectangle's linear layer:

node ShiftRows (input:u16x4) returns (out:u16x4)
let
    out[0] = input[0];
    out[1] = input[1] <<< 1;
    out[2] = input[2] <<< 12;
    out[3] = input[3] <<< 13
tel

Using a u1x4 input instead of a u16x4 is not possible because of the left rotations, which require all 16 bits of each input. Instead, this function will be bitsliced and become:

node ShiftRows (input:b1[4][16]) returns (out:b1[4][16])
let 
    out[0][0] = input[0][0];       -- out[0] = input[0]
    out[0][1] = input[0][1];
    ...
    out[0][15] = input[0][15];
    out[1][0] = input[1][1];       -- out[1] = input[1] <<< 1
    out[1][1] = input[1][2];
    ...
    out[1][15] = input[1][0];
    out[2][0] = input[2][12];      -- out[2] = input[2] <<< 12
    out[2][1] = input[2][13];
    ...
    out[2][15] = input[2][11];
    out[3][0] = input[3][13];      -- out[3] = input[3] <<< 13
    out[3][1] = input[3][14];
    ...
    out[3][15] = input[3][12];
tel

In the case of Rectangle's ShiftRows function, it can be inlined and entirely removed by copy propagation, since it only contains assignments. However, for other ciphers like Ascon, Serpent, Clyde or Skinny, the linear layers contain xors, which cannot be optimized away. Similarly, the AddRoundKey step of most ciphers introduces a lot of xors.

Overall, after bitslicing, the main function of Rectangle thus looks like:

state := AddRoundKey(state, key[0]);    -- <--- lots of spilling in this node
state[0..3] := Sbox(state[0..3]);
state[4..7] := Sbox(state[4..7]);
...
state[60..63] := Sbox(state[60..63]);
state := LinearLayer(state);            -- <--- lots of spilling in this node
state := AddRoundKey(state, key[1]);    -- <--- lots of spilling in this node
state[0..3] := Sbox(state[0..3]);
state[4..7] := Sbox(state[4..7]);
...
state[60..63] := Sbox(state[60..63]);
...

Bitslice code scheduling

We designed a scheduling algorithm that aims at interleaving the linear layers (and key additions) with S-box calls in order to reduce the live ranges of some variables and thus reduce the need for spilling.

This algorithm requires a first inlining step that heuristically tries to recognize good candidates for inlining. We experimentally determined three criteria for this first inlining pass that yield good results:

  • if a node has more than 31 inputs and outputs, it is very unlikely to be an S-box and should therefore be inlined. The largest S-boxes we know of take 8 inputs and return 8 outputs, as in AES or Camellia.

  • if more than 65% of a node's instructions are assignments, it is quite likely to be doing some kind of permutation, and should therefore be inlined. Since most of the instructions in S-boxes are bitwise instructions, a node containing 65% of assignments is unlikely to be an S-box.

  • if a node contains fewer than 10 instructions, it is unlikely to be an S-box and is therefore inlined. The shortest S-boxes we know of are Gift's, with 11 instructions, and Rectangle's, with 12 instructions. Nodes with few instructions often perform constant addition or key addition, or are used to factor out part of the S-box (as in Skinny for instance).
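
These three criteria can be sketched as a simple predicate (Python, with a made-up node representation; the thresholds are the ones given above):

```python
# Sketch of the inlining heuristic: a node that fails all three tests
# is considered a potential S-box and kept as a function call.

def should_inline(node):
    """Return True when a node is unlikely to be an S-box and should
    therefore be inlined before scheduling."""
    n_io = len(node["inputs"]) + len(node["outputs"])
    instrs = node["instructions"]
    n_assign = sum(1 for op in instrs if op == "assign")
    return (n_io > 31                         # too wide for any known S-box
            or n_assign / len(instrs) > 0.65  # mostly a permutation
            or len(instrs) < 10)              # too short to be an S-box

# Rectangle's S-box: 8 inputs/outputs, 12 bitwise instructions -> kept.
sbox = {"inputs": ["a"] * 4, "outputs": ["b"] * 4,
        "instructions": ["not"] + ["xor"] * 7 + ["and"] * 2 + ["or"] * 2}
# A bitsliced permutation: 32 inputs/outputs, assignments only -> inlined.
perm = {"inputs": ["x"] * 16, "outputs": ["y"] * 16,
        "instructions": ["assign"] * 16}
print(should_inline(sbox), should_inline(perm))  # False True
```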

After this inlining step, Rectangle’s code would thus be:

state[0] := state[0] ^ key[0][0];        -- AddRoundKey 1
state[1] := state[1] ^ key[0][1];
state[2] := state[2] ^ key[0][2];
state[3] := state[3] ^ key[0][3];
state[4] := state[4] ^ key[0][4];
state[5] := state[5] ^ key[0][5];
state[6] := state[6] ^ key[0][6];
state[7] := state[7] ^ key[0][7];
state[8] := state[8] ^ key[0][8];
state[9] := state[9] ^ key[0][9];
...
state[63] := state[63] ^ key[0][63];
state[0..3] := Sbox(state[0..3]);        -- S-boxes 1
state[4..7] := Sbox(state[4..7]);
state[8..11] := Sbox(state[8..11]);
...
state[60..63] := Sbox(state[60..63]);
state[0] := state[0];                    -- Linear layer 1
state[1] := state[1];
state[2] := state[2];
state[3] := state[3];
state[4] := state[4];
state[5] := state[5];
state[6] := state[6];
...
state[63] := state[60];
state[0] := state[0] ^ key[1][0];        -- AddRoundKey 2
state[1] := state[1] ^ key[1][1];
state[2] := state[2] ^ key[1][2];
state[3] := state[3] ^ key[1][3];
state[4] := state[4] ^ key[1][4];
state[5] := state[5] ^ key[1][5];
...
state[63] := state[63] ^ key[1][63];
state[0..3] := Sbox(state[0..3]);        -- S-boxes 2
state[4..7] := Sbox(state[4..7]);
...

The inlined linear layer and key addition introduce a lot of spilling, since they use many variables a single time each. After our scheduling algorithm, Rectangle's code becomes:

state[0] := state[0] ^ key[0][0];        -- Part of AddRoundKey 1
state[1] := state[1] ^ key[0][1];
state[2] := state[2] ^ key[0][2];
state[3] := state[3] ^ key[0][3];
state[0..3] := Sbox(state[0..3]);        -- S-box 1
state[4] := state[4] ^ key[0][4];        -- Part of AddRoundKey 1
state[5] := state[5] ^ key[0][5];
state[6] := state[6] ^ key[0][6];
state[7] := state[7] ^ key[0][7];
state[4..7] := Sbox(state[4..7]);        -- S-box 1
...
...
state[0] := state[0];                    -- Part of Linear layer 1
state[1] := state[1];
state[2] := state[2]; 
state[3] := state[3];
state[0] := state[0] ^ key[1][0];        -- Part of AddRoundKey 2
state[1] := state[1] ^ key[1][1];
state[2] := state[2] ^ key[1][2];
state[3] := state[3] ^ key[1][3];
state[0..3] := Sbox(state[0..3]);        -- S-box 2
state[4] := state[4];                    -- Part of Linear layer 1
state[5] := state[5];
state[6] := state[6]; 
state[7] := state[7];
state[4] := state[4] ^ key[1][4];        -- Part of AddRoundKey 2
state[5] := state[5] ^ key[1][5];
state[6] := state[6] ^ key[1][6];
state[7] := state[7] ^ key[1][7];
state[4..7] := Sbox(state[4..7]);        -- S-box 2
...
...

Our scheduling algorithm aims at reducing the lifespan of variables computed in the linear layers by interleaving the linear layer with S-box calls: the parameters of each S-box are computed right before the S-box call, removing the need to spill them. More formally, the pseudo-code of this algorithm is:

procedure Schedule(prog):
    for each function call funCall of prog do
        for each variable v in funCall's arguments do
            if v's definition is not scheduled yet then
                schedule v's definition (and dependencies) next
        schedule funCall next
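
A minimal Python rendition of this procedure might look as follows (the statement representation, with `defs`, `uses` and `call` fields, is made up for illustration; Usubac's real IR differs):

```python
def schedule(prog):
    """Schedule each function call right after the definitions of its
    arguments, per the pseudo-code above."""
    def_site = {v: i for i, stmt in enumerate(prog) for v in stmt["defs"]}
    out, placed = [], set()

    def place(i):
        # emit statement i, recursively emitting its dependencies first
        if i in placed:
            return
        placed.add(i)
        for v in prog[i]["uses"]:
            if v in def_site:
                place(def_site[v])
        out.append(i)

    for i, stmt in enumerate(prog):
        if stmt["call"]:              # pull each call's arguments close to it
            place(i)
    for i in range(len(prog)):        # keep any leftover statements, in order
        place(i)
    return [prog[i] for i in out]

# Rectangle-like round: 8 key additions feeding 2 S-box calls.
prog = [{"defs": [f"s{i}"], "uses": [f"x{i}", f"k{i}"], "call": False}
        for i in range(8)]
prog += [{"defs": ["a"], "uses": ["s0", "s1", "s2", "s3"], "call": True},
         {"defs": ["b"], "uses": ["s4", "s5", "s6", "s7"], "call": True}]
print([s["defs"][0] for s in schedule(prog)])
# ['s0', 's1', 's2', 's3', 'a', 's4', 's5', 's6', 's7', 'b']
```

On this toy round, the key additions feeding the second S-box call are moved after the first call, exactly the interleaving shown above.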

The optimization of the automatic bitslicing presented earlier is crucial for this scheduling algorithm to work: without it, the S-box is a single large function rather than several calls to small S-boxes, and no interleaving can happen. Similarly, the inlining pass that happens before the scheduling itself is essential to ensure that the instructions of the linear layer are inlined and ready to be interleaved with calls to the S-boxes.

Performances

We evaluated this algorithm on 11 ciphers that contain a distinct S-box and linear layer, where the S-box only performs bitwise instructions. We compiled the generated C code with GCC 8.3, and ran the benchmarks on an Intel i5-6500. The results are shown in the following table:

Bitslice scheduling speedup

Cipher          gcc -Os         gcc -O3         clang -O3
                x86    AVX2     x86    AVX2     x86    AVX2
ACE*            1.00   1.00     1.02   1.01     1.00   0.99
AES             1.04   0.99     1.06   0.98     1.08   1.04
Ascon           1.35   1.32     1.25   1.27     1.04   1.19
Clyde           1.06   1.06     1.06   1.04     1.01   0.99
DES             1.16   1.19     1.16   1.23     1.01   1.02
Gimli*          1.00   0.99     1.01   1.00     1.00   1.01
Gift            1.16   1.18     1.12   1.16     1.04   1.09
Photon          1.05   1.14     0.97   0.93     0.96   0.97
Present         1.30   1.10     1.16   1.16     1.00   1.00
Pyjamask        1.19   1.35     1.04   1.04     0.99   1.00
Rectangle       1.28   1.20     1.15   1.15     1.00   0.99
Serpent         1.18   1.20     1.20   1.20     1.04   1.00
Skinny          1.14   1.16     1.18   1.18     1.03   1.14
Spongent*       1.00   1.00     1.00   1.02     1.00   0.99
Subterranean*   1.00   0.99     1.00   1.01     1.00   1.00
Xoodoo*         1.00   1.01     1.00   1.00     0.98   1.00

Our scheduling algorithm is clearly beneficial, yielding speedups of up to x1.35. The ciphers marked with a * are not expected to benefit from this scheduling algorithm, since they do not contain a small S-box composed of bitwise instructions alternating with a large linear layer. We still provide their numbers to show that our algorithm is not too detrimental to them: on average, it does not change their performance. Note that all the speedups, even a x1.01 speedup, are significant according to Wilcoxon's rank-sum test [19].

The exact benefits of this scheduling algorithm, however, vary greatly from one cipher, architecture and compiler to another. For instance, while on general purpose registers our algorithm improves AES's performance by a factor x1.04, it actually slows it down by a factor x0.99 on AVX2 registers. On Present, our algorithm is clearly more efficient on general purpose registers, while on Pyjamask, it is clearly better on AVX2 registers.

We do not provide a detailed analysis of the performance because of the heuristic nature of the results: our algorithm relies on heuristic inlining, and is only meant to help GCC's and Clang's schedulers, which are themselves heuristic. In any case, Usubac's autotuner will disable this optimization when it is detrimental to performance.

mslicing

Unlike bitsliced programs, msliced programs have much lower register pressure. When targeting deeply-pipelined superscalar architectures (like high-end Intel CPUs), spilling is less of an issue, the latency of the few resulting load and store operations being hidden by the CPU pipeline. Instead, the challenge consists in saturating the CPU's execution units. To do so, one must increase instruction-level parallelism (ILP) while taking data hazards into account. Consider for instance Clyde's linear layer:

node lbox(x,y:u32) returns (xr,yr:u32)
vars
    a, b, c, d: u32
let
    a  = x ^ (x >>> 12);
    b  = y ^ (y >>> 12);
    a := a ^ (a >>> 3);
    b := b ^ (b >>> 3);
    a := a ^ (x >>> 17);
    b := b ^ (y >>> 17);
    c  = a ^ (a >>> 31);
    d  = b ^ (b >>> 31);
    a := a ^ (d >>> 26);
    b := b ^ (c >>> 25);
    a := a ^ (c >>> 15);
    b := b ^ (d >>> 15);
    (xr, yr) = (a,b)
tel

Only 2 instructions per cycle can ever be computed, because of data dependencies: x >>> 12 and y >>> 12 during the first cycle, then x ^ (x >>> 12) and y ^ (y >>> 12) in the second cycle, a >>> 3 and b >>> 3 in the third one, etc. However, after inlining, it may be possible to partially interleave this function with other parts of the cipher (for instance the S-box or the key addition).

One may hope for the out-of-order nature of modern CPUs to alleviate this issue by allowing the processor to execute independent instructions further ahead in the execution stream. However, due to the bulky nature of sliced code, we observe that the reservation station is quickly saturated, preventing actual out-of-order execution. Unfortunately, C compilers have their own heuristics to schedule instructions, and they often fail to generate code that is optimal with regard to data hazards.

We statically increase ILP in Usubac by maintaining a look-behind window of the 10 previously scheduled instructions. To schedule an instruction, we pick one that has no data hazard with the instructions in the look-behind window and, if possible, one that would execute on a different execution unit than they do. If no such instruction can be found, we reduce the size of the look-behind window and, ultimately, just schedule any instruction that is ready. While the C compilers are free to completely undo this optimization when they perform their own scheduling, we observe significant performance improvements by explicitly performing this first scheduling pass ourselves.
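
This pass can be sketched in Python as follows (the instruction representation and the execution-unit tags are assumptions made for illustration; the real implementation works on Usubac's IR):

```python
def ilp_schedule(instrs, window=10):
    """Greedily emit instructions, preferring one with no data hazard
    against the last `window` emitted instructions and headed for a
    different execution unit; shrink the window when nothing fits."""
    all_defs = {v for s in instrs for v in s["defs"]}
    out, done, pending = [], set(), list(instrs)

    def ready(c):   # every operand is an input or has been computed
        return all(u not in all_defs or u in done for u in c["uses"])

    def hazard(c, recent):
        return any(u in r["defs"] for r in recent for u in c["uses"])

    while pending:
        cands = [c for c in pending if ready(c)]
        chosen = cands[0]                 # fallback: oldest ready instruction
        for w in range(window, -1, -1):   # shrink the window on failure
            recent = out[-w:] if w else []
            free = [c for c in cands if not hazard(c, recent)]
            if free:
                busy = {r["unit"] for r in recent}
                # prefer an instruction headed for a not-yet-busy unit
                chosen = next((c for c in free if c["unit"] not in busy), free[0])
                break
        pending.remove(chosen)
        out.append(chosen)
        done.update(chosen["defs"])
    return out

# A Clyde-like fragment: two independent rotate-then-xor chains.
instrs = [
    {"defs": ["t0"], "uses": ["x"],       "unit": "shift"},  # t0 = x >>> 12
    {"defs": ["a"],  "uses": ["x", "t0"], "unit": "bool"},   # a  = x ^ t0
    {"defs": ["t1"], "uses": ["y"],       "unit": "shift"},  # t1 = y >>> 12
    {"defs": ["b"],  "uses": ["y", "t1"], "unit": "bool"},   # b  = y ^ t1
]
print([i["defs"][0] for i in ilp_schedule(instrs)])  # ['t0', 't1', 'a', 'b']
```

On this fragment the pass interleaves the two chains, so the two rotations can issue together, followed by the two xors.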

The speedups gained by using this scheduling algorithm are summarized in the following table (using Clang 7.0.0):

mslice scheduling speedup

Cipher          x86     AVX2
ACE             1.35    1.39
AES             -       1.04
Ascon           1.10    1.04
Chacha20        1.05    1.10
Clyde           1.02    1.04
Gift            1.01    1.00
Gimli           0.98    1.01
Rectangle (V)   0.97    1.00
Rectangle (H)   -       1.02
Serpent         0.97    1.00
Xoodoo          1.03    1.00

This algorithm is particularly useful with ciphers that rely on functions with heavy data dependencies. For instance, consider ACE’s f function:

node f(x:u32) returns (y:u32)
let
    y = ((x <<< 5) & x) ^ (x <<< 1)
tel

It contains only 4 instructions, but cannot execute in fewer than 3 cycles due to data dependencies. However, looking at the bigger picture of ACE, this function is wrapped inside the so-called Simeck-box:

node simeck_box(input:u32x2, rc:u32) returns (output:u32x2)
vars round:u32x2[9]
let
    round[0] = input;
    forall i in [0, 7] {
      round[i+1] = ( f(round[i][0]) ^ round[i][1] ^ 
                      0xfffffffe ^ ((rc >> i) & 1), 
                     round[i][0] ) ;
    }
    output = round[8]
tel

This function, too, is bottlenecked by data dependencies, but it is actually called 3 times consecutively each round, with independent inputs:

node ACE_step(A,B,C,D,E:u32x2,RC,SC:u32[3]) returns (Ar,Br,Cr,Dr,Er:u32x2)
let
    A := simeck_box(A,RC[0]);
    C := simeck_box(C,RC[1]);
    E := simeck_box(E,RC[2]);
    ...

After inlining and unrolling simeck_box, the effect of our scheduling algorithm is to interleave those 3 simeck_box instances, removing the pipeline stalls caused by data hazards. We thus observe that this scheduling algorithm brings the number of instructions executed per cycle (IPC) up from 1.93 to 2.67 on AVX2 registers, translating into a x1.35 speedup.

Similarly, Chacha20’s round is composed of 4 functions called quarter rounds:

node QuarterRound (a:u<V>32, b:u<V>32, c:u<V>32, d:u<V>32)
     returns (aR:u<V>32, bR:u<V>32, cR:u<V>32, dR:u<V>32)
let
    a := a + b;
    d := (d ^ a) <<< 16;
    c := c + d;
    b := (b ^ c) <<< 12;
    aR = a + b;
    dR = (d ^ aR) <<< 8;
    cR = c + dR;
    bR = (b ^ cR) <<< 7;
tel

This function, despite containing only 12 instructions, cannot be executed in fewer than 12 cycles: no two of its instructions can be computed in the same cycle, since each of them depends on the previous one. However, this function is called 4 times consecutively on independent inputs:

node DoubleRound (state:u<V>32x16) returns (stateR:u<V>32x16)
let
    state[0,4,8,12]  := QR(state[0,4,8,12]);
    state[1,5,9,13]  := QR(state[1,5,9,13]);
    state[2,6,10,14] := QR(state[2,6,10,14]);
    state[3,7,11,15] := QR(state[3,7,11,15]);
    ...

Our scheduling algorithm is thus able to interleave those calls, allowing them to execute simultaneously and removing the stalls caused by data dependencies. We thus observe that, on general purpose registers, using this algorithm on Chacha20 brings the IPC up from 2.79 to 3.44, which translates directly into a x1.05 speedup.

For other ciphers like AES, Ascon and Clyde, the speedup is smaller but still significant at x1.04. AES's ShiftRows contains 8 shuffle instructions and nothing else, and thus cannot be executed in fewer than 8 cycles. Our algorithm allows this function to run simultaneously with either the S-box (SubBytes) or the MixColumns step. Both Clyde's and Ascon's linear layers are bottlenecked by data dependencies; our algorithm allows those linear layers to be interleaved with the S-boxes, thus reducing data hazards.

Look-behind window size

The look-behind window needs to contain at least n-1 instructions on an architecture that can execute n bitwise/arithmetic instructions per cycle. Experimentally, however, increasing the size of the look-behind window beyond this number leads to better performance.

To illustrate the impact of the look-behind window size, let’s assume that we want to run the C code below on a CPU that can compute two xors each cycle:

a1 = x1 ^ x2;
a2 = x3 ^ x4;
b1 = a1 ^ x5;
b2 = a2 ^ x6;
c1 = x7 ^ x8;
c2 = c1 ^ x9;

As is, this code would execute in 4 cycles: during the first one, a1 and a2 would be computed, and during the second one, b1 and b2. However, c2 depends on c1 and thus cannot be computed during the same cycle: c1 is computed during the third cycle, and c2 during the fourth.

If we were to re-schedule this code using our algorithm with a look-behind window of 1 instruction, the final schedule would not change (assuming that ties are broken using the initial order of the source, i.e., that when two instructions could be scheduled without dependencies on the previous ones, the one that comes first in the initial code is selected).

Using a look-behind window of 2 instructions, however, would produce the following schedule:

a1 = x1 ^ x2;
a2 = x3 ^ x4;
c1 = x7 ^ x8;
b1 = a1 ^ x5;
b2 = a2 ^ x6;
c2 = c1 ^ x9;

This code now runs in 3 cycles rather than 4. This shows why a look-behind window of only 2 instructions is not always optimal on SSE/AVX architectures. In practice, the complexity of the algorithm grows with the size of the look-behind window, and using arbitrarily large look-behind windows would be prohibitively expensive. Experimentally, there is very little to gain by using more than 10 instructions in the look-behind window, and this size keeps the algorithm fairly fast.

References

[1] R. Sethi, Complete Register Allocation Problems, STOC, 1973.

[2] G. J. Chaitin et al., Register allocation via coloring, 1981.

[3] G. J. Chaitin, Register allocation & spilling via graph coloring, CC, 1982.

[4] P. Briggs et al., Coloring heuristics for register allocation, PLDI, 1989.

[5] P. Briggs et al., Improvements to graph coloring register allocation, 1994.

[6] M. Poletto, V. Sarkar, Linear scan register allocation, 1999.

[7] B. Alpern et al., The Jalapeño virtual machine, 2000.

[8] T. Kotzmann et al., Design of the Java HotSpot client compiler for Java 6, 2008.

[9] C. Wimmer, M. Franz, Linear scan register allocation on SSA form, CGO, 2010.

[10] H. Mössenböck, M. Pfeiffer, Linear Scan Register Allocation in the Context of SSA Form and Register Constraints, CC, 2002.

[11] P. B. Gibbons, S. S. Muchnick, Efficient instruction scheduling for a pipelined architecture, CC, 1986.

[12] D. Bernstein, M. Rodeh, Global instruction scheduling for superscalar machines, PLDI, 1991.

[13] R. F. Touzeau, A Fortran compiler for the FPS-164 scientific computer, CC, 1984.

[14] J. R. Goodman, W.-C. Hsu, Code scheduling and register allocation in large basic blocks, ICS, 1988.

[15] S. S. Pinter, Register allocation with instruction scheduling: a new approach, PLDI, 1993.

[16] R. Motwani et al., Combining register allocation and instruction scheduling, 1995.

[17] M. D. Smith, G. Holloway, Graph-Coloring Register Allocation for Irregular Architectures, 2001.

[18] V. N. Makarov, The Integrated Register Allocator for GCC, 2007.

[19] F. Wilcoxon, Individual comparisons by ranking methods, 1992.