Wednesday, December 17, 2014

Durst (lack of) progress and Rust

Lately, I haven't gotten much done on Durst, quantitatively speaking. I've written barely any new code. In part, this is because I keep getting distracted, both with other projects and with every new idea I have for the language - I'm too busy planning Durst to actually make it. This is unfortunate, and I hope I'll get some stuff done over winter break, but for now, I'll talk about the compiler I have been working on: Rust.

Rust's compiler is honestly quite ugly, and so I've made a few attempts to clean up some of it (specifically the name resolution code). However, I've mainly been working on the standard library's data structures. I really enjoy the puzzle of figuring out how to make APIs that are perfectly safe (they make it impossible to, e.g., read uninitialized memory or use-after-free) yet fully powerful. Recently, I redesigned BTreeMap's internal APIs to make them far safer (and faster).

The core problem is that navigating through a BTree often requires maintaining a stack of the nodes that have been visited to get to the current node. However, mutating any of these nodes could invalidate the rest, so Rust won't let us just store a stack of mutable references. As such, there is an internal stack module that manages the unsafety of holding a stack of nodes, only allowing mutation at the top. The issue that arises is figuring out how to push to the stack. If the stack module allowed you to push arbitrary nodes, then it wouldn't work for its designed purpose, namely holding a contiguous path down a BTree: it would be easy to add a node farther down the tree or even one from a different tree.

To solve this, I had the stack expose the top node with a special 'id lifetime, only accepting a request to push that same node's children. I'm still trying to figure out the best way to explain this, but the PR is here.
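In the meantime, here is a much-simplified sketch of the invariant in Rust. These are my own toy types, not the actual BTreeMap internals, and this version omits the 'id branding entirely - it just shows how "push only through the current top" keeps the path contiguous:

```rust
// Toy types, not the real BTreeMap code. A real version would tie the
// stack to the tree with a lifetime parameter; this sketch just shows
// why restricting pushes to children of the top keeps the path valid.

struct Node {
    val: i32,
    children: Vec<Node>,
}

struct PathStack {
    // Raw pointers into a tree we were handed; safe to dereference here
    // because the API only ever mutates the node at the top of the path.
    path: Vec<*mut Node>,
}

impl PathStack {
    fn new(root: &mut Node) -> PathStack {
        PathStack { path: vec![root as *mut Node] }
    }

    fn top_val(&self) -> i32 {
        unsafe { (*(*self.path.last().unwrap())).val }
    }

    // The *only* way to push: descend into child `i` of the current top.
    // Arbitrary nodes - from farther down the tree, or from another tree
    // entirely - can never enter the path.
    fn descend(&mut self, i: usize) -> bool {
        let top = *self.path.last().unwrap();
        match unsafe { (*top).children.get_mut(i) } {
            Some(child) => {
                self.path.push(child as *mut Node);
                true
            }
            None => false,
        }
    }
}
```

The real PR strengthens this further with an invariant 'id lifetime so that even a handle to the top node cannot be mixed up between two different stacks.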

Backpropagation, memory, and other optimization troubles

In my previous post, I described an intermediate representation that made a number of optimization problems much easier. However, it still has some complications.


Naively, I could just represent memory operations as side effects, modifying the world. However, this is suboptimal for exactly the reasons that I chose to keep operations like addition out of the normal flow - it hinders optimization. As such, I need some other representation of memory. libFIRM takes the approach of having a global "Store" that it passes in and out of every write and into every read. This has a number of large advantages - it naturally expresses that reads can be reordered and that memory operations are independent of input and output. Moreover, if this is extended with a bunch of separate Stores that can be split and merged, this provides a perfect model for aliasing - a Store can be split into two nonaliasing pieces, and operations on one won't affect the other. However, this has the same issue as the state machine representation of control flow - there is nothing preventing all of the memory from being duplicated.
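To make the Store-threading concrete, here is a toy version in Rust. The node and field names are my own invention, not libFIRM's: every Write consumes a Store and produces a new one, while a Read only consumes one, so two Reads of the same Store share no dependency edge and can be freely reordered.

```rust
// Toy Store-threading encoding (names are mine, not libFIRM's).

enum Node {
    InitialStore,
    // Consumes `store`, conceptually produces a new Store (itself).
    Write { store: Box<Node>, addr: u32, value: i32 },
    // Only consumes a Store - it produces a value, not a new Store.
    Read { store: Box<Node>, addr: u32 },
}

// Evaluate a Read by walking back through the chain of Writes it depends on.
fn eval_read(node: &Node) -> Option<i32> {
    if let Node::Read { store, addr } = node {
        let mut cur: &Node = &**store;
        loop {
            match cur {
                Node::Write { addr: a, value, .. } if a == addr => return Some(*value),
                Node::Write { store, .. } => cur = &**store,
                _ => return None,
            }
        }
    } else {
        None
    }
}
```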

One idea I had to get around this is to make every write to memory produce a read-only Store and take a "memory continuation" to the next write. This circumvents the issue of duplicated memory, but comes with its own problems: merging separate Stores becomes inexpressible, and I would need a new MemoryPhi node to handle conditionals in the memory pipeline. Given these problems, I am currently leaning towards a state-based approach, but I haven't fully decided yet.


In a simple list of instructions, it is easy to do the sort of dataflow analysis I described in the previous post. However, with my new IR, it becomes harder, involving a lot of going "backwards". This is because just following the control flow chain will not result in every node being analysed, as many operations are out of the control flow. To remedy this, the optimizer has to be able to take a node whose inputs haven't been analysed, analyse the inputs, and then do another analysis of the use. This is tedious, but it also makes things a bit more consistent - with the switch to continuation-based control flow, analysing later instructions involves analysing the arguments of earlier instructions.

The other place where this sort of backward analysis becomes crucial is in situations like the following:

if x == 5:
  y := x
else:
  y := 5

This should be optimized to y := 5. However, to do this, the optimizer has to know that x is 5 when analysing y := x. This is more difficult than it looks, because it has to extract this information from the result of a comparison. It has to go backward in the tree, realizing that since x == 5 is true, x must be 5.
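A sketch of that backward step in Rust (the Expr type and function names here are hypothetical, not Durst code): learning that a comparison is true refines the environment, which then simplifies later uses of the variable.

```rust
use std::collections::HashMap;

enum Expr {
    Var(&'static str),
    Const(i32),
    Eq(Box<Expr>, Box<Expr>),
}

// Walk backward through a condition known to be true, recording the
// facts it implies. Only the `var == const` shape is handled here.
fn assume_true(cond: &Expr, env: &mut HashMap<&'static str, i32>) {
    if let Expr::Eq(a, b) = cond {
        match (&**a, &**b) {
            (Expr::Var(v), Expr::Const(c)) | (Expr::Const(c), Expr::Var(v)) => {
                env.insert(*v, *c);
            }
            _ => {}
        }
    }
}

// Simplify an expression to a constant if the environment pins it down.
fn try_const(e: &Expr, env: &HashMap<&'static str, i32>) -> Option<i32> {
    match e {
        Expr::Const(c) => Some(*c),
        Expr::Var(v) => env.get(v).copied(),
        _ => None,
    }
}
```

With the fact `x == 5` assumed true, the use of x in `y := x` simplifies to the constant 5.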

In general, the question that the optimizer is asking seems to be: "Here is a node. Here are some facts I know to be true. What is the most simplified form of every node that this node depends upon?"

Allocation and random numbers

Ideally, the optimizer should be able to remove unused allocations and reorder allocations. However, I don't see a way to keep them out of the control flow, as they do have side effects. Two runs of an allocation function should not return the same result, and so allocation cannot be a pure function. The same issue shows up with random numbers - unless the programmer cares about replayability, the compiler should be able to reorder and remove calls to a random number generator. The problem in both these cases is the same: we have a function that is not deterministic (calling it twice returns two different things) yet is not side-effecting. I honestly have no idea how to deal with this.

A libFIRM-inspired framework for optimization

While just translating code from my new language to assembly is cool and useful, what really intrigues me about compilers is optimization. Ideally, I'd like my compiler not just to produce binaries, but to produce fast binaries. Additionally, I'd like to do this in an extensible way, making it easy to write new optimizations and to have them integrate well with the existing passes.

The most powerful and useful way to integrate analyses and transforms that I have seen is the technique of Lerner, Grove, and Chambers, which interleaves the various transforms so that each one can benefit from the simplifications made by the rest. Much of my plan for implementing it follows the strategy taken in Hoopl. In more detail, each optimization pass is made up of two functions: an analysis function and a rewrite function. The rewrite function takes the current dataflow state and transforms an instruction into a simpler form. The analysis function just steps the dataflow state forward. To pass information between optimizations, every rewrite is run on an instruction before all the analyses are, so every analysis sees the maximally simplified instruction.

As an example of why this is powerful, consider the interaction between constant propagation and constant folding on the following pseudocode:

x := 5
y := x + 2
z := x + y

If constant propagation is run, then this will result in

x := 5
y := 5 + 2
z := 5 + y

which constant folding will reduce to

x := 5
y := 7
z := 5 + y

This is definitely suboptimal, as z should have been optimized to 12. Switching the order of the two passes does not help - it actually makes the code worse, as constant folding, run first, never finds anything to simplify.

In contrast, when the two optimizations are interleaved, they produce optimal code. Before constant propagation can analyse y := 5 + 2, constant folding rewrites it to y := 7. Constant propagation recognises this new form, substituting to make z := 5 + 7, which is then folded into 12.
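The interleaving above can be sketched on straight-line code. This is a hedged toy with my own names (the real scheme handles arbitrary control flow): each instruction is rewritten to a fixed point before the analysis state steps forward, so later instructions always see fully simplified facts.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Rhs {
    Const(i32),
    Var(&'static str),
    Add(Box<Rhs>, Box<Rhs>),
}

struct Assign {
    dst: &'static str,
    rhs: Rhs,
}

// One combined rewrite: constant propagation on Vars, folding on Adds.
fn rewrite(r: &Rhs, env: &HashMap<&'static str, i32>) -> Rhs {
    match r {
        Rhs::Var(v) => match env.get(v) {
            Some(c) => Rhs::Const(*c), // propagation
            None => r.clone(),
        },
        Rhs::Add(a, b) => {
            let (a, b) = (rewrite(a, env), rewrite(b, env));
            if let (Rhs::Const(x), Rhs::Const(y)) = (&a, &b) {
                Rhs::Const(x + y) // folding
            } else {
                Rhs::Add(Box::new(a), Box::new(b))
            }
        }
        Rhs::Const(_) => r.clone(),
    }
}

fn optimize(prog: &mut [Assign]) {
    let mut env = HashMap::new();
    for inst in prog.iter_mut() {
        let simplified = rewrite(&inst.rhs, &env); // rewrite first...
        inst.rhs = simplified;
        if let Rhs::Const(c) = inst.rhs {
            env.insert(inst.dst, c); // ...then step the analysis forward
        }
    }
}
```

Run on the example program, `y := x + 2` folds to `y := 7` before the analysis of y happens, so z folds all the way to 12.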

Intermediate Representation

While having a way to optimize code is important, it needs a representation of the code that is easy to analyse to be effective. A simple AST of my language would work, but it would not be very amenable to optimization and would not be very general. To defer the question of register allocation and to make optimization opportunities more obvious, I will use an SSA form representation. This has numerous advantages that I will not enumerate here, but they are substantial enough that most modern optimizers (GCC and LLVM are notable examples) use it.
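As a tiny example of the conversion, a variable that is assigned twice gets a fresh name for each assignment, so every name has exactly one definition:

x := 1
x := x + 2
print(x)

becomes, in SSA form,

x1 := 1
x2 := x1 + 2
print(x2)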

Beyond just using an SSA form language, I am using an idea inspired by Cliff Click's graph-based intermediate representation, used in Java's HotSpot VM and in libFIRM. Instead of just having linear sequences of instructions, only side-effecting instructions are put into control flow, obviating the need to rearrange instructions carefully. This idea of separating out the pure computations is exactly what makes Haskell so highly optimized - GHC can do many transformations by knowing that order of computation does not matter.

I deviate from Click in one crucial way, however. His representation treats the world as a state machine, having some "world" object that is passed in and out of every operation that needs to modify the environment. GHC has something similar for its IO representation, using State# RealWorld as this object. This has some unfortunate consequences, though. First of all, infinite loops are no longer well formed: they result in a circular data dependency, not tied to any of the exit nodes used to detect dead code. Secondly, there is nothing stopping some badly designed optimization from "duplicating" the world, trying to do I/O and then roll it back with another world object.

To remedy these issues, I use a continuation-passing-style implementation of the world. Instead of having each operation take a value from its predecessor, each operation holds a continuation pointing at its successor. This prevents any duplication of world state, as an operation can only point to one successor, and it solves the problem of infinite loops by looking at things reachable from the start, not the end. Also, dead code analysis is made much simpler, as something is live iff it is reachable from the start node. If statements can then just select between two values based on a boolean, with the actual control flow handled by continuations.
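A rough sketch of the node shapes this implies, with my own naming (not a final design): pure nodes only reference their data inputs and sit outside control flow, while effectful nodes hold a continuation to their single successor.

```rust
type NodeId = usize;

enum Node {
    Const(i32),
    Add(NodeId, NodeId),                   // pure: not in control flow
    Print { value: NodeId, next: NodeId }, // effectful: continuation `next`
    Halt,
}

// Pure evaluation, done on demand as control flow reaches a use.
fn eval(nodes: &[Node], id: NodeId) -> i32 {
    match &nodes[id] {
        Node::Const(c) => *c,
        Node::Add(a, b) => eval(nodes, *a) + eval(nodes, *b),
        _ => panic!("not a value node"),
    }
}

// Follow the continuation chain from `start`, collecting printed values.
// Liveness is just reachability from the start node.
fn run(nodes: &[Node], start: NodeId) -> Vec<i32> {
    let mut out = Vec::new();
    let mut cur = start;
    loop {
        match &nodes[cur] {
            Node::Print { value, next } => {
                out.push(eval(nodes, *value));
                cur = *next;
            }
            Node::Halt => return out,
            _ => panic!("not in control flow"),
        }
    }
}
```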


Doing my best with ASCII diagrams, here are a few examples of my IR.

x := 13
y := 12
print(x + y)


[Start]   [Const 13]
   |         ^
   V         |
[Print] -> [Add] -> [Const 12]

x := true
if x:
  y := 5
else:
  y := 7


   |  >--> [Const true]
   V  |
[ If  ]
 |  |
 |  | >--> [Const 5]
 V  V |
[ Phi ] --> [Const 7]
  |  ^
  V  |
[Print]

and is optimized to (assuming x had actually been a variable, rather than the constant true)

   |    [Const true]
   |       ^
   |       |    >---> [Const 5]
   V       |    |
[Print] -> [ If  ] ---> [Const 7]

x := 0
loop forever:
  x := x + 1


  | V-----------------------<
  | | >--> [Const 0]        |
  V V |                     |
[ Phi ]--V  [Const 1]       |
  |  ^   |       ^          |
  |  |   >----V  |          |
  |  +------ [Add]          |
  V  |                      |
[Print] --------------------^