Wednesday, December 17, 2014

Durst (lack of) progress and Rust

Lately, I haven't gotten much quantitatively done on Durst. I've written barely any new code. In part, this is because I keep getting distracted, both with other projects and with every new idea I have for the language - I'm too busy planning Durst to actually make it. This is unfortunate, and I hope I'll get some stuff done over winter break, but for now, I'll talk about the compiler I have been working on: Rust

Rust's compiler is honestly quite ugly, and so I've made a few attempts to clean some of it (specifically the name resolution code) up. However, I've mainly been working on the standard library's data structures. I really enjoy the puzzle of figuring out how to make APIs that are perfectly safe (they make it impossible to, e.g., read uninitialized memory or to use-after-free) yet fully powerful. Recently, I redesigned BTreeMap's internal APIs to make them far safer (and faster).

The core problem is that navigating through a BTree often requires maintaining a stack of the nodes that have been visited to get to the current node. However, any of these nodes could be mutated to invalidate the rest, so Rust won't let us just store a stack of mutable references. As such, there is an internal stack that manages unsafely holding a stack of nodes, only allowing mutation at the top. The issue that arises is figuring out how to push to the stack. If the stack module allowed you to push arbitrary nodes, then it wouldn't work for its designed purpose, namely holding a contiguous path down a BTree. It would be easy to add a node farther down the tree or even from a different tree.

To solve this, had the stack expose the top node with a special 'id lifetime, only accepting a request to push that same node's children. I'm still trying to figure out the best way to explain this, but the PR is here

Backpropagation, memory, and other optimization troubles

In my previous post, I described an intermediate representation that made a number of optimization problems much easier. However, it still has some complications.


Naively, I could just represent memory operations as side effects, modifying the world. However, this is suboptimal for exactly the reasons that I chose to keep operations like addition out of the normal flow - it hinders optimization. As such, I need some other representation of memory. libFIRM takes the approach of having a global "Store" that it passes in and out of every write and into every read. This has a number of large advantages - it naturally expresses that reads can be reordered and that memory operations are independent of input and output. Moreover, if this is extended with a bunch of separate Stores that can be split and merged, this provides a perfect model for aliasing - a Store can be split into two nonaliasing pieces, and operations on one won't affect the other. However, this has the same issue as the state machine representation of control flow - there is nothing preventing all of the memory from being duplicated.

One idea of had to get around this is to make every write to memory produce a read-only Store and to take a "memory continuation" to the next write. This circumvents the issue of duplicated memory, but comes with its own problems. Merging separate stores becomes inexpressible, and I would need a new MemoryPhi node to handle conditionals in the memory pipeline. Given these problems, I am currently leaning towards a state-based approach, but haven't fully decided yet


In a simple list of instructions, it is easy to do the sort of dataflow analysis I described in the previous post. However, with my new IR, it becomes harder, involving a lot of going "backwards". This is because just following the control flow chain will not result in every node being analysed, as many operations are out of the control flow. To remedy this, the optimizer has to be able to take a node whose inputs haven't been analysed, analyse the inputs, and then do another analysis of the use. This is tedious, but it also makes things a bit more consistent - with the switch to continuation-based control flow, analysing later instructions involves analysing the arguments of earlier instructions.

The other place where this sort off backward analysis becomes crucial is in situations like the following:

if x == 5:
  y := x
  y := 5

This should be optimized to y := 5. However, to do this, the optimizer has to know that x is 5 when analysing y := x. This is more difficult than it looks, because it has to extract this information from the result of a comparison. It has to go backward in the tree, realizing that since x == 5 is true, x must be 5.

In general, it seems that the question that the optimizer is in general asking is: "Here is a node. Here are some facts I know to be true. What is the most simplified form of every node that this node depends upon?"

Allocation and random numbers

Ideally, the optimizer should be able to remove unused allocations and reorder allocation. However, I don't see a way to have them not in the control flow, as they do have side effects. Two runs of an allocation function should not return the same result, and so allocation cannot be a pure function. The same issue shows up with random numbers - unless the programmer cares about replayability, the compiler should be able to reorder and remove calls to a random number generator. This problem in both these cases is the same: we have a function that is not idempotent (calling it twice returns two different things) yet is not side-effecting. I honestly have no idea how to deal with this.

A libFIRM-inspired framework for optimization

While just translating code from my new language to assembly is cool and useful, what really intrigues me most about compilers is optimization. Ideally, I'd like my compiler to not just produce binaries, but produce fast binaries. Additionally, I'd like to do this in an extensible way, making it easy to write new optimizations and to have them integrate well with the existing passes.

The most powerful and useful way to integrate analyses and transforms that I have seen is the technique of Lerner, Grove, and Chambers, which interleaves various transforms so that each one can benefit from the simplifications made by the rest. Much of my plan for implementing it follows the strategy taken in Hoopl. In more detail, each optimization pass is made up of two functions: an analysis function and a rewrite function. The rewrite function takes the current dataflow state and transforms an instruction into a simpler form. The analysis function just steps the dataflow state forward. To pass information between optimizations, every rewrite is run on an instruction and then al the analyses are done, making all analyses use the maximally simplified instruction.

As an example as to why this is powerful, consider the interaction between constant propagation and constant folding on the following pseudocode:

x := 5
y := x + 2
z := x + y

If constant propagation is run, then this will result in

x := 5
y := 5 + 2
z := 5 + y

which constant folding will reduce to

x := 5
y := 7
z := 5 + y

This is definitely suboptimal, as z should have been optimized to 12. Switching the order of the two passes does not help, actually making the code worse, as constant folding never simplifies anything.

In contrast, when the two optimizations are interleaved, they produce optimal code. Before constant propagation can analyse y := 5 + 2, constant folding rewrites it to y := 7. Constant propagation recognises this new form, substituting to make z := 5 + 7, which is then folded into 12.

Intermediate Representation

While having a way to optimize code is important, it needs a representation of the code that is easy to analyse to be effective. A simple AST of my language would work, but would be not very amenable to optimization and would not be very general. To defer the questions of register allocation and to make optimization opportunities more obvious, I will use an SSA form representation. This has numerous advantages which I will leave out here, but they are substantial enough that most modern optimizers (GCC and LLVM are notable examples) use it.

Beyond just using an SSA form language, I am using an idea inspired by Cliff Click's graph-based intermediate representation, used in Java's HotSpot VM and in libFIRM. Instead of just having linear sequences of instructions, only side-effecting instructions are put into control flow, obviating the need to rearrange instructions carefully. This idea of separating out the pure computations is exactly what makes Haskell so highly optimized - GHC can do many transformations by knowing that order of computation does not matter.

I deviate from Click in one crucial way, however. His representation treats the world as a state machine, having some "world" object that it passed in and out of every operation that needs too modify the environment. GHC has something similar for its IO representation, using State# RealWorld as this object. This has some unfortunate consequences, though. First of all, infinite loops are no longer well formed. These result in a circular data dependency, not tied to any exit nodes which are used to detect dead code. Secondly, there is nothing stopping some badly designed optimization from "duplicating" the world, trying to do I/O and then roll it back with another world object.

To remedy these issues, I use a continuation passing style implementation of the world. Instead of having each operation take a value from its predecessor, each operation holds a continuation pointing at its successor. This prevents any duplication of world state, as an operation can only point to one successor, and it solves the problem of infinite loops by looking at thing reachable from the start, not the end. Also, dead code analysis is made much simpler, as something is live iff it is reachable from the start node. If statements can be just selecting between two values based on a boolean, with the control flow application involving continuations.


Doing my best with ascii diagrams, here are a few examples of my IR.

x := 13
y := 12
print(x + y)


[Start]   [Const 13]
   |         ^
   V         |
[Print] -> [Add] -> [Const 12]
x := true
if x:
  y := 5
  y := 7


   |  >--> [Const true]
   V  |
[ If  ]
 |  |
 |  | >--> [Const 5]
 V  V |
[ Phi ] --> [Const 7]
  |  ^
  V  |

and is optimized to (assuming x was actually variable)

   |    [Const true]
   |       ^
   |       |    >---> [Const 5]
   V       |    |
[Print] -> [ If  ] ---> [Const 7]
x := 0
loop forever:
  x := x + 1


  | V-----------------------<
  | | >--> [Const 0]        |
  V V |                     |
[ Phi ]--V  [Const 1]       |
  |  ^   |       ^          |
  |  |   >----V  |          |
  |  +------ [Add]          |
  V  |                      |
[Print] --------------------^

Tuesday, September 30, 2014

Durst progress

These last two weeks have been quite slow on the code side of things, primarily because I've spent my time reading up on LLVM and secondarily because I have been less productive. Hopefully, this reasearch will allow me to actually get code generation working. This post is somewhat of a haphazard collection of notes about what I have read.

Unfortunately, the compilation process is very complicated. To create an executable in a scalable way, I must transform the code to LLVM IR, emit an object file, track which object files need to be used, and link them together. The IR creation step seems thankfully quite simple, but object files and linking can be quite confusing. For now, though, I plan on just generating a single object and executable, so I shouldn't have to worry too much.

The bindings to LLVM that I plan on using in the rustc_llvm crate are unstable and badly documented, but they seem to mainly correspond directly to LLVM functions, so I should hopefully be able to just follow the Kaleidoscope tutorial and this implementation of it. As of yet, durst has not control flow, which makes things easier.

Rustc, whose source I've been reading and modifying in hopes of understanding compilers better, is very complex. Unfortunately, I haven't really managed to get much about compiler design yet, since I'm having trouble even getting started with the whole compilation system.

Hopefully, I'll actually have significant code changes to talk about next week.

Monday, September 15, 2014

Durst progress

My new language is moving along slowly but steadily. The parser is now complete, and though I am tempted to rewrite it after reading about Marpa parsing, this would probably be counter-productive. Adding an interpreter was fairly simple, but this is running on an expression AST, not any sort of CFG.

The current plan is to first extend the language with functions, including some basic IO, then move on to compilation. Compilation will be done by first converting to an SSA CFG similar that that of LLVM, then converting to LLVM and letting it do the hard work of actually producing machine code.

Coming up with a good representation for the CFG is a bit difficult, because I want to have want operation to point to its operands, but naively done this is impossible in Rust, which forbids cycles of pointers. As such, I plan to create an (internally unsafe) module for a CFG using raw pointers, and use that.

Once I have basic compilation working, the plan is to play around with optimization by creating a framework similar to Hoopl. I have a number of ideas of how to improve upon that work, including interleaving the backwards and forwards passes and immediately applying transformations that are known to be valid.

Tuesday, September 2, 2014

Durst: Creating a new language in Rust

I've started a new major project, with high and lofty goals that will probably never be achieved. I am creating my own programming language, with the goal to be similar to Rust in many aspects, but ensuring greater safety using dependent types.

So far, I've built a basic compiler executable that calls into a parsing library that I wrote. The executable is fairly simple, but the parsing library was interesting, as Rust gave me goals and restrictions that weren't present in other parsing libraries I've written.

First, I wanted everything to be allocated on the stack - this allows for a compact representation and allows the compiler to perform more optimizations. This was actually pretty easy, as Rust has very simple and easy to use structs and doesn't box anything without explicit requests.

Second, I wanted the parsing framework to support recursion. This was where I started to run into problems, because Rust requires all memory to be initialized. As such, it is impossible to easily create cyclic data structures. Worse, even if I were to use something like RefCell to get around these restrictions, cycles just don't fit well into Rust's ownership model - it is hard to ask who owns whom and to prove that backreferences will only live as long as the forward references.

To get around this problem, I simply required users of recursion to actually create their own implementation of the Parser trait, meaning that they can just create the cyclic parsers on the fly.

By next week, I hope to have a full AST defined and parsing, so that I can move on to interpretation.

Saturday, January 25, 2014

Extending Array Recycling With Delayed Arrays For Index Space Transformations

First, I must apologize. My previous post said that I would write a sequel, explaining my solution to the given problem. However, I didn't, and have since forgotten all my thoughts on the subject. As such, I will not be writing a part two.

With that over with, I get to the subject of this post. The vector library has a very sophisticated fusion system, allowing it to evaluate expressions such as map (+1) (filter p xs // us) using only one new array, instead of the 3 it seems to use. However, there are a large class of operations it cannot optimize properly, namely index space transformations like reverse, backpermute, or even index. By stealing an idea from repa, these functions can be cleanly fused without any change to user code.

The Current Fusion System

The system vector uses for fusion, described in the paper Recycle Your Arrays! by Roman Leshchinskiy (like the stream fusion paper, this is very readable), is based on two intermediate representations that describe how to build the array instead of the actual elements: the Stream describes the array as something akin to Python's generators - it is a loop (stored in a special decomposed format to help the compiler) that can yield elements as it runs, while the New type is a computation (in the ST monad) that allocates and fills a mutable vector. Each of these has their respective advantages and disadvantages - the Stream allows for polymorphic transformations, arbitrary changes to the length, and produces a single tight loop filling the array, and New has random access and update but cannot grow in size or change type.

In Haskell, these types have the following definitions:

data Stream a = forall s . Stream (s -> Step s a) s Size
data Step s a = Yield a s | Skip s | Done

newtype New a = New (forall s . ST s (MVector s a))

To avoid doing multiple passes over the data, the loop in a Stream is represented as an initial state and a state transformation function that corresponds to a single iteration of the loop. Note that the loop is not forced to yield a value on every iteration, allowing the efficient implementation of functions like filter.

Conversion between these representations is achieved by a triplet of functions that go through the three representations in a cyclic fashion:

stream :: Vector a -> Stream a
fill   :: Stream a -> New a
new    :: New a    -> Vector a

Fusion can then be achieved by remove entire cycles, like so:

    "fusion" forall s . stream (new (fill s)) = s
    "recycling" forall n . fill (stream (new n)) = n

As an example of how this works, here is the reduction of map f xs // us:

map f xs // us
  = {- inlining -}
new (fill (mapS f (stream xs))) // us
  = {- inlining -}
new (updateN (fill (stream (new (fill (mapS f (stream xs)))))) us)
  = {- fusion or recycling -}
new (updateN (fill (mapS f (stream xs))) us)

The result only has one new and so only allocates a single vector

The Current Handling of Index Space Transformations

Let us take, for example, the expression reverse (reverse xs). Clearly, this should be optimized into at worst a single copy and at best a noop. However, in the current system, two reversals take place, and two arrays are allocated. This unfortunate fact is documented by a comment in the vector source:

-- FIXME: make this fuse better, add support for recycling

So why does this happen? The fundamental problem is that Streams cannot, in general, be reversed - this would require running a loop backwards. The best the library can do is stream the original vector in reverse - after all, vectors have random access. In fact, this is how reverse is implemented:

reverse = unstream . streamR

where unstream = new . fill and streamR does the aforementioned backwards streaming.

Solving the Problem With Delayed Arrays

repa, more so than vector alone, has an large number of these index space transformations. As such, the original repa paper contains a solution - delayed arrays. The basic idea is that an array can be completely described by its size and its indexing function. This representation is perfect for our index space transformations, as they now are simply function composition. Now, despite this optimization still being in repa today, it has a fundamental problem: the decision of where to use delayed arrays and where to use manifest arrays is forced upon the user. Instead of following this path, let us try to integrate it into the existing recycling framework.

The first step is to create the new representation and corresponding functions:

data Delayed a = Delayed {-# UNPACK #-} !Int (Int -> a)

delay :: Vector a -> Delayed a
delay v = Delayed (length v) (unsafeIndex v)

Now, where in the cycle of representations should the delayed arrays be? How easy it was to convert from a Vector to Delayed implies that it should be between Vector and Stream. In fact, writing the new stream function is surprisingly easy - previously, it was defined as follows:

stream v = v `seq` n `seq` (Stream.unfoldr get 0 `Stream.sized` Exact n)
    n = length v

    {-# INLINE get #-}
    get i | i >= n    = Nothing
          | otherwise = case basicUnsafeIndexM v i of Box x -> Just (x, i+1)

The changed code is actually simpler:

stream (Delayed n ix) = Stream.unfoldr get 0 `Stream.sized` Exact n
    {-# INLINE get #-}
    get i | i >= n    = Nothing
          | otherwise = Just (ix i, i+1)

With that in place, we can finish the framework by updating our rewrite rules:

    "fusion" forall s . stream (delay (new (fill s))) = s
    "recycling" forall n . fill (stream (delay (new n))) = n
    "delaying" forall d . delay (new (fill (stream d))) = d

The last thing to do is implement the new functions:

reverseD :: Delayed a -> Delayed a
reverseD (Delayed n ix) = Delayed n (\i -> ix (n - 1 - i))

reverse :: Vector a -> Vector a
reverse = new . fill . stream . reverseD . delay

-- I don't do bound checks for clarity
indexD :: Delayed a -> Int -> a
indexD (Delayed _ ix) i = ix i

(!) :: Vector a -> Int -> a
v ! i = indexD (delay v) i

Choosing Which Representation to Work On

The change described above, keeping the definitions of all functions that aren't index space transformations the same, has strictly better performance than the current vector. However, this isn't enough. Consider the reduction of reverse (map f xs):

reverse (map f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (mapS f (stream (delay xs)))))))))

Because no full cycle is present, no rules can fire, and an intermediate array is allocated. This is suboptimal, however, as map can also work on delayed arrays:

mapD :: (a -> b) -> Delayed a -> Delayed b
mapD (Delayed n ix) = Delayed n (f . ix)

reverse (map f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (stream (mapD f (delay xs)))))))))
  = {- delaying -}
new (fill (stream (reverseD (mapD f (delay xs)))))

Switching to the delayed representation doesn't work in all cases either. When map is fed a stream, it should ideally run mapS, as using a delayed representation would require an allocation. These problems plague many other functions that similarly can work on multiple representations, such as append, filter, and even reverse, which can operate on New as well as Delayed arrays.

As it turns out, the array recycling paper saw this problem and came up with a solution. Unfortunately, it is very specialized: it only works on single argument Stream-transforming functions that don't increase the size of the array. This works, for example, on filter or monomorphic uses of map, but excludes append and reverse. In addition, it only works for converting between Stream and New. For the general case, we need a new solution:

  "reverseN/fill..." forall f d . reverseN (fill (stream d)) = fill (stream (reverseD d))
  "reverseD/delay..." forall f n . reverseD (delay (new n)) = delay (new (reverseN n))

  "appendS/stream" forall f d1 d2 . appendS (stream d1) (stream d2) = stream (appendD d1 d2)
  "appendD/delay.../1" forall f s1 d2 . appendD (delay (new (fill s1))) d2 = delay (new (fill (appendS s1 (stream d2))))
  "appendD/delay.../2" forall f d1 s2 . appendD d1 (delay (new (fill s2))) = delay (new (fill (appendS (stream d1) s2)))

  "mapS/stream" forall f d . mapS f (stream d) = stream (mapD f d)
  "mapD/delay..." forall f s . mapD f (delay (new (fill s))) = delay (new (fill (mapS f s)))
  "mapD/delay... (monomorphic)" forall f n . mapD f (delay (new n)) = delay (new (transform (mapS f) n))
  "transform/fill" forall f s . transform f (fill s) = fill (f s)
-- and so on

The above code demonstrates three different parts of the system. In the reverse case, we have a simple single argument function that can operate uniformly on multiple representations. As such, all we do is try to move the call inward, or rightward. This moves as many representation-changing functions as possible to the outside of the expression where, all clumped together, they form cycles and can be removed. For example,

reverse (filter f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (filterS f (stream (delay xs)))))))))
  = {- reverseD/delay... -}
new (fill (stream (delay (new (reverseN (fill (filterS f (stream (delay xs)))))))))
  = {- recycling -}
new (reverseN (fill (filterS f (stream (delay xs)))))

The append case follows the same pattern, but it is slightly different in its handling of new. Because new is the only representation changing function that actually does work, adding in an extra stream to have a higher chance of fusing away the allocation is perfectly acceptable. As such, we aggressively move allocations outward even if only one is available. This change also makes the rewrite system confluent. Note that we can't do the same aggressive movement with the other representation changing functions, as we would have to introduce new allocations. In action,

append (filter f xs) (reverse ys)
  = {- inlining -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (delay (new (fill (stream (reverseD (delay ys)))))))))
  = {- delaying -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (reverseD (delay ys)))))
  = {- appendD/delay.../1 -}
new (fill (stream (delay (new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))))))
  = {- recycling -}
new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))

map is an interesting function because it can work on Delayed arrays, Streams, and New, but only if the function preserves types. Because of this, we have two cases: polymorphic and monomorphic. In the polymorphic case, the mapS/stream and mapD/delay... rules cycle through Stream and Delayed. In the monomorphic case, however, there is a different cycle, formed by mapS/stream, mapD/delay... (monomorphic), and transform/fill.

The second phase: fixing smaller inefficiencies

As it turns out, this system is provably optimal in the number of allocations. Unfortunately, allocations aren't the only thing determining performance. A system very similar to the one described above was rejected because of this. It describes the following example:

map (> 5) (map (+1) (xs // ps))
  = {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
  = {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps))))))
  = {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))

Although the result has the optimal two allocations, the two maps are executed in separate loops and so cannot be properly fused. To rectify this, I propose a two-phase system. In phase 0, the previously described system is run, eliminating all unnecessary allocations. In phase 1, we do a number of "fixup" transformations that undo some of the inward movement done in phase 0, like so:

  "stream/mapD" [1] forall f d . stream (mapD f d) = mapS f (stream d)
  "delay.../mapS" [1] forall f s . delay (new (fill (mapS f s))) = mapD f (delay (new (fill s)))

  "stream.../transform" [1] forall f n . stream (delay (new (transform f n))) = f (stream (delay (new n)))

  "delay.../reverseN" [1] forall n . delay (new (reverseN n)) = reverseD (delay (new n))

  "stream/appendD" [1] forall d1 d2 . stream (appendD d1 d2) = appendS (stream d1) (stream d2)

Note that map, which can act equally efficiently on Delayed arrays and Streams, simply moves outward, switching between the two representations in an effort to get out of the way of other transformations. However, reverse runs much more efficiently on Delayed arrays than on New, so only switches away from New. Similarly, append uses Streams and even monomorphic uses of map try to avoid New. To see this working:

map (> 5) (map (+1) (xs // ps))
  = {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
  = {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps))))))
  = {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))
  = {- entering phase 1, stream/mapD -}
new (fill (mapD (> 5) (stream (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))
  = {- stream.../transform -}
new (fill (mapD (> 5) (mapS (+1) (stream (delay (new (update (fill (stream (delay xs))) ps))))))

Future Work

While this system works fairly well, there are a number of ways in which it could be improved.

More representations

While delayed arrays help fusion significantly, other representations could also be useful. For example, the expression reverse (filter f xs) is currently compiled to new (reverseN (fill (filterS f (stream (delay xs))))), which allocates a filtered array then reverses it in place. Ideally, the system would simply write out the array in reverse, requiring no post-processing step. This could be accomplished with a representation similar to Stream (Int, a) which would specify not only what the elements were but where to write them.

Multiple cycles of representations

The aforementioned reverse filling only works if the number of elements has a bound known before calculation. However, due to functions like concatMap, this is not necessarily true. To fix this, there would have to be a static distinction between the two types of Streams. This would cause there to be two different ways of getting from a Delayed array to New, and so there would be multiple possible cycles of representations.

Commutivity optimizations

The current system of fusion puts functions in a strict order, conservatively assuming that each one could do almost anything, and so cannot be reordered. However, this misses out on a lot of opportunities. For example, filter f (reverse (filter g xs)) compiles to

new (fill (filterS f (stream (reverseD (delay (new (fill (filterS g (stream (delay xs))))))))))
instead of to the much more efficient
new (fill (filterS f (filterS g (stream (reverseD (delay xs))))))

Proof of allocation optimality

For those of you who are interested, here is my proof that the simple inward moving strategy produces the optimal number of allocations.

First of all, we define an expression to be a finite rooted tree whose internal verticies are functions, associated with one of a finite set of representations. The leaves and the root are special, abstract nodes that are all associated with the same representation, what we will call the manifest representation. The root has exactly one child. Note that this disallows functions that return multiple results or take different representations as arguments.

Now, assume there is some total order on the set of representations, with the manifest representation being the minimum. We say that an edge of an expression is an allocating edge if the child node's representation is greater than the parent node's representation. We say that the cost of an expression is the number of allocating edges in that expression.

Additionally, every function is associated with a fixed, nonempty set of allowed representations. We say that an expression is valid if every function is associated with one of its allowed representations.

Next we say that the inward optimization is the transformation from expressions to expressions that, at every node, sets its representation to the smallest allowed representation that is greater than or equal to the representations of its children, discounting the children that are larger than all allowed representations. This is done in a bottom up fashion, starting with the nodes just above the leaves, thereby making this transformation independent of the previous representations and so idempotent. Note that this is equivalent to the set of rewrite rules described above, as the rules that don't deal with allocations simply lower the representation of a node if the new representation is greater than or equal to that of all the children, and the aggressive rules that do deal with allocations simply raise the representation to the largest one allowed if a large child is detected.

Lemma 1: The inward optimization returns a valid expression

Because the inward optimization, by definition, assigns a representation to every function that is one of its allowed representations, it returns a valid expression.

Lemma 2: The inward optimization does not increase the cost of an expression if the input was valid

It suffices to prove that transforming a single node does not increase the cost of the expression, as the inward optimization simply transforms the tree one node at a time. Now, we can ignore the edges to the children that have a representation larger than all the allowed representation, as all valid assignments must make those edges allocating, and this optimization does not change that. The inward optimization sets the remaining edges to not be allocating, as, by definition, it picks a representation greater than or equal to the children's representations. Therefore, the only way for the inward transformation to increase the cost would be switching the parent edge to be allocating where previously no edges were allocating. However, if this was the input state, then there exists some allowed representation greater than or equal to those of the children and less than or equal to that of the parent, and the minimum allowed representation greater than or equal to those of the children, which the inward optimization picks, would satisfy this property.

Alternatively, given the correspondence with the rewrite rules given earlier in this post and the fact that none of those rewrite rules increase the number of allocations, the repeated use of those rules and so the inward optimization doesn't increase the number of allocations.

The final proof: The inward optimization produces a minimum cost expression

Consider a valid assignment of representations with minimum cost. Applying the inward optimization to it cannot increase the cost, and as we started with a minimum cost expression, it cannot decrease the cost. Therefore, the result must also be minimum cost. However, the inward optimization is independent of the starting representations, so applying the inward optimization to any tree with the same functions produces a minimum cost expression.