First, I must apologize. My previous post said that I would write a sequel, explaining my solution to the given problem. However, I didn't, and have since forgotten all my thoughts on the subject. As such, I will not be writing a part two.

With that over with, I get to the subject of this post. The `vector`

library has a very sophisticated fusion system, allowing it to evaluate expressions such as `map (+1) (filter p xs // us)`

using only one new array, instead of the 3 it seems to use. However, there are a large class of operations it cannot optimize properly, namely index space transformations like `reverse`

, `backpermute`

, or even `index`

. By stealing an idea from `repa`

, these functions can be cleanly fused without any change to user code.

## The Current Fusion System

The system `vector`

uses for fusion, described in the paper Recycle Your Arrays! by Roman Leshchinskiy (like the stream fusion paper, this is very readable), is based on two intermediate representations that describe how to build the array instead of the actual elements: the `Stream`

describes the array as something akin to Python's generators - it is a loop (stored in a special decomposed format to help the compiler) that can yield elements as it runs, while the `New`

type is a computation (in the `ST`

monad) that allocates and fills a mutable vector. Each of these has their respective advantages and disadvantages - the `Stream`

allows for polymorphic transformations, arbitrary changes to the length, and produces a single tight loop filling the array, and `New`

has random access and update but cannot grow in size or change type.

In Haskell, these types have the following definitions:

data Stream a = forall s . Stream (s -> Step s a) s Size
data Step s a = Yield a s | Skip s | Done
newtype New a = New (forall s . ST s (MVector s a))

To avoid doing multiple passes over the data, the loop in a `Stream`

is represented as an initial state and a state transformation function that corresponds to a single iteration of the loop. Note that the loop is not forced to yield a value on every iteration, allowing the efficient implementation of functions like `filter`

.

Conversion between these representations is achieved by a triplet of functions that go through the three representations in a cyclic fashion:

stream :: Vector a -> Stream a
fill :: Stream a -> New a
new :: New a -> Vector a

Fusion can then be achieved by remove entire cycles, like so:

{-# RULES
"fusion" forall s . stream (new (fill s)) = s
"recycling" forall n . fill (stream (new n)) = n
#-}

As an example of how this works, here is the reduction of `map f xs // us`

:

map f xs // us
= {- inlining -}
new (fill (mapS f (stream xs))) // us
= {- inlining -}
new (updateN (fill (stream (new (fill (mapS f (stream xs)))))) us)
= {- fusion or recycling -}
new (updateN (fill (mapS f (stream xs))) us)

The result only has one `new`

and so only allocates a single vector

## The Current Handling of Index Space Transformations

Let us take, for example, the expression `reverse (reverse xs)`

. Clearly, this should be optimized into at worst a single copy and at best a noop. However, in the current system, two reversals take place, and two arrays are allocated. This unfortunate fact is documented by a comment in the `vector`

source:

-- FIXME: make this fuse better, add support for recycling

So why does this happen? The fundamental problem is that `Stream`

s cannot, in general, be reversed - this would require running a loop backwards. The best the library can do is stream the original vector in reverse - after all, vectors have random access. In fact, this is how `reverse`

is implemented:

reverse = unstream . streamR

where `unstream = new . fill`

and `streamR`

does the aforementioned backwards streaming.

## Solving the Problem With Delayed Arrays

`repa`

, more so than `vector`

alone, has an large number of these index space transformations. As such, the original `repa`

paper contains a solution - delayed arrays. The basic idea is that an array can be completely described by its size and its indexing function. This representation is perfect for our index space transformations, as they now are simply function composition. Now, despite this optimization still being in `repa`

today, it has a fundamental problem: the decision of where to use delayed arrays and where to use manifest arrays is forced upon the user. Instead of following this path, let us try to integrate it into the existing recycling framework.

The first step is to create the new representation and corresponding functions:

data Delayed a = Delayed {-# UNPACK #-} !Int (Int -> a)
delay :: Vector a -> Delayed a
delay v = Delayed (length v) (unsafeIndex v)

Now, where in the cycle of representations should the delayed arrays be? How easy it was to convert from a `Vector`

to `Delayed`

implies that it should be between `Vector`

and `Stream`

. In fact, writing the new `stream`

function is surprisingly easy - previously, it was defined as follows:

stream v = v `seq` n `seq` (Stream.unfoldr get 0 `Stream.sized` Exact n)
where
n = length v
{-# INLINE get #-}
get i | i >= n = Nothing
| otherwise = case basicUnsafeIndexM v i of Box x -> Just (x, i+1)

The changed code is actually simpler:

stream (Delayed n ix) = Stream.unfoldr get 0 `Stream.sized` Exact n
where
{-# INLINE get #-}
get i | i >= n = Nothing
| otherwise = Just (ix i, i+1)

With that in place, we can finish the framework by updating our rewrite rules:

{-# RULES
"fusion" forall s . stream (delay (new (fill s))) = s
"recycling" forall n . fill (stream (delay (new n))) = n
"delaying" forall d . delay (new (fill (stream d))) = d
#-}

The last thing to do is implement the new functions:

reverseD :: Delayed a -> Delayed a
reverseD (Delayed n ix) = Delayed n (\i -> ix (n - 1 - i))
reverse :: Vector a -> Vector a
reverse = new . fill . stream . reverseD . delay
-- I don't do bound checks for clarity
indexD :: Delayed a -> Int -> a
indexD (Delayed _ ix) i = ix i
(!) :: Vector a -> Int -> a
v ! i = indexD (delay v) i

## Choosing Which Representation to Work On

The change described above, keeping the definitions of all functions that aren't index space transformations the same, has strictly better performance than the current `vector`

. However, this isn't enough. Consider the reduction of `reverse (map f xs)`

:

```
reverse (map f xs)
= {- inlining -}
new (fill (stream (reverseD (delay (new (fill (mapS f (stream (delay xs)))))))))
```

Because no full cycle is present, no rules can fire, and an intermediate array is allocated. This is suboptimal, however, as `map`

can also work on delayed arrays:

mapD :: (a -> b) -> Delayed a -> Delayed b
mapD (Delayed n ix) = Delayed n (f . ix)
reverse (map f xs)
= {- inlining -}
new (fill (stream (reverseD (delay (new (fill (stream (mapD f (delay xs)))))))))
= {- delaying -}
new (fill (stream (reverseD (mapD f (delay xs)))))

Switching to the delayed representation doesn't work in all cases either. When `map`

is fed a stream, it should ideally run `mapS`

, as using a delayed representation would require an allocation. These problems plague many other functions that similarly can work on multiple representations, such as `append`

, `filter`

, and even `reverse`

, which can operate on `New`

as well as `Delayed`

arrays.

As it turns out, the array recycling paper saw this problem and came up with a solution. Unfortunately, it is very specialized: it only works on single argument `Stream`

-transforming functions that don't increase the size of the array. This works, for example, on `filter`

or monomorphic uses of `map`

, but excludes `append`

and `reverse`

. In addition, it only works for converting between `Stream`

and `New`

. For the general case, we need a new solution:

{-# RULES
"reverseN/fill..." forall f d . reverseN (fill (stream d)) = fill (stream (reverseD d))
"reverseD/delay..." forall f n . reverseD (delay (new n)) = delay (new (reverseN n))
"appendS/stream" forall f d1 d2 . appendS (stream d1) (stream d2) = stream (appendD d1 d2)
"appendD/delay.../1" forall f s1 d2 . appendD (delay (new (fill s1))) d2 = delay (new (fill (appendS s1 (stream d2))))
"appendD/delay.../2" forall f d1 s2 . appendD d1 (delay (new (fill s2))) = delay (new (fill (appendS (stream d1) s2)))
"mapS/stream" forall f d . mapS f (stream d) = stream (mapD f d)
"mapD/delay..." forall f s . mapD f (delay (new (fill s))) = delay (new (fill (mapS f s)))
"mapD/delay... (monomorphic)" forall f n . mapD f (delay (new n)) = delay (new (transform (mapS f) n))
"transform/fill" forall f s . transform f (fill s) = fill (f s)
#-}
-- and so on

The above code demonstrates three different parts of the system. In the `reverse`

case, we have a simple single argument function that can operate uniformly on multiple representations. As such, all we do is try to move the call inward, or rightward. This moves as many representation-changing functions as possible to the outside of the expression where, all clumped together, they form cycles and can be removed. For example,

reverse (filter f xs)
= {- inlining -}
new (fill (stream (reverseD (delay (new (fill (filterS f (stream (delay xs)))))))))
= {- reverseD/delay... -}
new (fill (stream (delay (new (reverseN (fill (filterS f (stream (delay xs)))))))))
= {- recycling -}
new (reverseN (fill (filterS f (stream (delay xs)))))

The `append`

case follows the same pattern, but it is slightly different in its handling of `new`

. Because `new`

is the only representation changing function that actually does work, adding in an extra `stream`

to have a higher chance of fusing away the allocation is perfectly acceptable. As such, we aggressively move allocations outward even if only one is available. This change also makes the rewrite system confluent. Note that we can't do the same aggressive movement with the other representation changing functions, as we would have to introduce new allocations. In action,

append (filter f xs) (reverse ys)
= {- inlining -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (delay (new (fill (stream (reverseD (delay ys)))))))))
= {- delaying -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (reverseD (delay ys)))))
= {- appendD/delay.../1 -}
new (fill (stream (delay (new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))))))
= {- recycling -}
new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))

`map`

is an interesting function because it can work on `Delayed`

arrays, `Stream`

s, and `New`

`, but only if the function preserves types. Because of this, we have two cases: polymorphic and monomorphic. In the polymorphic case, the `

`mapS/stream`

and `mapD/delay...`

rules cycle through `Stream`

and `Delayed`

. In the monomorphic case, however, there is a different cycle, formed by `mapS/stream`

, `mapD/delay... (monomorphic)`

, and `transform/fill`

.

```
```## The second phase: fixing smaller inefficiencies

As it turns out, this system is provably optimal in the number of allocations. Unfortunately, allocations aren't the only thing determining performance. A system very similar to the one described above was rejected because of this. It describes the following example:

map (> 5) (map (+1) (xs // ps))
= {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
= {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps))))))
= {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))

Although the result has the optimal two allocations, the two `map`

s are executed in separate loops and so cannot be properly fused. To rectify this, I propose a two-phase system. In phase 0, the previously described system is run, eliminating all unnecessary allocations. In phase 1, we do a number of "fixup" transformations that undo some of the inward movement done in phase 0, like so:

{-# RULES
"stream/mapD" [1] forall f d . stream (mapD f d) = mapS f (stream d)
"delay.../mapS" [1] forall f s . delay (new (fill (mapS f s))) = mapD f (delay (new (fill s)))
"stream.../transform" [1] forall f n . stream (delay (new (transform f n))) = f (stream (delay (new n)))
"delay.../reverseN" [1] forall n . delay (new (reverseN n)) = reverseD (delay (new n))
"stream/appendD" [1] forall d1 d2 . stream (appendD d1 d2) = appendS (stream d1) (stream d2)

Note that `map`

, which can act equally efficiently on `Delayed`

arrays and `Stream`

s, simply moves outward, switching between the two representations in an effort to get out of the way of other transformations. However, `reverse`

runs much more efficiently on `Delayed`

arrays than on `New`

, so only switches away from `New`

. Similarly, `append`

uses `Stream`

s and even monomorphic uses of `map`

try to avoid `New`

. To see this working:

map (> 5) (map (+1) (xs // ps))
= {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
= {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps))))))
= {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))
= {- entering phase 1, stream/mapD -}
new (fill (mapD (> 5) (stream (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))
= {- stream.../transform -}
new (fill (mapD (> 5) (mapS (+1) (stream (delay (new (update (fill (stream (delay xs))) ps))))))

## Future Work

While this system works fairly well, there are a number of ways in which it could be improved.

### More representations

While delayed arrays help fusion significantly, other representations could also be useful. For example, the expression `reverse (filter f xs)`

is currently compiled to `new (reverseN (fill (filterS f (stream (delay xs)))))`

, which allocates a filtered array then reverses it in place. Ideally, the system would simply write out the array in reverse, requiring no post-processing step. This could be accomplished with a representation similar to `Stream (Int, a)`

which would specify not only what the elements were but where to write them.

### Multiple cycles of representations

The aforementioned reverse filling only works if the number of elements has a bound known before calculation. However, due to functions like `concatMap`

, this is not necessarily true. To fix this, there would have to be a static distinction between the two types of `Stream`

s. This would cause there to be two different ways of getting from a `Delayed`

array to `New`

, and so there would be multiple possible cycles of representations.

### Commutivity optimizations

The current system of fusion puts functions in a strict order, conservatively assuming that each one could do almost anything, and so cannot be reordered. However, this misses out on a lot of opportunities. For example, `filter f (reverse (filter g xs))`

compiles to

new (fill (filterS f (stream (reverseD (delay (new (fill (filterS g (stream (delay xs))))))))))

instead of to the much more efficient
new (fill (filterS f (filterS g (stream (reverseD (delay xs))))))

## Proof of allocation optimality

For those of you who are interested, here is my proof that the simple inward moving strategy produces the optimal number of allocations.

First of all, we define an expression to be a finite rooted tree whose internal verticies are functions, associated with one of a finite set of representations. The leaves and the root are special, abstract nodes that are all associated with the same representation, what we will call the manifest representation. The root has exactly one child. Note that this disallows functions that return multiple results or take different representations as arguments.

Now, assume there is some total order on the set of representations, with the manifest representation being the minimum. We say that an edge of an expression is an allocating edge if the child node's representation is greater than the parent node's representation. We say that the cost of an expression is the number of allocating edges in that expression.

Additionally, every function is associated with a fixed, nonempty set of allowed representations. We say that an expression is valid if every function is associated with one of its allowed representations.

Next we say that the inward optimization is the transformation from expressions to expressions that, at every node, sets its representation to the smallest allowed representation that is greater than or equal to the representations of its children, discounting the children that are larger than all allowed representations. This is done in a bottom up fashion, starting with the nodes just above the leaves, thereby making this transformation independent of the previous representations and so idempotent. Note that this is equivalent to the set of rewrite rules described above, as the rules that don't deal with allocations simply lower the representation of a node if the new representation is greater than or equal to that of all the children, and the aggressive rules that do deal with allocations simply raise the representation to the largest one allowed if a large child is detected.

### Lemma 1: The inward optimization returns a valid expression

Because the inward optimization, by definition, assigns a representation to every function that is one of its allowed representations, it returns a valid expression.

### Lemma 2: The inward optimization does not increase the cost of an expression if the input was valid

It suffices to prove that transforming a single node does not increase the cost of the expression, as the inward optimization simply transforms the tree one node at a time. Now, we can ignore the edges to the children that have a representation larger than all the allowed representation, as all valid assignments must make those edges allocating, and this optimization does not change that. The inward optimization sets the remaining edges to not be allocating, as, by definition, it picks a representation greater than or equal to the children's representations. Therefore, the only way for the inward transformation to increase the cost would be switching the parent edge to be allocating where previously no edges were allocating. However, if this was the input state, then there exists some allowed representation greater than or equal to those of the children and less than or equal to that of the parent, and the minimum allowed representation greater than or equal to those of the children, which the inward optimization picks, would satisfy this property.

Alternatively, given the correspondence with the rewrite rules given earlier in this post and the fact that none of those rewrite rules increase the number of allocations, the repeated use of those rules and so the inward optimization doesn't increase the number of allocations.

### The final proof: The inward optimization produces a minimum cost expression

Consider a valid assignment of representations with minimum cost. Applying the inward optimization to it cannot increase the cost, and as we started with a minimum cost expression, it cannot decrease the cost. Therefore, the result must also be minimum cost. However, the inward optimization is independent of the starting representations, so applying the inward optimization to any tree with the same functions produces a minimum cost expression.

## No comments:

## Post a Comment