Saturday, January 25, 2014

Extending Array Recycling With Delayed Arrays For Index Space Transformations

First, I must apologize. My previous post said that I would write a sequel, explaining my solution to the given problem. However, I didn't, and have since forgotten all my thoughts on the subject. As such, I will not be writing a part two.

With that over with, I get to the subject of this post. The vector library has a very sophisticated fusion system, allowing it to evaluate expressions such as map (+1) (filter p xs // us) using only one new array, instead of the 3 it seems to use. However, there are a large class of operations it cannot optimize properly, namely index space transformations like reverse, backpermute, or even index. By stealing an idea from repa, these functions can be cleanly fused without any change to user code.

The Current Fusion System

The system vector uses for fusion, described in the paper Recycle Your Arrays! by Roman Leshchinskiy (like the stream fusion paper, this is very readable), is based on two intermediate representations that describe how to build the array instead of the actual elements: the Stream describes the array as something akin to Python's generators - it is a loop (stored in a special decomposed format to help the compiler) that can yield elements as it runs, while the New type is a computation (in the ST monad) that allocates and fills a mutable vector. Each of these has their respective advantages and disadvantages - the Stream allows for polymorphic transformations, arbitrary changes to the length, and produces a single tight loop filling the array, and New has random access and update but cannot grow in size or change type.

In Haskell, these types have the following definitions:

data Stream a = forall s . Stream (s -> Step s a) s Size
data Step s a = Yield a s | Skip s | Done

newtype New a = New (forall s . ST s (MVector s a))

To avoid doing multiple passes over the data, the loop in a Stream is represented as an initial state and a state transformation function that corresponds to a single iteration of the loop. Note that the loop is not forced to yield a value on every iteration, allowing the efficient implementation of functions like filter.

Conversion between these representations is achieved by a triplet of functions that go through the three representations in a cyclic fashion:

stream :: Vector a -> Stream a
fill   :: Stream a -> New a
new    :: New a    -> Vector a

Fusion can then be achieved by remove entire cycles, like so:

    "fusion" forall s . stream (new (fill s)) = s
    "recycling" forall n . fill (stream (new n)) = n

As an example of how this works, here is the reduction of map f xs // us:

map f xs // us
  = {- inlining -}
new (fill (mapS f (stream xs))) // us
  = {- inlining -}
new (updateN (fill (stream (new (fill (mapS f (stream xs)))))) us)
  = {- fusion or recycling -}
new (updateN (fill (mapS f (stream xs))) us)

The result only has one new and so only allocates a single vector

The Current Handling of Index Space Transformations

Let us take, for example, the expression reverse (reverse xs). Clearly, this should be optimized into at worst a single copy and at best a noop. However, in the current system, two reversals take place, and two arrays are allocated. This unfortunate fact is documented by a comment in the vector source:

-- FIXME: make this fuse better, add support for recycling

So why does this happen? The fundamental problem is that Streams cannot, in general, be reversed - this would require running a loop backwards. The best the library can do is stream the original vector in reverse - after all, vectors have random access. In fact, this is how reverse is implemented:

reverse = unstream . streamR

where unstream = new . fill and streamR does the aforementioned backwards streaming.

Solving the Problem With Delayed Arrays

repa, more so than vector alone, has an large number of these index space transformations. As such, the original repa paper contains a solution - delayed arrays. The basic idea is that an array can be completely described by its size and its indexing function. This representation is perfect for our index space transformations, as they now are simply function composition. Now, despite this optimization still being in repa today, it has a fundamental problem: the decision of where to use delayed arrays and where to use manifest arrays is forced upon the user. Instead of following this path, let us try to integrate it into the existing recycling framework.

The first step is to create the new representation and corresponding functions:

data Delayed a = Delayed {-# UNPACK #-} !Int (Int -> a)

delay :: Vector a -> Delayed a
delay v = Delayed (length v) (unsafeIndex v)

Now, where in the cycle of representations should the delayed arrays be? How easy it was to convert from a Vector to Delayed implies that it should be between Vector and Stream. In fact, writing the new stream function is surprisingly easy - previously, it was defined as follows:

stream v = v `seq` n `seq` (Stream.unfoldr get 0 `Stream.sized` Exact n)
    n = length v

    {-# INLINE get #-}
    get i | i >= n    = Nothing
          | otherwise = case basicUnsafeIndexM v i of Box x -> Just (x, i+1)

The changed code is actually simpler:

stream (Delayed n ix) = Stream.unfoldr get 0 `Stream.sized` Exact n
    {-# INLINE get #-}
    get i | i >= n    = Nothing
          | otherwise = Just (ix i, i+1)

With that in place, we can finish the framework by updating our rewrite rules:

    "fusion" forall s . stream (delay (new (fill s))) = s
    "recycling" forall n . fill (stream (delay (new n))) = n
    "delaying" forall d . delay (new (fill (stream d))) = d

The last thing to do is implement the new functions:

reverseD :: Delayed a -> Delayed a
reverseD (Delayed n ix) = Delayed n (\i -> ix (n - 1 - i))

reverse :: Vector a -> Vector a
reverse = new . fill . stream . reverseD . delay

-- I don't do bound checks for clarity
indexD :: Delayed a -> Int -> a
indexD (Delayed _ ix) i = ix i

(!) :: Vector a -> Int -> a
v ! i = indexD (delay v) i

Choosing Which Representation to Work On

The change described above, keeping the definitions of all functions that aren't index space transformations the same, has strictly better performance than the current vector. However, this isn't enough. Consider the reduction of reverse (map f xs):

reverse (map f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (mapS f (stream (delay xs)))))))))

Because no full cycle is present, no rules can fire, and an intermediate array is allocated. This is suboptimal, however, as map can also work on delayed arrays:

mapD :: (a -> b) -> Delayed a -> Delayed b
mapD (Delayed n ix) = Delayed n (f . ix)

reverse (map f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (stream (mapD f (delay xs)))))))))
  = {- delaying -}
new (fill (stream (reverseD (mapD f (delay xs)))))

Switching to the delayed representation doesn't work in all cases either. When map is fed a stream, it should ideally run mapS, as using a delayed representation would require an allocation. These problems plague many other functions that similarly can work on multiple representations, such as append, filter, and even reverse, which can operate on New as well as Delayed arrays.

As it turns out, the array recycling paper saw this problem and came up with a solution. Unfortunately, it is very specialized: it only works on single argument Stream-transforming functions that don't increase the size of the array. This works, for example, on filter or monomorphic uses of map, but excludes append and reverse. In addition, it only works for converting between Stream and New. For the general case, we need a new solution:

  "reverseN/fill..." forall f d . reverseN (fill (stream d)) = fill (stream (reverseD d))
  "reverseD/delay..." forall f n . reverseD (delay (new n)) = delay (new (reverseN n))

  "appendS/stream" forall f d1 d2 . appendS (stream d1) (stream d2) = stream (appendD d1 d2)
  "appendD/delay.../1" forall f s1 d2 . appendD (delay (new (fill s1))) d2 = delay (new (fill (appendS s1 (stream d2))))
  "appendD/delay.../2" forall f d1 s2 . appendD d1 (delay (new (fill s2))) = delay (new (fill (appendS (stream d1) s2)))

  "mapS/stream" forall f d . mapS f (stream d) = stream (mapD f d)
  "mapD/delay..." forall f s . mapD f (delay (new (fill s))) = delay (new (fill (mapS f s)))
  "mapD/delay... (monomorphic)" forall f n . mapD f (delay (new n)) = delay (new (transform (mapS f) n))
  "transform/fill" forall f s . transform f (fill s) = fill (f s)
-- and so on

The above code demonstrates three different parts of the system. In the reverse case, we have a simple single argument function that can operate uniformly on multiple representations. As such, all we do is try to move the call inward, or rightward. This moves as many representation-changing functions as possible to the outside of the expression where, all clumped together, they form cycles and can be removed. For example,

reverse (filter f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (filterS f (stream (delay xs)))))))))
  = {- reverseD/delay... -}
new (fill (stream (delay (new (reverseN (fill (filterS f (stream (delay xs)))))))))
  = {- recycling -}
new (reverseN (fill (filterS f (stream (delay xs)))))

The append case follows the same pattern, but it is slightly different in its handling of new. Because new is the only representation changing function that actually does work, adding in an extra stream to have a higher chance of fusing away the allocation is perfectly acceptable. As such, we aggressively move allocations outward even if only one is available. This change also makes the rewrite system confluent. Note that we can't do the same aggressive movement with the other representation changing functions, as we would have to introduce new allocations. In action,

append (filter f xs) (reverse ys)
  = {- inlining -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (delay (new (fill (stream (reverseD (delay ys)))))))))
  = {- delaying -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (reverseD (delay ys)))))
  = {- appendD/delay.../1 -}
new (fill (stream (delay (new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))))))
  = {- recycling -}
new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))

map is an interesting function because it can work on Delayed arrays, Streams, and New, but only if the function preserves types. Because of this, we have two cases: polymorphic and monomorphic. In the polymorphic case, the mapS/stream and mapD/delay... rules cycle through Stream and Delayed. In the monomorphic case, however, there is a different cycle, formed by mapS/stream, mapD/delay... (monomorphic), and transform/fill.

The second phase: fixing smaller inefficiencies

As it turns out, this system is provably optimal in the number of allocations. Unfortunately, allocations aren't the only thing determining performance. A system very similar to the one described above was rejected because of this. It describes the following example:

map (> 5) (map (+1) (xs // ps))
  = {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
  = {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps))))))
  = {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))

Although the result has the optimal two allocations, the two maps are executed in separate loops and so cannot be properly fused. To rectify this, I propose a two-phase system. In phase 0, the previously described system is run, eliminating all unnecessary allocations. In phase 1, we do a number of "fixup" transformations that undo some of the inward movement done in phase 0, like so:

  "stream/mapD" [1] forall f d . stream (mapD f d) = mapS f (stream d)
  "delay.../mapS" [1] forall f s . delay (new (fill (mapS f s))) = mapD f (delay (new (fill s)))

  "stream.../transform" [1] forall f n . stream (delay (new (transform f n))) = f (stream (delay (new n)))

  "delay.../reverseN" [1] forall n . delay (new (reverseN n)) = reverseD (delay (new n))

  "stream/appendD" [1] forall d1 d2 . stream (appendD d1 d2) = appendS (stream d1) (stream d2)

Note that map, which can act equally efficiently on Delayed arrays and Streams, simply moves outward, switching between the two representations in an effort to get out of the way of other transformations. However, reverse runs much more efficiently on Delayed arrays than on New, so only switches away from New. Similarly, append uses Streams and even monomorphic uses of map try to avoid New. To see this working:

map (> 5) (map (+1) (xs // ps))
  = {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
  = {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps))))))
  = {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))
  = {- entering phase 1, stream/mapD -}
new (fill (mapD (> 5) (stream (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps))))))
  = {- stream.../transform -}
new (fill (mapD (> 5) (mapS (+1) (stream (delay (new (update (fill (stream (delay xs))) ps))))))

Future Work

While this system works fairly well, there are a number of ways in which it could be improved.

More representations

While delayed arrays help fusion significantly, other representations could also be useful. For example, the expression reverse (filter f xs) is currently compiled to new (reverseN (fill (filterS f (stream (delay xs))))), which allocates a filtered array then reverses it in place. Ideally, the system would simply write out the array in reverse, requiring no post-processing step. This could be accomplished with a representation similar to Stream (Int, a) which would specify not only what the elements were but where to write them.

Multiple cycles of representations

The aforementioned reverse filling only works if the number of elements has a bound known before calculation. However, due to functions like concatMap, this is not necessarily true. To fix this, there would have to be a static distinction between the two types of Streams. This would cause there to be two different ways of getting from a Delayed array to New, and so there would be multiple possible cycles of representations.

Commutivity optimizations

The current system of fusion puts functions in a strict order, conservatively assuming that each one could do almost anything, and so cannot be reordered. However, this misses out on a lot of opportunities. For example, filter f (reverse (filter g xs)) compiles to

new (fill (filterS f (stream (reverseD (delay (new (fill (filterS g (stream (delay xs))))))))))
instead of to the much more efficient
new (fill (filterS f (filterS g (stream (reverseD (delay xs))))))

Proof of allocation optimality

For those of you who are interested, here is my proof that the simple inward moving strategy produces the optimal number of allocations.

First of all, we define an expression to be a finite rooted tree whose internal verticies are functions, associated with one of a finite set of representations. The leaves and the root are special, abstract nodes that are all associated with the same representation, what we will call the manifest representation. The root has exactly one child. Note that this disallows functions that return multiple results or take different representations as arguments.

Now, assume there is some total order on the set of representations, with the manifest representation being the minimum. We say that an edge of an expression is an allocating edge if the child node's representation is greater than the parent node's representation. We say that the cost of an expression is the number of allocating edges in that expression.

Additionally, every function is associated with a fixed, nonempty set of allowed representations. We say that an expression is valid if every function is associated with one of its allowed representations.

Next we say that the inward optimization is the transformation from expressions to expressions that, at every node, sets its representation to the smallest allowed representation that is greater than or equal to the representations of its children, discounting the children that are larger than all allowed representations. This is done in a bottom up fashion, starting with the nodes just above the leaves, thereby making this transformation independent of the previous representations and so idempotent. Note that this is equivalent to the set of rewrite rules described above, as the rules that don't deal with allocations simply lower the representation of a node if the new representation is greater than or equal to that of all the children, and the aggressive rules that do deal with allocations simply raise the representation to the largest one allowed if a large child is detected.

Lemma 1: The inward optimization returns a valid expression

Because the inward optimization, by definition, assigns a representation to every function that is one of its allowed representations, it returns a valid expression.

Lemma 2: The inward optimization does not increase the cost of an expression if the input was valid

It suffices to prove that transforming a single node does not increase the cost of the expression, as the inward optimization simply transforms the tree one node at a time. Now, we can ignore the edges to the children that have a representation larger than all the allowed representation, as all valid assignments must make those edges allocating, and this optimization does not change that. The inward optimization sets the remaining edges to not be allocating, as, by definition, it picks a representation greater than or equal to the children's representations. Therefore, the only way for the inward transformation to increase the cost would be switching the parent edge to be allocating where previously no edges were allocating. However, if this was the input state, then there exists some allowed representation greater than or equal to those of the children and less than or equal to that of the parent, and the minimum allowed representation greater than or equal to those of the children, which the inward optimization picks, would satisfy this property.

Alternatively, given the correspondence with the rewrite rules given earlier in this post and the fact that none of those rewrite rules increase the number of allocations, the repeated use of those rules and so the inward optimization doesn't increase the number of allocations.

The final proof: The inward optimization produces a minimum cost expression

Consider a valid assignment of representations with minimum cost. Applying the inward optimization to it cannot increase the cost, and as we started with a minimum cost expression, it cannot decrease the cost. Therefore, the result must also be minimum cost. However, the inward optimization is independent of the starting representations, so applying the inward optimization to any tree with the same functions produces a minimum cost expression.