Monday, September 15, 2014

Durst progress

My new language is moving along slowly but steadily. The parser is now complete, and though I am tempted to rewrite it after reading about Marpa parsing, this would probably be counter-productive. Adding an interpreter was fairly simple, but this is running on an expression AST, not any sort of CFG.

The current plan is to first extend the language with functions, including some basic IO, then move on to compilation. Compilation will be done by first converting to an SSA CFG similar to that of LLVM, then converting to LLVM and letting it do the hard work of actually producing machine code.

Coming up with a good representation for the CFG is a bit difficult, because I want each operation to point to its operands, but naively done this is impossible in Rust, which forbids cycles of pointers. As such, I plan to create an (internally unsafe) module for a CFG using raw pointers, and use that.

Once I have basic compilation working, the plan is to play around with optimization by creating a framework similar to Hoopl. I have a number of ideas of how to improve upon that work, including interleaving the backwards and forwards passes and immediately applying transformations that are known to be valid.

Tuesday, September 2, 2014

Durst: Creating a new language in Rust

I've started a new major project, with high and lofty goals that will probably never be achieved. I am creating my own programming language, with the goal to be similar to Rust in many aspects, but ensuring greater safety using dependent types.

So far, I've built a basic compiler executable that calls into a parsing library that I wrote. The executable is fairly simple, but the parsing library was interesting, as Rust gave me goals and restrictions that weren't present in other parsing libraries I've written.

First, I wanted everything to be allocated on the stack - this allows for a compact representation and allows the compiler to perform more optimizations. This was actually pretty easy, as Rust has very simple and easy to use structs and doesn't box anything without explicit requests.

Second, I wanted the parsing framework to support recursion. This was where I started to run into problems, because Rust requires all memory to be initialized. As such, it is impossible to easily create cyclic data structures. Worse, even if I were to use something like RefCell to get around these restrictions, cycles just don't fit well into Rust's ownership model - it is hard to ask who owns whom and to prove that backreferences will only live as long as the forward references.

To get around this problem, I simply required users of recursion to actually create their own implementation of the Parser trait, meaning that they can just create the cyclic parsers on the fly.

By next week, I hope to have a full AST defined and parsing, so that I can move on to interpretation.

Saturday, January 25, 2014

Extending Array Recycling With Delayed Arrays For Index Space Transformations

First, I must apologize. My previous post said that I would write a sequel, explaining my solution to the given problem. However, I didn't, and have since forgotten all my thoughts on the subject. As such, I will not be writing a part two.

With that over with, I get to the subject of this post. The vector library has a very sophisticated fusion system, allowing it to evaluate expressions such as map (+1) (filter p xs // us) using only one new array, instead of the 3 it seems to use. However, there is a large class of operations it cannot optimize properly, namely index space transformations like reverse, backpermute, or even index. By stealing an idea from repa, these functions can be cleanly fused without any change to user code.

The Current Fusion System

The system vector uses for fusion, described in the paper Recycle Your Arrays! by Roman Leshchinskiy (like the stream fusion paper, this is very readable), is based on two intermediate representations that describe how to build the array instead of the actual elements. The Stream describes the array as something akin to Python's generators: it is a loop (stored in a special decomposed format to help the compiler) that can yield elements as it runs. The New type is a computation (in the ST monad) that allocates and fills a mutable vector. Each has its own advantages and disadvantages: the Stream allows for polymorphic transformations and arbitrary changes to the length, and produces a single tight loop filling the array, while New has random access and update but cannot grow in size or change type.

In Haskell, these types have the following definitions:

data Stream a = forall s . Stream (s -> Step s a) s Size
data Step s a = Yield a s | Skip s | Done

newtype New a = New (forall s . ST s (MVector s a))

To avoid doing multiple passes over the data, the loop in a Stream is represented as an initial state and a state transformation function that corresponds to a single iteration of the loop. Note that the loop is not forced to yield a value on every iteration, allowing the efficient implementation of functions like filter.

Conversion between these representations is achieved by a triplet of functions that go through the three representations in a cyclic fashion:

stream :: Vector a -> Stream a
fill   :: Stream a -> New a
new    :: New a    -> Vector a

Fusion can then be achieved by removing entire cycles, like so:

{-# RULES
    "fusion" forall s . stream (new (fill s)) = s
    "recycling" forall n . fill (stream (new n)) = n
  #-}

As an example of how this works, here is the reduction of map f xs // us:

map f xs // us
  = {- inlining -}
new (fill (mapS f (stream xs))) // us
  = {- inlining -}
new (updateN (fill (stream (new (fill (mapS f (stream xs)))))) us)
  = {- fusion or recycling -}
new (updateN (fill (mapS f (stream xs))) us)

The result only has one new and so only allocates a single vector.
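
The reduction above can be mimicked outside GHC with a toy term rewriter. This is only a sketch: the Expr type, rewrite, and allocations are names made up for illustration and are not part of vector, and functions like f and us are represented by plain strings.

```haskell
-- Toy model: expressions are syntax trees over the three conversions plus
-- two representative operations. All names here are illustrative only.
data Expr
  = Var String            -- a manifest input vector
  | Stream Expr           -- stream :: Vector a -> Stream a
  | Fill Expr             -- fill   :: Stream a -> New a
  | New Expr              -- new    :: New a    -> Vector a
  | MapS String Expr      -- mapS f s, a Stream transformer
  | UpdateN Expr String   -- updateN n us, a New transformer
  deriving (Eq, Show)

-- Rewrite bottom-up: normalise the children first, then try both rules at
-- the root. Each rule returns an already-normalised subterm, so a single
-- pass reaches a normal form.
rewrite :: Expr -> Expr
rewrite e = case descend e of
  Stream (New (Fill s)) -> s   -- "fusion"
  Fill (Stream (New n)) -> n   -- "recycling"
  e'                    -> e'
  where
    descend (Stream x)     = Stream (rewrite x)
    descend (Fill x)       = Fill (rewrite x)
    descend (New x)        = New (rewrite x)
    descend (MapS f x)     = MapS f (rewrite x)
    descend (UpdateN x us) = UpdateN (rewrite x) us
    descend v@(Var _)      = v

-- Every remaining `new` corresponds to one allocated vector.
allocations :: Expr -> Int
allocations (Var _)       = 0
allocations (Stream x)    = allocations x
allocations (Fill x)      = allocations x
allocations (New x)       = 1 + allocations x
allocations (MapS _ x)    = allocations x
allocations (UpdateN x _) = allocations x

-- map f xs // us after inlining, as in the reduction above.
inlined :: Expr
inlined =
  New (UpdateN (Fill (Stream (New (Fill (MapS "f" (Stream (Var "xs"))))))) "us")
```

Loaded into GHCi, rewrite inlined produces the same term as the last line of the reduction, and allocations drops from 2 to 1.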

The Current Handling of Index Space Transformations

Let us take, for example, the expression reverse (reverse xs). Clearly, this should be optimized into at worst a single copy and at best a noop. However, in the current system, two reversals take place, and two arrays are allocated. This unfortunate fact is documented by a comment in the vector source:

-- FIXME: make this fuse better, add support for recycling

So why does this happen? The fundamental problem is that Streams cannot, in general, be reversed - this would require running a loop backwards. The best the library can do is stream the original vector in reverse - after all, vectors have random access. In fact, this is how reverse is implemented:

reverse = unstream . streamR

where unstream = new . fill and streamR does the aforementioned backwards streaming.

Solving the Problem With Delayed Arrays

repa, more so than vector alone, has a large number of these index space transformations. As such, the original repa paper contains a solution - delayed arrays. The basic idea is that an array can be completely described by its size and its indexing function. This representation is perfect for our index space transformations, as they now are simply function composition. Now, despite this optimization still being in repa today, it has a fundamental problem: the decision of where to use delayed arrays and where to use manifest arrays is forced upon the user. Instead of following this path, let us try to integrate it into the existing recycling framework.

The first step is to create the new representation and corresponding functions:

data Delayed a = Delayed {-# UNPACK #-} !Int (Int -> a)

delay :: Vector a -> Delayed a
delay v = Delayed (length v) (unsafeIndex v)

Now, where in the cycle of representations should the delayed arrays be? How easy it was to convert from a Vector to Delayed implies that it should be between Vector and Stream. In fact, writing the new stream function is surprisingly easy - previously, it was defined as follows:

stream v = v `seq` n `seq` (Stream.unfoldr get 0 `Stream.sized` Exact n)
  where
    n = length v

    {-# INLINE get #-}
    get i | i >= n    = Nothing
          | otherwise = case basicUnsafeIndexM v i of Box x -> Just (x, i+1)

The changed code is actually simpler:

stream (Delayed n ix) = Stream.unfoldr get 0 `Stream.sized` Exact n
  where
    {-# INLINE get #-}
    get i | i >= n    = Nothing
          | otherwise = Just (ix i, i+1)

With that in place, we can finish the framework by updating our rewrite rules:

{-# RULES
    "fusion" forall s . stream (delay (new (fill s))) = s
    "recycling" forall n . fill (stream (delay (new n))) = n
    "delaying" forall d . delay (new (fill (stream d))) = d
  #-}

The last thing to do is implement the new functions:

reverseD :: Delayed a -> Delayed a
reverseD (Delayed n ix) = Delayed n (\i -> ix (n - 1 - i))

reverse :: Vector a -> Vector a
reverse = new . fill . stream . reverseD . delay

-- I don't do bound checks for clarity
indexD :: Delayed a -> Int -> a
indexD (Delayed _ ix) i = ix i

(!) :: Vector a -> Int -> a
v ! i = indexD (delay v) i
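
To see why index space transformations become plain function composition, here is a small self-contained sketch that uses lists instead of vectors so it needs nothing beyond base. The helpers delayList, force, and backpermuteD are assumptions made for this example, not functions from vector or repa.

```haskell
-- A list-backed stand-in for the Delayed representation.
data Delayed a = Delayed !Int (Int -> a)

-- Assumed helper: view a list as a delayed array. Indexing a list is O(n),
-- which is fine for a demonstration but not for real use.
delayList :: [a] -> Delayed a
delayList xs = Delayed (length xs) (xs !!)

-- Assumed helper: materialise a delayed array by calling the index function
-- once per position.
force :: Delayed a -> [a]
force (Delayed n ix) = map ix [0 .. n - 1]

-- Reversal only rewrites the index; no element is touched.
reverseD :: Delayed a -> Delayed a
reverseD (Delayed n ix) = Delayed n (\i -> ix (n - 1 - i))

-- backpermute picks elements by position, again by composing index functions.
-- No bounds checks, for clarity.
backpermuteD :: Delayed a -> [Int] -> Delayed a
backpermuteD (Delayed _ ix) is = Delayed (length is) (\i -> ix (is !! i))
```

In this model reverseD (reverseD d) maps index i to n - 1 - (n - 1 - i) = i, so forcing it reads each source element exactly once and builds no intermediate array.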

Choosing Which Representation to Work On

The change described above, keeping the definitions of all functions that aren't index space transformations the same, has strictly better performance than the current vector. However, this isn't enough. Consider the reduction of reverse (map f xs):

reverse (map f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (mapS f (stream (delay xs)))))))))

Because no full cycle is present, no rules can fire, and an intermediate array is allocated. This is suboptimal, however, as map can also work on delayed arrays:

mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

reverse (map f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (stream (mapD f (delay xs)))))))))
  = {- delaying -}
new (fill (stream (reverseD (mapD f (delay xs)))))

Switching to the delayed representation doesn't work in all cases either. When map is fed a stream, it should ideally run mapS, as using a delayed representation would require an allocation. These problems plague many other functions that similarly can work on multiple representations, such as append, filter, and even reverse, which can operate on New as well as Delayed arrays.

As it turns out, the array recycling paper saw this problem and came up with a solution. Unfortunately, it is very specialized: it only works on single argument Stream-transforming functions that don't increase the size of the array. This works, for example, on filter or monomorphic uses of map, but excludes append and reverse. In addition, it only works for converting between Stream and New. For the general case, we need a new solution:

{-# RULES
  "reverseN/fill..." forall d . reverseN (fill (stream d)) = fill (stream (reverseD d))
  "reverseD/delay..." forall n . reverseD (delay (new n)) = delay (new (reverseN n))

  "appendS/stream" forall d1 d2 . appendS (stream d1) (stream d2) = stream (appendD d1 d2)
  "appendD/delay.../1" forall s1 d2 . appendD (delay (new (fill s1))) d2 = delay (new (fill (appendS s1 (stream d2))))
  "appendD/delay.../2" forall d1 s2 . appendD d1 (delay (new (fill s2))) = delay (new (fill (appendS (stream d1) s2)))

  "mapS/stream" forall f d . mapS f (stream d) = stream (mapD f d)
  "mapD/delay..." forall f s . mapD f (delay (new (fill s))) = delay (new (fill (mapS f s)))
  "mapD/delay... (monomorphic)" forall f n . mapD f (delay (new n)) = delay (new (transform (mapS f) n))
  "transform/fill" forall f s . transform f (fill s) = fill (f s)
  #-}
-- and so on

The above code demonstrates three different parts of the system. In the reverse case, we have a simple single argument function that can operate uniformly on multiple representations. As such, all we do is try to move the call inward, or rightward. This moves as many representation-changing functions as possible to the outside of the expression where, all clumped together, they form cycles and can be removed. For example,

reverse (filter f xs)
  = {- inlining -}
new (fill (stream (reverseD (delay (new (fill (filterS f (stream (delay xs)))))))))
  = {- reverseD/delay... -}
new (fill (stream (delay (new (reverseN (fill (filterS f (stream (delay xs)))))))))
  = {- recycling -}
new (reverseN (fill (filterS f (stream (delay xs)))))

The append case follows the same pattern, but it is slightly different in its handling of new. Because new is the only representation changing function that actually does work, adding in an extra stream to have a higher chance of fusing away the allocation is perfectly acceptable. As such, we aggressively move allocations outward even if only one is available. This change also makes the rewrite system confluent. Note that we can't do the same aggressive movement with the other representation changing functions, as we would have to introduce new allocations. In action,

append (filter f xs) (reverse ys)
  = {- inlining -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (delay (new (fill (stream (reverseD (delay ys)))))))))
  = {- delaying -}
new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (reverseD (delay ys)))))
  = {- appendD/delay.../1 -}
new (fill (stream (delay (new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))))))
  = {- recycling -}
new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))

map is an interesting function because it can work on Delayed arrays, Streams, and New, but only if the function preserves types. Because of this, we have two cases: polymorphic and monomorphic. In the polymorphic case, the mapS/stream and mapD/delay... rules cycle through Stream and Delayed. In the monomorphic case, however, there is a different cycle, formed by mapS/stream, mapD/delay... (monomorphic), and transform/fill.

The second phase: fixing smaller inefficiencies

As it turns out, this system is provably optimal in the number of allocations. Unfortunately, allocations aren't the only thing determining performance. A system very similar to the one described above was rejected because of this. It describes the following example:

map (> 5) (map (+1) (xs // ps))
  = {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
  = {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))
  = {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps)))))))

Although the result has the optimal two allocations, the two maps are executed in separate loops and so cannot be properly fused. To rectify this, I propose a two-phase system. In phase 0, the previously described system is run, eliminating all unnecessary allocations. In phase 1, we do a number of "fixup" transformations that undo some of the inward movement done in phase 0, like so:

{-# RULES
  "stream/mapD" [1] forall f d . stream (mapD f d) = mapS f (stream d)
  "delay.../mapS" [1] forall f s . delay (new (fill (mapS f s))) = mapD f (delay (new (fill s)))

  "stream.../transform" [1] forall f n . stream (delay (new (transform f n))) = f (stream (delay (new n)))

  "delay.../reverseN" [1] forall n . delay (new (reverseN n)) = reverseD (delay (new n))

  "stream/appendD" [1] forall d1 d2 . stream (appendD d1 d2) = appendS (stream d1) (stream d2)
  #-}

Note that map, which can act equally efficiently on Delayed arrays and Streams, simply moves outward, switching between the two representations in an effort to get out of the way of other transformations. However, reverse runs much more efficiently on Delayed arrays than on New, so only switches away from New. Similarly, append uses Streams and even monomorphic uses of map try to avoid New. To see this working:

map (> 5) (map (+1) (xs // ps))
  = {- inlining -}
new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))
  = {- delaying -}
new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))
  = {- mapD/delay... (monomorphic) -}
new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps)))))))
  = {- entering phase 1, stream/mapD -}
new (fill (mapS (> 5) (stream (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps)))))))
  = {- stream.../transform -}
new (fill (mapS (> 5) (mapS (+1) (stream (delay (new (update (fill (stream (delay xs))) ps)))))))

Future Work

While this system works fairly well, there are a number of ways in which it could be improved.

More representations

While delayed arrays help fusion significantly, other representations could also be useful. For example, the expression reverse (filter f xs) is currently compiled to new (reverseN (fill (filterS f (stream (delay xs))))), which allocates a filtered array then reverses it in place. Ideally, the system would simply write out the array in reverse, requiring no post-processing step. This could be accomplished with a representation similar to Stream (Int, a) which would specify not only what the elements were but where to write them.

Multiple cycles of representations

The aforementioned reverse filling only works if the number of elements has a bound known before calculation. However, due to functions like concatMap, this is not necessarily true. To fix this, there would have to be a static distinction between the two types of Streams. This would cause there to be two different ways of getting from a Delayed array to New, and so there would be multiple possible cycles of representations.

Commutativity optimizations

The current system of fusion puts functions in a strict order, conservatively assuming that each one could do almost anything, and so cannot be reordered. However, this misses out on a lot of opportunities. For example, filter f (reverse (filter g xs)) compiles to

new (fill (filterS f (stream (reverseD (delay (new (fill (filterS g (stream (delay xs))))))))))
instead of to the much more efficient
new (fill (filterS f (filterS g (stream (reverseD (delay xs))))))

Proof of allocation optimality

For those of you who are interested, here is my proof that the simple inward moving strategy produces the optimal number of allocations.

First of all, we define an expression to be a finite rooted tree whose internal vertices are functions, associated with one of a finite set of representations. The leaves and the root are special, abstract nodes that are all associated with the same representation, what we will call the manifest representation. The root has exactly one child. Note that this disallows functions that return multiple results or take different representations as arguments.

Now, assume there is some total order on the set of representations, with the manifest representation being the minimum. We say that an edge of an expression is an allocating edge if the child node's representation is greater than the parent node's representation. We say that the cost of an expression is the number of allocating edges in that expression.

Additionally, every function is associated with a fixed, nonempty set of allowed representations. We say that an expression is valid if every function is associated with one of its allowed representations.

Next we say that the inward optimization is the transformation from expressions to expressions that, at every node, sets its representation to the smallest allowed representation that is greater than or equal to the representations of its children, discounting the children that are larger than all allowed representations. This is done in a bottom up fashion, starting with the nodes just above the leaves, thereby making this transformation independent of the previous representations and so idempotent. Note that this is equivalent to the set of rewrite rules described above, as the rules that don't deal with allocations simply lower the representation of a node if the new representation is greater than or equal to that of all the children, and the aggressive rules that do deal with allocations simply raise the representation to the largest one allowed if a large child is detected.

Lemma 1: The inward optimization returns a valid expression

Because the inward optimization, by definition, assigns a representation to every function that is one of its allowed representations, it returns a valid expression.

Lemma 2: The inward optimization does not increase the cost of an expression if the input was valid

It suffices to prove that transforming a single node does not increase the cost of the expression, as the inward optimization simply transforms the tree one node at a time. Now, we can ignore the edges to the children that have a representation larger than all the allowed representations, as all valid assignments must make those edges allocating, and this optimization does not change that. The inward optimization sets the remaining edges to not be allocating, as, by definition, it picks a representation greater than or equal to the children's representations. Therefore, the only way for the inward transformation to increase the cost would be switching the parent edge to be allocating where previously no edges were allocating. However, if this was the input state, then there exists some allowed representation greater than or equal to those of the children and less than or equal to that of the parent, and the minimum allowed representation greater than or equal to those of the children, which the inward optimization picks, would satisfy this property.

Alternatively, given the correspondence with the rewrite rules given earlier in this post and the fact that none of those rewrite rules increase the number of allocations, the repeated use of those rules and so the inward optimization doesn't increase the number of allocations.

The final proof: The inward optimization produces a minimum cost expression

Consider a valid assignment of representations with minimum cost. Applying the inward optimization to it cannot increase the cost, and as we started with a minimum cost expression, it cannot decrease the cost. Therefore, the result must also be minimum cost. However, the inward optimization is independent of the starting representations, so applying the inward optimization to any tree with the same functions produces a minimum cost expression.
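
The definitions in this proof translate fairly directly into code. The following is a sketch under stated assumptions: representations are modelled as Ints with the manifest one as 0, and the ranking Delayed = 1, Stream = 2, New = 3 (the order in which the cycle is traversed) is my reading of the post rather than something it states. The names Expr, inward, and cost are invented for the example.

```haskell
type Rep = Int   -- total order on representations; 0 is the manifest one

data Expr
  = Leaf                    -- abstract input, always manifest
  | Node Rep [Rep] [Expr]   -- assigned representation, allowed set, children
  deriving (Eq, Show)

rep :: Expr -> Rep
rep Leaf         = 0
rep (Node r _ _) = r

-- Cost of the edge from a parent with the given representation to a child,
-- plus the cost of everything below the child. An edge allocates when the
-- child's representation is greater than the parent's.
edgeCost :: Rep -> Expr -> Int
edgeCost parent child = here + below child
  where
    here = if rep child > parent then 1 else 0
    below Leaf          = 0
    below (Node r _ cs) = sum (map (edgeCost r) cs)

-- Cost of an expression hanging off the (manifest) root.
cost :: Expr -> Int
cost = edgeCost 0

-- The inward optimization: bottom-up, pick the smallest allowed
-- representation that is >= every child that is not larger than all
-- allowed representations. The allowed set is assumed nonempty.
inward :: Expr -> Expr
inward Leaf = Leaf
inward (Node _ allowed cs) = Node r allowed cs'
  where
    cs'  = map inward cs
    top  = maximum allowed
    need = maximum (0 : [rep c | c <- cs', rep c <= top])
    r    = minimum [a | a <- allowed, a >= need]

-- reverse (map f xs), naively assigned: reverse on Delayed, map on Stream.
reverseMap :: Expr
reverseMap = Node 1 [1, 3] [Node 2 [1, 2] [Leaf]]

-- reverse (filter f xs), with filter restricted to Stream in this model.
reverseFilter :: Expr
reverseFilter = Node 1 [1, 3] [Node 2 [2] [Leaf]]
```

Both examples cost 2 as written and 1 after inward, which matches the reductions shown earlier: map moves to Delayed in the first case, and reverse moves to New in the second.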

Saturday, October 27, 2012

Disciple-style Regions in Haskell, Part 1

I've admired ddc for quite a while, in particular its region system. Unfortunately, it also has many annoyances, and one big one is complexity. To rectify this, I've created a system implementing regions in Haskell.

The Essence of Regions

At first we might adopt a definition of regions based on their use in memory management: they are areas where you can allocate memory and work with it, as shown by the class below.

-- WRONG!
class (Monad (Environment r)) => Region r where
    data Ref r :: * -> *
    type Environment r :: * -> *

    newRef :: a -> Environment r (Ref r a)
    readRef :: Ref r a -> Environment r a
    writeRef :: a -> Ref r a -> Environment r ()
There are a couple of instances of this - consider ST, or IO, or STM. Unfortunately, this is not what we want. We need to support, for example, immutable regions. Not only that, but in rare cases we might want a write only region (as a random example, a password store). You might have regions that you can't create data in. In the end, we have reduced the idea of a region to something very small: a place with data.
data family Ref r :: * -> *
Note that this also allows us to do away with the monad. Of course, mutation and reading and creation are common, so we should have some classes:
class (Monad m) => Writable r m where
    writeRef :: a -> Ref r a -> m ()

class (Monad m) => Readable r m where
    readRef :: Ref r a -> m a

class (Monad m) => Creatable r m where
    createRef :: a -> m (Ref r a)
Following the previous idea of not needing a fixed monad, we do not restrict a reference to have a single monad associated with it.

Some Examples

data Mut s
newtype instance Ref (Mut s) a = MutRef (STRef s a)

instance Writable (Mut s) (ST s) where
    writeRef val (MutRef ref) = writeSTRef ref val

instance Readable (Mut s) (ST s) where
    readRef (MutRef ref) = readSTRef ref

instance Creatable (Mut s) (ST s) where
    createRef val = fmap MutRef $ newSTRef val

data Immut
newtype instance Ref Immut a = ImmutRef a

instance (Monad m) => Readable Immut m where
    readRef (ImmutRef val) = return val

instance (Monad m) => Creatable Immut m where
    createRef val = return (ImmutRef val)

data Atomic
newtype instance Ref Atomic a = AtomicRef (TVar a)

instance Writable Atomic STM where
    writeRef val (AtomicRef ref) = writeTVar ref val

instance Readable Atomic STM where
    readRef (AtomicRef ref) = readTVar ref

instance Creatable Atomic STM where
    createRef val = fmap AtomicRef $ newTVar val

instance Writable Atomic IO where
    writeRef val (AtomicRef ref) = atomically (writeTVar ref val)

instance Readable Atomic IO where
    readRef (AtomicRef ref) = readTVarIO ref

instance Creatable Atomic IO where
    createRef val = fmap AtomicRef $ newTVarIO val
To support things like pointers or specialized references, we probably want to add the ability to restrict the types that can be put in a Ref, but this is beyond the scope of this post.
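
For reference, here is a cut-down version of the above that should load as a single file using only base. It is a sketch rather than the full framework: it keeps only the Mut and Immut regions, assumes the creation method is called createRef, and adds two hypothetical helpers (newMut and increment) to show code that is generic in both the region and the monad.

```haskell
{-# LANGUAGE TypeFamilies, MultiParamTypeClasses, FlexibleInstances #-}

import Control.Monad.ST (ST, runST)
import Data.Kind (Type)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)

data family Ref r :: Type -> Type

class Monad m => Writable r m where
  writeRef :: a -> Ref r a -> m ()

class Monad m => Readable r m where
  readRef :: Ref r a -> m a

class Monad m => Creatable r m where
  createRef :: a -> m (Ref r a)

-- A mutable region backed by STRef.
data Mut s
newtype instance Ref (Mut s) a = MutRef (STRef s a)

instance Writable (Mut s) (ST s) where
  writeRef val (MutRef ref) = writeSTRef ref val

instance Readable (Mut s) (ST s) where
  readRef (MutRef ref) = readSTRef ref

instance Creatable (Mut s) (ST s) where
  createRef val = fmap MutRef (newSTRef val)

-- An immutable region: readable in any monad, never writable.
data Immut
newtype instance Ref Immut a = ImmutRef a

instance Monad m => Readable Immut m where
  readRef (ImmutRef val) = return val

-- Hypothetical helper: pins the region to Mut so inference has an anchor.
newMut :: a -> ST s (Ref (Mut s) a)
newMut = createRef

-- Hypothetical helper: works in any region that is readable and writable.
increment :: (Readable r m, Writable r m) => Ref r Int -> m ()
increment ref = do
  n <- readRef ref
  writeRef (n + 1) ref

counterDemo :: Int
counterDemo = runST (do
  ref <- newMut 41
  increment ref
  readRef ref)

-- The same readRef, run in the Maybe monad on an immutable reference.
immutDemo :: Maybe Int
immutDemo = readRef (ImmutRef 7)
```

Calling increment on a Ref Immut Int is rejected at compile time, since there is no Writable Immut instance.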

Building Complex Data Structures

As an example, I will build a cyclic doubly linked list with this framework. Actually constructing the representation and basic operations of a complex data structure is pretty easy:

data DList r a = DList (Ref r (DList r a)) a (Ref r (DList r a))

head :: (Monad m) => DList r a -> m a
head (DList _ x _) = return x

tail :: (Readable r m, Writable r m) => DList r a -> m (DList r a)
tail (DList lRef _ rRef) = do
    l@(DList ll xl rl) <- readRef lRef
    r@(DList lr xr rr) <- readRef rRef
    writeRef l lr
    writeRef r rl
    return r

singleton :: (Creatable r m, MonadFix m) => a -> m (DList r a)
singleton x = do
    rec l <- createRef result
        r <- createRef result
        let result = DList l x r
    return result

-- And so on...
However, how do we switch between representations? With arrays, we usually have the freeze and thaw methods. Here, we have a possibly infinite collection of regions to switch between - how can we do it? I'll post my answer in part 2, since this post is getting a bit long.

Friday, October 12, 2012

Stream Fusion for Conduits

Recently I've been looking at Michael Snoyman's conduit package. Now, conduit is a pretty impressive library, but I was skeptical about its performance. If you look at its source, you see almost every function recursively going down the stack of conduit operations and creating a new one. Not only that, but deforestation is scarce, so it is hard to justify constantly rebuilding these trees.

To fix this, the first thing I tried was using the free monad implementation described here. Unfortunately, this has a large problem: anything besides monad operations is not easy. Functions like pipe, which composes pipes, require building a structure and then traversing it.

After trying a number of similar methods (combining church encoding with Scott encoding, etc.), all of which were equally unsuccessful, I came upon the solution: Stream Fusion.

Stream Fusion

Stream Fusion (this paper is very readable, by the way) is a technique for removing unnecessary list or array traversals by using an inversion of control. The key definitions are as follows:

data Step s a = Skip s | Yield s a | Done
data Stream a = forall s . Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream {-list-} = Stream next {-list-} where
    next [] = Done
    next (a : as) = Yield as a

unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0 where
    go s = case next s of
        Done -> []
        Skip s -> go s
        Yield s a -> a : go s
The library on Hackage modifies these functions lightly to force constructor specialization to happen, but this is irrelevant.

Next we have a library full of functions on stream, like so:

mapStream :: (a -> b) -> Stream a -> Stream b
mapStream f (Stream next s) = Stream next' s where
    next' s = case next s of
        Done -> Done
        Yield s' a -> Yield s' (f a)
        Skip s' -> Skip s'

filterStream :: (a -> Bool) -> Stream a -> Stream a
filterStream p (Stream next s) = Stream next' s where
    next' s = case next s of
        Done -> Done
        Yield s' a
            | p a       -> Yield s' a
            | otherwise -> Skip s'
        Skip s' -> Skip s'
The filterStream function here is especially interesting, as it shows how the Skip constructor can be useful - it allows filterStream to be non-recursive. This helps GHC's optimizer immensely.

Next, we implement a list library on top of this stream library:

map :: (a -> b) -> [a] -> [b]
map f = unstream . mapStream f . stream

filter :: (a -> Bool) -> [a] -> [a]
filter p = unstream . filterStream p . stream
Note that if we inline these definitions, we get things like this:
map f . filter g . map h =
    unstream . mapStream f .
    stream . unstream .
    filterStream g .
    stream . unstream .
    mapStream h . stream
Notice how stream . unstream shows up a lot? Well, at least approximately, streams and lists are isomorphic, so we can add this rule to our library:
{-# RULES
"stream/unstream" forall s . stream (unstream s) = s
  #-}
With this simplification, we now have only one stream and one unstream in our pipeline. This means that we now have one tight loop, instead of three, increasing performance.
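
The pieces above fit together into a small self-contained file. This is a sketch for experimenting in GHCi, not the code of the stream-fusion library: the rewrite rule is applied by hand, and the names unfused and fused, as well as the concrete functions (* 3), even, and (+ 1), are chosen only for this example.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Skip s | Yield s a | Done
data Stream a = forall s . Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream = Stream next
  where
    next []         = Done
    next (x : rest) = Yield rest x

unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield s' x -> x : go s'

mapStream :: (a -> b) -> Stream a -> Stream b
mapStream f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield s' x -> Yield s' (f x)

filterStream :: (a -> Bool) -> Stream a -> Stream a
filterStream p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done -> Done
      Skip s' -> Skip s'
      Yield s' x
        | p x       -> Yield s' x
        | otherwise -> Skip s'   -- drop the element without looping

-- map (+ 1) . filter even . map (* 3) after inlining: three loops and two
-- intermediate lists.
unfused :: [Int] -> [Int]
unfused =
  unstream . mapStream (+ 1) . stream . unstream . filterStream even
    . stream . unstream . mapStream (* 3) . stream

-- The same pipeline with every stream . unstream pair removed by hand,
-- which is what the "stream/unstream" rule does: a single loop.
fused :: [Int] -> [Int]
fused = unstream . mapStream (+ 1) . filterStream even . mapStream (* 3) . stream
```

Both versions agree with the ordinary list functions; for example, each maps [1 .. 6] to [7, 13, 19].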

Applying This to Conduits

So far, our discussion has only applied to lists. How do we do something like this for, say, a conduit? The answer lies near the end of the paper, in section 9.3, which brings up the example of a binary tree:

data Tree a b = Leaf a | Branch b (Tree a b) (Tree a b)
To apply stream fusion to this type, we modify the Step type:
data Step a b s = Leaf_s a | Branch_s b s s | Skip s
data Stream a b = forall s . Stream (s -> Step a b s) s
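
To make the tree version concrete, here is a sketch of the conversions and one non-recursive transformer for it. The function names streamT, unstreamT, and mapLeavesS are made up for this example and do not come from the paper.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Tree a b = Leaf a | Branch b (Tree a b) (Tree a b)
  deriving (Eq, Show)

data Step a b s = Leaf_s a | Branch_s b s s | Skip s
data Stream a b = forall s . Stream (s -> Step a b s) s

-- The seed is the tree itself; one step peels off one constructor.
streamT :: Tree a b -> Stream a b
streamT = Stream next
  where
    next (Leaf a)       = Leaf_s a
    next (Branch b l r) = Branch_s b l r

-- All recursion lives here: a branch continues from two seeds.
unstreamT :: Stream a b -> Tree a b
unstreamT (Stream next s0) = go s0
  where
    go s = case next s of
      Leaf_s a       -> Leaf a
      Branch_s b l r -> Branch b (go l) (go r)
      Skip s'        -> go s'

-- A non-recursive transformer, analogous to mapStream on lists.
mapLeavesS :: (a -> c) -> Stream a b -> Stream c b
mapLeavesS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Leaf_s a       -> Leaf_s (f a)
      Branch_s b l r -> Branch_s b l r
      Skip s'        -> Skip s'
```

As with lists, unstreamT . streamT is the identity on trees, so a rule removing streamT . unstreamT pairs would let chained tree transformers run as one traversal.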
Moving on to conduits, we have what follows.
data Step l i o u m r s =
    HaveOutput s (m ()) o
  | NeedInput (i -> s) (u -> s)
  | Done r
  | PipeM (m s)
  | Leftover s l
  | Skip s
data Stream l i o u m r = forall s . Stream (s -> Step l i o u m r s) s
The full code is available here.

Performance

For an optimization library, benchmarks are pretty important. I've only done preliminary testing, but here are the results. Especially for something so quickly thrown together by someone who doesn't understand the nuances of SpecConstr, I'd say the results are very promising.