tag:blogger.com,1999:blog-1330661694970365632015-09-16T13:59:19.038-07:00Rarely Repetitive RamblingsJonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.comBlogger9125tag:blogger.com,1999:blog-133066169497036563.post-18723913989720664702014-12-17T08:16:00.001-08:002014-12-17T08:16:20.020-08:00Durst (lack of) progress and Rust<p>Lately, I haven't gotten much quantitatively done on Durst. I've written barely any new code. In part, this is because I keep getting distracted, both with other projects and with every new idea I have for the language - I'm too busy planning Durst to actually make it. This is unfortunate, and I hope I'll get some stuff done over winter break, but for now, I'll talk about the compiler I have been working on: Rust.</p> <p>Rust's compiler is honestly quite ugly, and so I've made a few attempts to clean up some of it (specifically the name resolution code). However, I've mainly been working on the standard library's data structures. I really enjoy the puzzle of figuring out how to make APIs that are perfectly safe (they make it impossible to, e.g., read uninitialized memory or use memory after it has been freed) yet fully powerful. Recently, I redesigned <code>BTreeMap</code>'s internal APIs to make them far safer (and faster).</p> <p>The core problem is that navigating through a <code>BTree</code> often requires maintaining a stack of the nodes that have been visited to get to the current node. However, mutating any of these nodes could invalidate the rest, so Rust won't let us just store a stack of mutable references. As such, there is an internal <code>stack</code> module that unsafely manages a stack of nodes, only allowing mutation at the top. The issue that arises is figuring out how to push to the stack. If the stack module allowed you to push arbitrary nodes, then it wouldn't work for its designed purpose, namely holding a contiguous path down a <code>BTree</code>. 
It would be easy to add a node farther down the tree or even from a different tree.</p> <p>To solve this, I had the stack expose the top node with a special <code>'id</code> lifetime, only accepting a request to push that same node's children. I'm still trying to figure out the best way to explain this, but the PR is <a href="https://github.com/rust-lang/rust/pull/18028">here</a>.</p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-17969077557847740912014-12-17T08:02:00.001-08:002014-12-17T08:02:36.534-08:00Backpropagation, memory, and other optimization troubles<p>In my previous post, I described an intermediate representation that made a number of optimization problems much easier. However, it still has some complications.</p> <h2>Memory</h2><p>Naively, I could just represent memory operations as side effects, modifying the world. However, this is suboptimal for exactly the reasons that I chose to keep operations like addition out of the normal flow - it hinders optimization. As such, I need some other representation of memory. libFIRM takes the approach of having a global "Store" that it passes in and out of every write and into every read. This has a number of large advantages - it naturally expresses that reads can be reordered and that memory operations are independent of input and output. Moreover, if this is extended with a bunch of separate Stores that can be split and merged, this provides a perfect model for aliasing - a Store can be split into two nonaliasing pieces, and operations on one won't affect the other. However, this has the same issue as the state machine representation of control flow - there is nothing preventing all of the memory from being duplicated.</p> <p>One idea I had to get around this is to make every write to memory produce a read-only Store and to take a "memory continuation" to the next write. 
This circumvents the issue of duplicated memory, but comes with its own problems. Merging separate stores becomes inexpressible, and I would need a new MemoryPhi node to handle conditionals in the memory pipeline. Given these problems, I am currently leaning towards a state-based approach, but haven't fully decided yet.</p> <h2>Backpropagation</h2><p>In a simple list of instructions, it is easy to do the sort of dataflow analysis I described in the previous post. However, with my new IR, it becomes harder, involving a lot of going "backwards". This is because just following the control flow chain will not result in every node being analysed, as many operations are out of the control flow. To remedy this, the optimizer has to be able to take a node whose inputs haven't been analysed, analyse the inputs, and then do another analysis of the use. This is tedious, but it also makes things a bit more consistent - with the switch to continuation-based control flow, analysing later instructions involves analysing the arguments of earlier instructions.</p> <p>The other place where this sort of backward analysis becomes crucial is in situations like the following:</p> <code><pre><br />if x == 5:<br /> y := x<br />else:<br /> y := 5<br /></pre></code> <p>This should be optimized to <code>y := 5</code>. However, to do this, the optimizer has to know that <code>x</code> is 5 when analysing <code>y := x</code>. This is more difficult than it looks, because it has to extract this information from the result of a comparison. It has to go backward in the tree, realizing that since <code>x == 5</code> is true, <code>x</code> must be 5.</p> <p>In general, it seems that the question the optimizer is asking is: "Here is a node. Here are some facts I know to be true. 
What is the most simplified form of every node that this node depends upon?"</p> <h2>Allocation and random numbers</h2><p>Ideally, the optimizer should be able to remove unused allocations and reorder allocations. However, I don't see a way to have them not in the control flow, as they do have side effects. Two runs of an allocation function should not return the same result, and so allocation cannot be a pure function. The same issue shows up with random numbers - unless the programmer cares about replayability, the compiler should be able to reorder and remove calls to a random number generator. The problem in both these cases is the same: we have a function that is not deterministic (calling it twice returns two different results) yet has no observable side effects. I honestly have no idea how to deal with this.</p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-66860922713485294462014-12-17T07:31:00.000-08:002014-12-17T07:31:07.019-08:00A libFIRM-inspired framework for optimization<p>While just translating code from my new language to assembly is cool and useful, what intrigues me most about compilers is optimization. Ideally, I'd like my compiler to not just produce binaries, but produce fast binaries. Additionally, I'd like to do this in an extensible way, making it easy to write new optimizations and to have them integrate well with the existing passes.</p> <p>The most powerful and useful way to integrate analyses and transforms that I have seen is the technique of Lerner, Grove, and Chambers, which interleaves various transforms so that each one can benefit from the simplifications made by the rest. Much of my plan for implementing it follows the strategy taken in <a href="http://www.cs.tufts.edu/~nr/pubs/dfopt.pdf">Hoopl</a>. In more detail, each optimization pass is made up of two functions: an analysis function and a rewrite function. 
The rewrite function takes the current dataflow state and transforms an instruction into a simpler form. The analysis function just steps the dataflow state forward. To pass information between optimizations, every rewrite is run on an instruction and then all the analyses are done, making all analyses use the maximally simplified instruction.</p> <p>As an example of why this is powerful, consider the interaction between constant propagation and constant folding on the following pseudocode:</p> <code><pre><br />x := 5<br />y := x + 2<br />z := x + y<br /></pre></code> <p>If constant propagation is run, then this will result in</p> <code><pre><br />x := 5<br />y := 5 + 2<br />z := 5 + y<br /></pre></code> <p>which constant folding will reduce to</p> <code><pre><br />x := 5<br />y := 7<br />z := 5 + y<br /></pre></code> <p>This is definitely suboptimal, as <code>z</code> should have been optimized to 12. Switching the order of the two passes does not help either, as constant folding finds nothing to fold before propagation has run.</p> <p>In contrast, when the two optimizations are interleaved, they produce optimal code. Before constant propagation can analyse <code>y := 5 + 2</code>, constant folding rewrites it to <code>y := 7</code>. Constant propagation recognises this new form, substituting to make <code>z := 5 + 7</code>, which is then folded into 12.</p> <h2>Intermediate Representation</h2><p>While having a way to optimize code is important, it needs a representation of the code that is easy to analyse to be effective. A simple AST of my language would work, but would not be very amenable to optimization and would not be very general. To defer the questions of register allocation and to make optimization opportunities more obvious, I will use an <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">SSA form</a> representation. 
SSA form has numerous advantages, which I will leave out here, but they are substantial enough that most modern optimizers (GCC and LLVM are notable examples) use it.</p> <p>Beyond just using an SSA form language, I am using an idea inspired by Cliff Click's <a href="http://www.grothoff.org/christian/teaching/2007/3353/papers/click95simple.pdf"> graph-based intermediate representation</a>, used in Java's HotSpot VM and in <a href="http://pp.ipd.kit.edu/firm/">libFIRM</a>. Instead of just having linear sequences of instructions, only side-effecting instructions are put into control flow, obviating the need to rearrange instructions carefully. This idea of separating out the pure computations is exactly what makes Haskell so highly optimized - GHC can do many transformations by knowing that order of computation does not matter.</p> <p>I deviate from Click in one crucial way, however. His representation treats the world as a state machine, having some "world" object that is passed in and out of every operation that needs to modify the environment. GHC has something similar for its IO representation, using <code>State# RealWorld</code> as this object. This has some unfortunate consequences, though. First of all, infinite loops are no longer well formed. They result in a circular data dependency that is not tied to any of the exit nodes used to detect dead code. Secondly, there is nothing stopping some badly designed optimization from "duplicating" the world, trying to do I/O and then roll it back with another world object.</p> <p>To remedy these issues, I use a continuation passing style implementation of the world. Instead of having each operation take a value from its predecessor, each operation holds a continuation pointing at its successor. This prevents any duplication of world state, as an operation can only point to one successor, and it solves the problem of infinite loops by looking at things reachable from the start, not the end. 
Also, dead code analysis is made much simpler, as something is live iff it is reachable from the start node. <code>If</code> statements can then just select between two values based on a boolean, with the control flow handled by continuations.</p> <h2>Examples</h2> <p>Doing my best with ASCII diagrams, here are a few examples of my IR.</p> <code><pre><br />x := 13<br />y := 12<br />print(x + y)<br /></pre></code> <p>becomes</p> <code><pre><br />[Start] [Const 13]<br /> | ^<br /> V |<br />[Print] -> [Add] -> [Const 12]<br /> |<br /> V<br />[Return]<br /></pre></code> <code><pre><br />x := true<br />if x:<br /> y := 5<br />else:<br /> y := 7<br />print(y)<br /></pre></code> <p>becomes</p> <code><pre><br />[Start]<br /> |<br /> | >--> [Const true]<br /> V |<br />[ If ]<br /> | |<br /> | | >--> [Const 5]<br /> V V |<br />[ Phi ] --> [Const 7]<br /> | ^<br /> V |<br />[Print]<br /> |<br /> V<br />[Return]<br /></pre></code> <p>and is optimized to (assuming x was actually a variable)</p> <code><pre><br />[Start]<br /> |<br /> | [Const true]<br /> | ^<br /> | | >---> [Const 5]<br /> V | |<br />[Print] -> [ If ] ---> [Const 7]<br /> |<br /> V<br />[Return]<br /></pre></code> <code><pre><br />x := 0<br />loop forever:<br /> print(x)<br /> x := x + 1<br /></pre></code> <p>becomes</p> <code><pre><br />[Start]<br /> | V-----------------------<<br /> | | >--> [Const 0] |<br /> V V | |<br />[ Phi ]--V [Const 1] |<br /> | ^ | ^ |<br /> | | >----V | |<br /> | +------ [Add] |<br /> V | |<br />[Print] --------------------^<br /></pre></code>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-51495926695511571492014-09-30T03:27:00.000-07:002014-09-30T03:27:37.700-07:00Durst progress<p>These last two weeks have been quite slow on the code side of things, primarily because I've spent my time reading up on LLVM and secondarily because I have been less productive. 
Hopefully, this research will allow me to actually get code generation working. This post is somewhat of a haphazard collection of notes about what I have read.</p> <p>Unfortunately, the compilation process is very complicated. To create an executable in a scalable way, I must transform the code to LLVM IR, emit an object file, track which object files need to be used, and link them together. The IR creation step seems thankfully quite simple, but object files and linking can be quite confusing. For now, though, I plan on just generating a single object and executable, so I shouldn't have to worry too much.</p> <p>The bindings to LLVM that I plan on using in the <a href="http://doc.rust-lang.org/rustc_llvm/index.html">rustc_llvm</a> crate are unstable and badly documented, but they seem to mainly correspond directly to LLVM functions, so I should hopefully be able to just follow the Kaleidoscope tutorial and <a href="https://github.com/hobinjk/rusty-kaleidoscope">this</a> implementation of it. As of yet, Durst has no control flow, which makes things easier.</p> <p>Rustc, whose source I've been reading and modifying in hopes of understanding compilers better, is very complex. Unfortunately, I haven't really managed to learn much about compiler design yet, since I'm having trouble even getting started with the whole compilation system.</p> <p>Hopefully, I'll actually have significant code changes to talk about next week.</p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-24094193802064318402014-09-15T05:50:00.000-07:002014-09-15T05:50:17.474-07:00Durst progress<p>My new language is moving along slowly but steadily. The parser is now complete, and though I am tempted to rewrite it after reading about <a href="https://jeffreykegler.github.io/Marpa-web-site/">Marpa parsing</a>, this would probably be counter-productive. 
Adding an interpreter was fairly simple, but this is running on an expression AST, not any sort of CFG.</p> <p>The current plan is to first extend the language with functions, including some basic IO, then move on to compilation. Compilation will be done by first converting to an SSA CFG similar to that of LLVM, then converting to LLVM and letting it do the hard work of actually producing machine code.</p> <p>Coming up with a good representation for the CFG is a bit difficult, because I want each operation to point to its operands, but naively done this is impossible in Rust, which forbids cycles of pointers. As such, I plan to create an (internally unsafe) module for a CFG using raw pointers, and use that.</p> <p>Once I have basic compilation working, the plan is to play around with optimization by creating a framework similar to <a href="https://research.microsoft.com/en-us/um/people/simonpj/papers/c--/dfopt.pdf">Hoopl</a>. I have a number of ideas of how to improve upon that work, including interleaving the backwards and forwards passes and immediately applying transformations that are known to be valid.</p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-49697394448659724942014-09-02T21:11:00.000-07:002014-09-02T21:11:08.882-07:00Durst: Creating a new language in Rust<p>I've started a new major project, with high and lofty goals that will probably never be achieved. I am creating my own programming language, with the goal to be similar to Rust in many aspects, but ensuring greater safety using dependent types.</p> <p>So far, I've built a basic compiler executable that calls into a parsing library that I wrote. 
The executable is fairly simple, but the parsing library was interesting, as Rust gave me goals and restrictions that weren't present in other parsing libraries I've written.</p> <p>First, I wanted everything to be allocated on the stack - this allows for a compact representation and allows the compiler to perform more optimizations. This was actually pretty easy, as Rust has very simple and easy to use structs and doesn't box anything without explicit requests.</p> <p>Second, I wanted the parsing framework to support recursion. This was where I started to run into problems, because Rust requires all memory to be initialized. As such, it is impossible to easily create cyclic data structures. Worse, even if I were to use something like <code>RefCell</code> to get around these restrictions, cycles just don't fit well into Rust's ownership model - it is hard to ask who owns whom and to prove that backreferences will only live as long as the forward references.</p> <p>To get around this problem, I simply required users of recursion to actually create their own implementation of the <code>Parser</code> trait, meaning that they can just create the cyclic parsers on the fly.</p> <p>By next week, I hope to have a full AST defined and parsing, so that I can move on to interpretation.</p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-48791588052859997972014-01-25T10:56:00.000-08:002014-01-25T10:56:09.245-08:00Extending Array Recycling With Delayed Arrays For Index Space Transformations<p>First, I must apologize. My previous post said that I would write a sequel, explaining my solution to the given problem. However, I didn't, and have since forgotten all my thoughts on the subject. As such, I will not be writing a part two.</p><p>With that over with, I get to the subject of this post. 
The <a href="https://hackage.haskell.org/package/vector"><code>vector</code></a> library has a very sophisticated fusion system, allowing it to evaluate expressions such as <code>map (+1) (filter p xs // us)</code> using only one new array, instead of the 3 it seems to use. However, there is a large class of operations it cannot optimize properly, namely index space transformations like <code>reverse</code>, <code>backpermute</code>, or even <code>index</code>. By stealing an idea from <a href="https://hackage.haskell.org/package/repa"><code>repa</code></a>, these functions can be cleanly fused without any change to user code.</p> <h2>The Current Fusion System</h2> <p>The system <code>vector</code> uses for fusion, described in the paper <a href="https://www.cse.unsw.edu.au/~rl/publications/recycling.pdf">Recycle Your Arrays!</a> by Roman Leshchinskiy (like the stream fusion paper, this is very readable), is based on two intermediate representations that describe how to build the array instead of the actual elements: the <code>Stream</code> describes the array as something akin to Python's generators - it is a loop (stored in a special decomposed format to help the compiler) that can yield elements as it runs, while the <code>New</code> type is a computation (in the <code>ST</code> monad) that allocates and fills a mutable vector. Each of these has its respective advantages and disadvantages - the <code>Stream</code> allows for polymorphic transformations and arbitrary changes to the length, and produces a single tight loop filling the array, while <code>New</code> has random access and update but cannot grow in size or change type.</p> <p>In Haskell, these types have the following definitions:</p><code><pre><br />data Stream a = forall s . Stream (s -> Step s a) s Size<br />data Step s a = Yield a s | Skip s | Done<br /><br />newtype New a = New (forall s . 
ST s (MVector s a))<br /></pre></code><p>To avoid doing multiple passes over the data, the loop in a <code>Stream</code> is represented as an initial state and a state transformation function that corresponds to a single iteration of the loop. Note that the loop is not forced to yield a value on every iteration, allowing the efficient implementation of functions like <code>filter</code>.</p> <p>Conversion between these representations is achieved by a triplet of functions that go through the three representations in a cyclic fashion:</p><code><pre><br />stream :: Vector a -> Stream a<br />fill :: Stream a -> New a<br />new :: New a -> Vector a<br /></pre></code> <p>Fusion can then be achieved by removing entire cycles, like so:</p><code><pre><br />{-# RULES<br /> "fusion" forall s . stream (new (fill s)) = s<br /> "recycling" forall n . fill (stream (new n)) = n<br /> #-}<br /></pre></code> <p>As an example of how this works, here is the reduction of <code>map f xs // us</code>:</p><code><pre><br />map f xs // us<br /> = {- inlining -}<br />new (fill (mapS f (stream xs))) // us<br /> = {- inlining -}<br />new (updateN (fill (stream (new (fill (mapS f (stream xs)))))) us)<br /> = {- fusion or recycling -}<br />new (updateN (fill (mapS f (stream xs))) us)<br /></pre></code><p>The result only has one <code>new</code> and so only allocates a single vector.</p> <h2>The Current Handling of Index Space Transformations</h2> <p>Let us take, for example, the expression <code>reverse (reverse xs)</code>. Clearly, this should be optimized into at worst a single copy and at best a noop. However, in the current system, two reversals take place, and two arrays are allocated. This unfortunate fact is documented by a comment in the <code>vector</code> source:</p><code><pre><br />-- FIXME: make this fuse better, add support for recycling<br /></pre></code> <p>So why does this happen? 
The fundamental problem is that <code>Stream</code>s cannot, in general, be reversed - this would require running a loop backwards. The best the library can do is stream the original vector in reverse - after all, vectors have random access. In fact, this is how <code>reverse</code> is implemented:</p><code><pre><br />reverse = unstream . streamR<br /></pre></code><p>where <code>unstream = new . fill</code> and <code>streamR</code> does the aforementioned backwards streaming.</p> <h2>Solving the Problem With Delayed Arrays</h2> <p><code>repa</code>, more so than <code>vector</code> alone, has a large number of these index space transformations. As such, <a href="https://www.cse.unsw.edu.au/~chak/papers/repa.pdf">the original <code>repa</code> paper</a> contains a solution - delayed arrays. The basic idea is that an array can be completely described by its size and its indexing function. This representation is perfect for our index space transformations, as they are now simply function composition. Now, despite this optimization still being in <code>repa</code> today, it has a fundamental problem: the decision of where to use delayed arrays and where to use manifest arrays is forced upon the user. Instead of following this path, let us try to integrate it into the existing recycling framework.</p> <p>The first step is to create the new representation and corresponding functions:</p><code><pre><br />data Delayed a = Delayed {-# UNPACK #-} !Int (Int -> a)<br /><br />delay :: Vector a -> Delayed a<br />delay v = Delayed (length v) (unsafeIndex v)<br /></pre></code> <p>Now, where in the cycle of representations should the delayed arrays be? The ease of converting from a <code>Vector</code> to a <code>Delayed</code> suggests that it should sit between <code>Vector</code> and <code>Stream</code>. 
In fact, writing the new <code>stream</code> function is surprisingly easy - previously, it was defined as follows:</p><code><pre><br />stream v = v `seq` n `seq` (Stream.unfoldr get 0 `Stream.sized` Exact n)<br /> where<br /> n = length v<br /><br /> {-# INLINE get #-}<br /> get i | i >= n = Nothing<br /> | otherwise = case basicUnsafeIndexM v i of Box x -> Just (x, i+1)<br /></pre></code> <p>The changed code is actually simpler:</p><code><pre><br />stream (Delayed n ix) = Stream.unfoldr get 0 `Stream.sized` Exact n<br /> where<br /> {-# INLINE get #-}<br /> get i | i >= n = Nothing<br /> | otherwise = Just (ix i, i+1)<br /></pre></code> <p>With that in place, we can finish the framework by updating our rewrite rules:</p><code><pre><br />{-# RULES<br /> "fusion" forall s . stream (delay (new (fill s))) = s<br /> "recycling" forall n . fill (stream (delay (new n))) = n<br /> "delaying" forall d . delay (new (fill (stream d))) = d<br /> #-}<br /></pre></code> <p>The last thing to do is implement the new functions:</p><code><pre><br />reverseD :: Delayed a -> Delayed a<br />reverseD (Delayed n ix) = Delayed n (\i -> ix (n - 1 - i))<br /><br />reverse :: Vector a -> Vector a<br />reverse = new . fill . stream . reverseD . delay<br /><br />-- I don't do bound checks for clarity<br />indexD :: Delayed a -> Int -> a<br />indexD (Delayed _ ix) i = ix i<br /><br />(!) :: Vector a -> Int -> a<br />v ! i = indexD (delay v) i<br /></pre></code> <h2>Choosing Which Representation to Work On</h2><p>The change described above, keeping the definitions of all functions that aren't index space transformations the same, has strictly better performance than the current <code>vector</code>. However, this isn't enough. 
Consider the reduction of <code>reverse (map f xs)</code>:</p> <code><pre><br />reverse (map f xs)<br /> = {- inlining -}<br />new (fill (stream (reverseD (delay (new (fill (mapS f (stream (delay xs)))))))))<br /></pre></code> <p>Because no full cycle is present, no rules can fire, and an intermediate array is allocated. This is suboptimal, however, as <code>map</code> can also work on delayed arrays:</p> <code><pre><br />mapD :: (a -> b) -> Delayed a -> Delayed b<br />mapD f (Delayed n ix) = Delayed n (f . ix)<br /><br />reverse (map f xs)<br /> = {- inlining -}<br />new (fill (stream (reverseD (delay (new (fill (stream (mapD f (delay xs)))))))))<br /> = {- delaying -}<br />new (fill (stream (reverseD (mapD f (delay xs)))))<br /></pre></code> <p>Switching to the delayed representation doesn't work in all cases either. When <code>map</code> is fed a stream, it should ideally run <code>mapS</code>, as using a delayed representation would require an allocation. These problems plague many other functions that similarly can work on multiple representations, such as <code>append</code>, <code>filter</code>, and even <code>reverse</code>, which can operate on <code>New</code> as well as <code>Delayed</code> arrays.</p> <p>As it turns out, the array recycling paper saw this problem and came up with a solution. Unfortunately, it is very specialized: it only works on single-argument <code>Stream</code>-transforming functions that don't increase the size of the array. This works, for example, on <code>filter</code> or monomorphic uses of <code>map</code>, but excludes <code>append</code> and <code>reverse</code>. In addition, it only works for converting between <code>Stream</code> and <code>New</code>. For the general case, we need a new solution:</p> <code><pre><br />{-# RULES<br /> "reverseN/fill..." forall d . reverseN (fill (stream d)) = fill (stream (reverseD d))<br /> "reverseD/delay..." forall n . 
reverseD (delay (new n)) = delay (new (reverseN n))<br /><br /> "appendS/stream" forall d1 d2 . appendS (stream d1) (stream d2) = stream (appendD d1 d2)<br /> "appendD/delay.../1" forall s1 d2 . appendD (delay (new (fill s1))) d2 = delay (new (fill (appendS s1 (stream d2))))<br /> "appendD/delay.../2" forall d1 s2 . appendD d1 (delay (new (fill s2))) = delay (new (fill (appendS (stream d1) s2)))<br /><br /> "mapS/stream" forall f d . mapS f (stream d) = stream (mapD f d)<br /> "mapD/delay..." forall f s . mapD f (delay (new (fill s))) = delay (new (fill (mapS f s)))<br /> "mapD/delay... (monomorphic)" forall f n . mapD f (delay (new n)) = delay (new (transform (mapS f) n))<br /> "transform/fill" forall f s . transform f (fill s) = fill (f s)<br /> #-}<br />-- and so on<br /></pre></code> <p>The above code demonstrates three different parts of the system. In the <code>reverse</code> case, we have a simple single-argument function that can operate uniformly on multiple representations. As such, all we do is try to move the call inward, or rightward. This moves as many representation-changing functions as possible to the outside of the expression where, all clumped together, they form cycles and can be removed. For example,</p> <code><pre><br />reverse (filter f xs)<br /> = {- inlining -}<br />new (fill (stream (reverseD (delay (new (fill (filterS f (stream (delay xs)))))))))<br /> = {- reverseD/delay... -}<br />new (fill (stream (delay (new (reverseN (fill (filterS f (stream (delay xs)))))))))<br /> = {- recycling -}<br />new (reverseN (fill (filterS f (stream (delay xs)))))<br /></pre></code> <p>The <code>append</code> case follows the same pattern, but it is slightly different in its handling of <code>new</code>. Because <code>new</code> is the only representation-changing function that actually does work, adding in an extra <code>stream</code> to have a higher chance of fusing away the allocation is perfectly acceptable. 
As such, we aggressively move allocations outward even if only one is available. This change also makes the rewrite system confluent. Note that we can't do the same aggressive movement with the other representation-changing functions, as we would have to introduce new allocations. In action,</p> <code><pre><br />append (filter f xs) (reverse ys)<br /> = {- inlining -}<br />new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (delay (new (fill (stream (reverseD (delay ys)))))))))<br /> = {- delaying -}<br />new (fill (stream (appendD (delay (new (fill (filterS f (stream (delay xs)))))) (reverseD (delay ys)))))<br /> = {- appendD/delay.../1 -}<br />new (fill (stream (delay (new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))))))<br /> = {- recycling -}<br />new (fill (appendS (filterS f (stream (delay xs))) (stream (reverseD (delay ys)))))<br /></pre></code> <p><code>map</code> is an interesting function because it can work on <code>Delayed</code> arrays, <code>Stream</code>s, and <code>New</code>, but on <code>New</code> only if the function preserves types. Because of this, we have two cases: polymorphic and monomorphic. In the polymorphic case, the <code>mapS/stream</code> and <code>mapD/delay...</code> rules cycle through <code>Stream</code> and <code>Delayed</code>. In the monomorphic case, however, there is a different cycle, formed by <code>mapS/stream</code>, <code>mapD/delay... (monomorphic)</code>, and <code>transform/fill</code>.</p> <h2>The second phase: fixing smaller inefficiencies</h2><p>As it turns out, this system is provably optimal in the number of allocations. Unfortunately, allocations aren't the only thing determining performance. A system very similar to the one described above was rejected because of this. 
The problem is illustrated by the following example:</p> <code><pre><br />map (> 5) (map (+1) (xs // ps))<br /> = {- inlining -}<br />new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))<br /> = {- delaying -}<br />new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))<br /> = {- mapD/delay... (monomorphic) -}<br />new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps)))))))<br /></pre></code> <p>Although the result has the optimal two allocations, the two <code>map</code>s are executed in separate loops and so cannot be properly fused. To rectify this, I propose a two-phase system. In phase 0, the previously described system is run, eliminating all unnecessary allocations. In phase 1, we do a number of "fixup" transformations that undo some of the inward movement done in phase 0, like so:</p> <code><pre><br />{-# RULES<br /> "stream/mapD" [1] forall f d . stream (mapD f d) = mapS f (stream d)<br /> "delay.../mapS" [1] forall f s . delay (new (fill (mapS f s))) = mapD f (delay (new (fill s)))<br /><br /> "stream.../transform" [1] forall f n . stream (delay (new (transform f n))) = f (stream (delay (new n)))<br /><br /> "delay.../reverseN" [1] forall n . delay (new (reverseN n)) = reverseD (delay (new n))<br /><br /> "stream/appendD" [1] forall d1 d2 . stream (appendD d1 d2) = appendS (stream d1) (stream d2)<br /> #-}<br /></pre></code> <p>Note that <code>map</code>, which can act equally efficiently on <code>Delayed</code> arrays and <code>Stream</code>s, simply moves outward, switching between the two representations in an effort to get out of the way of other transformations. However, <code>reverse</code> runs much more efficiently on <code>Delayed</code> arrays than on <code>New</code>, so it only switches away from <code>New</code>. 
Similarly, <code>append</code> uses <code>Stream</code>s and even monomorphic uses of <code>map</code> try to avoid <code>New</code>. To see this working:</p> <code><pre><br />map (> 5) (map (+1) (xs // ps))<br /> = {- inlining -}<br />new (fill (stream (mapD (> 5) (delay (new (fill (stream (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))))))<br /> = {- delaying -}<br />new (fill (stream (mapD (> 5) (mapD (+1) (delay (new (update (fill (stream (delay xs))) ps)))))))<br /> = {- mapD/delay... (monomorphic) -}<br />new (fill (stream (mapD (> 5) (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps)))))))<br /> = {- entering phase 1, stream/mapD -}<br />new (fill (mapS (> 5) (stream (delay (new (transform (mapS (+1)) (update (fill (stream (delay xs))) ps)))))))<br /> = {- stream.../transform -}<br />new (fill (mapS (> 5) (mapS (+1) (stream (delay (new (update (fill (stream (delay xs))) ps)))))))<br /></pre></code> <h2>Future Work</h2><p>While this system works fairly well, there are a number of ways in which it could be improved.</p> <h3>More representations</h3><p>While delayed arrays help fusion significantly, other representations could also be useful. For example, the expression <code>reverse (filter f xs)</code> is currently compiled to <code>new (reverseN (fill (filterS f (stream (delay xs)))))</code>, which allocates a filtered array and then reverses it in place. Ideally, the system would simply write out the array in reverse, requiring no post-processing step. This could be accomplished with a representation similar to <code>Stream (Int, a)</code>, which would specify not only what the elements are but also where to write them.</p> <h3>Multiple cycles of representations</h3><p>The aforementioned reverse filling only works if the number of elements has a bound known before calculation. However, due to functions like <code>concatMap</code>, this is not necessarily true. 
To fix this, there would have to be a static distinction between the two types of <code>Stream</code>s. This would cause there to be two different ways of getting from a <code>Delayed</code> array to <code>New</code>, and so there would be multiple possible cycles of representations.</p> <h3>Commutativity optimizations</h3><p>The current system of fusion puts functions in a strict order, conservatively assuming that each one could do almost anything, and so none of them can be reordered. However, this misses out on a lot of opportunities. For example, <code>filter f (reverse (filter g xs))</code> compiles to</p> <code><pre><br />new (fill (filterS f (stream (reverseD (delay (new (fill (filterS g (stream (delay xs))))))))))<br /></pre></code> instead of to the much more efficient <code><pre><br />new (fill (filterS f (filterS g (stream (reverseD (delay xs))))))<br /></pre></code> <h2>Proof of allocation optimality</h2><p>For those of you who are interested, here is my proof that the simple inward-moving strategy produces the optimal number of allocations.</p> <p>First of all, we define an expression to be a finite rooted tree whose internal vertices are functions, associated with one of a finite set of representations. The leaves and the root are special, abstract nodes that are all associated with the same representation, which we will call the manifest representation. The root has exactly one child. Note that this disallows functions that return multiple results or take different representations as arguments.</p> <p>Now, assume there is some total order on the set of representations, with the manifest representation being the minimum. We say that an edge of an expression is an allocating edge if the child node's representation is greater than the parent node's representation. We say that the cost of an expression is the number of allocating edges in that expression.</p> <p>Additionally, every function is associated with a fixed, nonempty set of allowed representations. 
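For concreteness, this construction is small enough to express as a toy model. The sketch below is my own illustration (none of its names come from the actual system): representations are <code>Int</code>s with 0 as the manifest representation, and <code>inward</code> anticipates the inward optimization defined below.

```haskell
-- A toy model of this section's definitions (my own sketch).
-- Representations are Ints, with 0 as the manifest representation; each
-- internal node records its allowed representations and a chosen one.
data Expr = Leaf                    -- leaves use the manifest representation
          | Node [Int] Int [Expr]   -- allowed reps, chosen rep, children

rep :: Expr -> Int
rep Leaf         = 0
rep (Node _ r _) = r

-- The cost: the number of allocating edges, i.e. edges whose child
-- representation is greater than the parent's.  The root is manifest,
-- so the cost of a whole expression e is (cost 0 e).
cost :: Int -> Expr -> Int
cost _ Leaf = 0
cost parent (Node _ r cs) =
  (if r > parent then 1 else 0) + sum (map (cost r) cs)

-- The inward optimization, bottom-up: pick the smallest allowed
-- representation at least as large as every child's, ignoring children
-- whose representation exceeds all allowed ones.
inward :: Expr -> Expr
inward Leaf = Leaf
inward (Node allowed _ cs) = Node allowed r' cs'
  where
    cs'  = map inward cs
    kids = [rep c | c <- cs', rep c <= maximum allowed]
    r'   = minimum [a | a <- allowed, all (<= a) kids]
```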
We say that an expression is valid if every function is associated with one of its allowed representations.</p> <p>Next we say that the inward optimization is the transformation from expressions to expressions that, at every node, sets its representation to the smallest allowed representation that is greater than or equal to the representations of its children, discounting the children that are larger than all allowed representations. This is done in a bottom-up fashion, starting with the nodes just above the leaves, thereby making this transformation independent of the previous representations and so idempotent. Note that this is equivalent to the set of rewrite rules described above, as the rules that don't deal with allocations simply lower the representation of a node if the new representation is greater than or equal to that of all the children, and the aggressive rules that do deal with allocations simply raise the representation to the largest one allowed if a large child is detected.</p> <h3>Lemma 1: The inward optimization returns a valid expression</h3><p>Because the inward optimization, by definition, assigns a representation to every function that is one of its allowed representations, it returns a valid expression.</p> <h3>Lemma 2: The inward optimization does not increase the cost of an expression if the input was valid</h3><p>It suffices to prove that transforming a single node does not increase the cost of the expression, as the inward optimization simply transforms the tree one node at a time. Now, we can ignore the edges to the children that have a representation larger than all the allowed representations, as all valid assignments must make those edges allocating, and this optimization does not change that. The inward optimization sets the remaining edges to not be allocating, as, by definition, it picks a representation greater than or equal to the children's representations. 
Therefore, the only way for the inward transformation to increase the cost would be switching the parent edge to be allocating where previously no edges were allocating. However, if this were the input state, then there exists some allowed representation greater than or equal to those of the children and less than or equal to that of the parent, and the minimum allowed representation greater than or equal to those of the children, which the inward optimization picks, would satisfy this property.</p> <p>Alternatively, given the correspondence with the rewrite rules given earlier in this post and the fact that none of those rewrite rules increase the number of allocations, repeated use of those rules, and hence the inward optimization, doesn't increase the number of allocations.</p> <h3>The final proof: The inward optimization produces a minimum-cost expression</h3><p>Consider a valid assignment of representations with minimum cost. Applying the inward optimization to it cannot increase the cost, and as we started with a minimum-cost expression, it cannot decrease the cost. Therefore, the result must also have minimum cost. However, the inward optimization is independent of the starting representations, so applying the inward optimization to any tree with the same functions produces a minimum-cost expression.</p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-83291950105166963102012-10-27T14:06:00.001-07:002012-10-27T14:06:17.306-07:00Disciple-style Regions in Haskell, Part 1<p>I've admired <a href="http://disciple.ouroborus.net/">ddc</a> for quite a while, in particular its region system. Unfortunately, it also has many annoyances, and one big one is complexity. To rectify this, I've created a system implementing regions in Haskell. 
</p><h2>The Essence of Regions</h2><p>At first we might adopt a definition of regions based on their use in memory management: they are areas where you can allocate memory and work with it, as shown by the class below. <code><pre><br />-- WRONG!<br />class (Monad (Environment r)) => Region r where<br /> data Ref r :: * -> *<br /> type Environment r :: * -> *<br /><br /> newRef :: a -> Environment r (Ref r a)<br /> readRef :: Ref r a -> Environment r a<br /> writeRef :: a -> Ref r a -> Environment r ()<br /></pre></code>There are a couple of instances of this - consider ST, or IO, or STM. Unfortunately, this is not what we want. We need to support, for example, immutable regions. Not only that, but in rare cases we might want a write-only region (as a random example, a password store). You might have regions that you can't create data in. In the end, we have reduced the idea of a region to something very small: a place with data. <code><pre><br />data family Ref r :: * -> *<br /></pre></code>Note that this also allows us to do away with the monad. Of course, mutation and reading and creation are common, so we should have some classes: <code><pre><br />class (Monad m) => Writable r m where<br /> writeRef :: a -> Ref r a -> m ()<br /><br />class (Monad m) => Readable r m where<br /> readRef :: Ref r a -> m a<br /><br />class (Monad m) => Creatable r m where<br /> createRef :: a -> m (Ref r a)<br /></pre></code>Following the previous idea of not needing a fixed monad, we do not restrict a reference to have a single monad associated with it. 
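One payoff of leaving the monad unconstrained is that helpers can be written once against these classes and reused with every region. The following self-contained sketch (repeating the classes from above; <code>modifyRef</code> and the <code>Mut</code> instance are my own illustration, not from any library) shows a generic modify running in <code>ST</code>:

```haskell
{-# LANGUAGE TypeFamilies, MultiParamTypeClasses, FlexibleInstances,
             ScopedTypeVariables #-}
import Control.Monad.ST
import Data.STRef

data family Ref r :: * -> *

class Monad m => Writable r m where
  writeRef :: a -> Ref r a -> m ()

class Monad m => Readable r m where
  readRef :: Ref r a -> m a

class Monad m => Creatable r m where
  createRef :: a -> m (Ref r a)

-- Written once, usable with any (region, monad) pair that can read and write.
modifyRef :: (Readable r m, Writable r m) => (a -> a) -> Ref r a -> m ()
modifyRef f ref = do
  val <- readRef ref
  writeRef (f val) ref

-- One instance to try it out with: mutable ST-based references.
data Mut s
newtype instance Ref (Mut s) a = MutRef (STRef s a)

instance Writable (Mut s) (ST s) where
  writeRef val (MutRef ref) = writeSTRef ref val
instance Readable (Mut s) (ST s) where
  readRef (MutRef ref) = readSTRef ref
instance Creatable (Mut s) (ST s) where
  createRef val = fmap MutRef (newSTRef val)

demo :: Int
demo = runST go
  where
    go :: forall s. ST s Int
    go = do
      r <- createRef 1 :: ST s (Ref (Mut s) Int)
      modifyRef (+ 1) r
      readRef r
```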
</p><h2>Some Examples</h2><p><code><pre><br />data Mut s<br />newtype instance Ref (Mut s) a = MutRef (STRef s a)<br /><br />instance Writable (Mut s) (ST s) where<br /> writeRef val (MutRef ref) = writeSTRef ref val<br /><br />instance Readable (Mut s) (ST s) where<br /> readRef (MutRef ref) = readSTRef ref<br /><br />instance Creatable (Mut s) (ST s) where<br /> createRef val = fmap MutRef $ newSTRef val<br /><br />data Immut<br />newtype instance Ref Immut a = ImmutRef a<br /><br />instance (Monad m) => Readable Immut m where<br /> readRef (ImmutRef val) = return val<br /><br />instance (Monad m) => Creatable Immut m where<br /> createRef val = return (ImmutRef val)<br /><br />data Atomic<br />newtype instance Ref Atomic a = AtomicRef (TVar a)<br /><br />instance Writable Atomic STM where<br /> writeRef val (AtomicRef ref) = writeTVar ref val<br /><br />instance Readable Atomic STM where<br /> readRef (AtomicRef ref) = readTVar ref<br /><br />instance Creatable Atomic STM where<br /> createRef val = fmap AtomicRef $ newTVar val<br /><br />instance Writable Atomic IO where<br /> writeRef val (AtomicRef ref) = atomically (writeTVar ref val)<br /><br />instance Readable Atomic IO where<br /> readRef (AtomicRef ref) = readTVarIO ref<br /><br />instance Creatable Atomic IO where<br /> createRef val = fmap AtomicRef $ newTVarIO val<br /></pre></code>To support things like pointers or specialized references, we probably want to add the ability to restrict the types that can be put in a Ref, but this is beyond the scope of this post. </p><h2>Building Complex Data Structures</h2><p>As an example, I will build a cyclic doubly linked list with this framework. 
Actually constructing the representation and basic operations of a complex data structure is pretty easy: <code><pre><br />data DList r a = DList (Ref r (DList r a)) a (Ref r (DList r a))<br /><br />head :: (Monad m) => DList r a -> m a<br />head (DList _ x _) = return x<br /><br />tail :: (Readable r m, Writable r m) => DList r a -> m (DList r a)<br />tail (DList lRef _ rRef) = do<br /> l@(DList _ _ rl) <- readRef lRef<br /> r@(DList lr _ _) <- readRef rRef<br /> writeRef l lr<br /> writeRef r rl<br /> return r<br /><br />singleton :: (Creatable r m, MonadFix m) => a -> m (DList r a)<br />singleton x = do rec<br /> l <- createRef result<br /> r <- createRef result<br /> let result = DList l x r<br /> return result<br /><br />-- And so on...<br /></pre></code>However, how do we switch between representations? With arrays, we usually have the freeze and thaw methods. Here, we have a possibly infinite collection of regions to switch between - how can we do it? I'll post my answer in part 2, since this post is getting a bit long. </p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0tag:blogger.com,1999:blog-133066169497036563.post-1691457237517924702012-10-12T15:33:00.000-07:002012-11-03T12:53:59.227-07:00Stream Fusion for Conduits<p>Recently I've been looking at Michael Snoyman's <a href="http://hackage.haskell.org/package/conduit/">conduit</a> package. Now, conduit is a pretty impressive library, but I was skeptical about its performance. If you look at its source, you see almost every function recursively going down the stack of conduit operations and creating a new one. Not only that, but deforestation is scarce, so it is hard to justify constantly rebuilding these trees. </p><p>To fix this, the first thing I tried was using the free monad implementation described <a href="http://comonad.com/reader/2011/free-monads-for-less-2/">here</a>. Unfortunately, this has a large problem: anything besides monad operations is not easy. 
Functions like <code>pipe</code>, which composes pipes, require building a structure and then traversing it. </p><p>After trying a number of similar methods (combining Church encoding with Scott encoding, etc.), all of which were equally unsuccessful, I came upon the solution: Stream Fusion. </p><h2>Stream Fusion</h2><p><a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.7401">Stream Fusion</a> (this paper is very readable, by the way) is a technique for removing unnecessary list or array traversals by using an inversion of control. The key definitions are as follows: <code><pre><br />data Step s a = Skip s | Yield s a | Done<br />data Stream a = forall s . Stream (s -> Step s a) s<br /><br />stream :: [a] -> Stream a<br />stream {-list-} = Stream next {-list-} where<br /> next [] = Done<br /> next (a : as) = Yield as a<br /><br />unstream :: Stream a -> [a]<br />unstream (Stream next s0) = go s0 where<br /> go s = case next s of<br /> Done -> []<br /> Skip s -> go s<br /> Yield s a -> a : go s<br /></pre></code>The <a href="http://hackage.haskell.org/package/stream-fusion">library on hackage</a> modifies these functions lightly to force constructor specialization to happen, but this is irrelevant. 
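To make the inversion of control concrete, a consumer is just one tail-recursive loop over the step function. Here is a minimal, self-contained sketch (<code>sumStream</code> is my own name for illustration; the real package provides analogous folds):

```haskell
{-# LANGUAGE ExistentialQuantification, BangPatterns #-}
-- Self-contained copies of the definitions above, plus a consumer.
data Step s a = Skip s | Yield s a | Done
data Stream a = forall s . Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream = Stream next
  where
    next []       = Done
    next (a : as) = Yield as a

-- A consumer: the only recursion is this one tight loop, no matter how
-- many non-recursive steppers sit between it and 'stream'.
sumStream :: Num a => Stream a -> a
sumStream (Stream next s0) = go 0 s0
  where
    go !acc s = case next s of
      Done       -> acc
      Skip s'    -> go acc s'
      Yield s' a -> go (acc + a) s'
```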
</p><p>Next we have a library full of functions on streams, like so: <code><pre><br />mapStream :: (a -> b) -> Stream a -> Stream b<br />mapStream f (Stream next s) = Stream next' s where<br /> next' s = case next s of<br /> Done -> Done<br /> Yield s' a -> Yield s' (f a)<br /> Skip s' -> Skip s'<br /><br />filterStream :: (a -> Bool) -> Stream a -> Stream a<br />filterStream p (Stream next s) = Stream next' s where<br /> next' s = case next s of<br /> Done -> Done<br /> Yield s' a<br /> | p a -> Yield s' a<br /> | otherwise -> Skip s'<br /> Skip s' -> Skip s'<br /></pre></code>The <code>filterStream</code> function here is especially interesting, as it shows how the <code>Skip</code> constructor can be useful - it allows <code>filterStream</code> to be non-recursive. This helps GHC's optimizer immensely. </p><p>Next, we implement a list library on top of this stream library: <code><pre><br />map :: (a -> b) -> [a] -> [b]<br />map f = unstream . mapStream f . stream<br /><br />filter :: (a -> Bool) -> [a] -> [a]<br />filter p = unstream . filterStream p . stream<br /></pre></code>Note that if we inline these definitions, we get things like this: <code><pre><br />map f . filter g . map h =<br /> unstream . mapStream f .<br /> stream . unstream .<br /> filterStream g .<br /> stream . unstream .<br /> mapStream h . stream<br /></pre></code>Notice how <code>stream . unstream</code> shows up a lot? Well, at least approximately, streams and lists are isomorphic, so we can add this rule to our library: <code><pre><br />{-# RULES<br />"stream/unstream" forall s . stream (unstream s) = s<br /> #-}<br /></pre></code>With this simplification, we now have only one <code>stream</code> and one <code>unstream</code> in our pipeline. This means that we now have one tight loop, instead of three, increasing performance. </p><h2>Applying This to Conduits</h2><p>So far, our discussion has only applied to lists. How do we do something like this for, say, a conduit? 
The answer lies near the end of the paper, in section 9.3, which brings up the example of a binary tree: <code><pre><br />data Tree a b = Leaf a | Branch b (Tree a b) (Tree a b)<br /></pre></code>To apply stream fusion to this type, we modify the <code>Step</code> type: <code><pre><br />data Step a b s = Leaf_s a | Branch_s b s s | Skip s<br />data Stream a b = forall s . Stream (s -> Step a b s) s<br /></pre></code>Moving on to conduits, we have what follows. <code><pre><br />data Step l i o u m r s =<br /> HaveOutput s (m ()) o<br /> | NeedInput (i -> s) (u -> s)<br /> | Done r<br /> | PipeM (m s)<br /> | Leftover s l<br /> | Skip s<br />data Stream l i o u m r = forall s . Stream (s -> Step l i o u m r s) s<br /></pre></code>The full code is available <a href="http://dl.dropbox.com/u/24141626/Stream.hs">here</a>. </p><h2>Performance</h2><p>For an optimization library, benchmarks are pretty important. I've only done preliminary testing, but <a href="http://dl.dropbox.com/u/24141626/report.html">here</a> are the results. Especially for something so quickly thrown together by someone who doesn't understand the nuances of SpecConstr, I'd say the results are very promising. </p>Jonathan S.http://www.blogger.com/profile/06880196625350213890noreply@blogger.com0