Wednesday, April 17, 2013

C and C++ need inline directives at the call site


I don't like the inline keyword; I think it's incredibly broken.

Inlining heuristics make reasonable assumptions about what should be inlined, but the user has essentially no control short of something like VC++'s __forceinline directive. Automatic inlining works well with trivial functions, the kinds that populate the cctype/ctype.h header, but fails when the desired inline function grows beyond a few statements.

The ability to inline code can be quite a powerful tool for developers, especially those who have had to develop for consoles like the Xbox 360 and PS3. But because the compiler has incomplete information about where the inlining really needs to go, and because there's no way in the language to give it better hints, there is often bloat in areas where you don't want it and no inline expansion in areas where you do.

I've worked around this in the past by abusing macro expansion, but it's not a very clean solution, and makes debugging difficult.

What I would like to see is the ability to use an inline directive at the call site, so that I can pick and choose where the inline expansion happens as needed. One example I can think of is multiplying two matrices in a video game. In some cases, say when creating the view/projection matrix, an inline expansion of the multiply would be unnecessary code bloat because the operation typically happens only once per game frame.

Mat4x4 viewProjection = Mat4x4Multiply(view, projection);

While transforming joint hierarchies for animation there are typically a number of multiplications that happen in a tight loop where inlining would be beneficial:

for (int i = 0; i < numBonesInModel; ++i) {
    Mat4x4 transformedJoint = inline Mat4x4Multiply(jointMatrix[i], jointMatrix[parent[i]]);
}

I want more control over what gets inlined and what doesn't, without resorting to macros, and I think being able to specify at the call site what I want inlined would go a long way toward that goal.

Sunday, March 31, 2013

Scheme in 5000 lines of C part 5: Collecting garbage

I was hoping to get this post up sooner but the last week and a bit has been complicated by my appendix exploding. I'm on the upswing though and now am starting to think well enough to be able to dig through my Scheme todo list.

When this series last left off, I had begun working on the garbage collector by adding a test to exhaust the heap, and then re-wrote the evaluation engine. So I didn't actually do any work on the garbage collector itself.

A few years ago I worked on a small team that built a .NET environment for video game consoles, which helped spark my interest in compilers and runtime environments. We tried a number of types of garbage collectors while trying to hit our performance goals: simple non-precise mark/sweep, Boehm, and several different types of precise collectors. We had the most success with two designs: a half-space (Cheney) collector, and a final design based on an algorithm somewhat similar to Clinger's non-predictive generational collector.

At points I experimented with generational collectors, but I found that they work best when you can expand the tenured generation as objects are copied into it. As that isn't really feasible on a video game console, where your memory budget is fixed and there's no pagefile to fall back on, they never made the cut.

What I found while profiling the games being built on this .NET runtime is that objects either died really quickly or they persisted forever (ie: more than a few frames). By breaking the heap into many small (typically page-sized) buckets, often when a collect was triggered there would be enough buckets with no remaining live objects that the collect process could finish without having to move/compact the heap at all. This was a huge advantage over the half-space collector we had used previously, where every live object was copied on every collection.

What does this have to do with Scheme in 5000 Lines of C? The collector used in this Scheme is going to use some similar ideas. For example, it's going to use small, fixed-size buckets. It's also going to compact the heap only if it has to. Luckily this garbage collector doesn't have to worry about the complications that a collector for .NET needs, like finalizers and threading.

I've got the initial root-set scan and collection of empty buckets implemented. The object handles I mentioned last time are now quite important because they act as read barriers for code outside of the system, so that it sees an object's correct location if the object moves underneath it. It's a bit cumbersome to use in C for now, but this is a trade-off I'm making in order to make the collector fast, and to impress on the user (me, right now) that it's a bad idea to peek and poke at stuff in the scripting system willy-nilly.

One of the most hair-tearing-out problems I've found when writing a precise garbage collector is making sure that you do not miss any roots. If you do, you'll eventually collect an object's space before it's dead, and then you're in a world of hurt. The read barrier is an easy way to make sure you do not miss any roots if you do not have the luxury of being able to control code generation so as to generate stack frame maps.

The reader also got a lot of attention today. Sorry, not you :), but the core of the code called when the user calls (read ...). I've weaned it off the CRT allocation functions and it now allocates all of its intermediate structures in the GC heap. This has been on my todo list for a while.

The source tree was also reworked so as to be a little more friendly for use as a library -- a public include folder was created separate from the library source code.

I've got a few things up next to take care of. I want to start working on closures soon, as well as heap compaction. Maybe before the end of the month I can get a few of the Computer Language Benchmarks Game benchmarks written and added to the test suite as well.

View the source on GitHub: https://github.com/maxburke/evilscheme

Sunday, March 17, 2013

Scheme in 5000 lines of C part 4: (Un)necessary refactorings

Last week I added the first test of the garbage collector, which simply allocated memory until the heap was exhausted. Since the garbage collector doesn't do anything right now but allocate, it was expected that this test would run until the gc_collect function was called, where it would hit a breakpoint. The test was simple:


(begin
    (define gc-test (lambda (count param)
        (if (< count 1048576)
            (gc-test (+ 1 count) #(1 2 3 4 5))
            "done")))
    (gc-test 0 0))

It calls itself a million times, allocating a 5-element vector each time. Since the test harness only uses a 1 MB heap, it should exhaust pretty quickly. (The compiler is currently pretty stupid and doesn't do any sort of optimization that would eliminate this allocation.)

I ran into one issue first. Up until now, all my tests were testing compilation, so one test case would define a test function and the next would call it. Now I wanted to do both in a single test case, because I'm lazy and don't want to split every test into one case where definition happens and another where execution happens.

The system didn't have any ability to handle the (begin ...) syntax. This isn't the end of the world because (begin ...) is pretty easy to implement -- you loop over all the contained forms, evaluating each one, and the result is the value of the last form evaluated -- but it needed to be implemented in two places. Most primitive syntax features like define/set/car/cdr/etc. have an implementation for when they are encountered by the compiler, and an implementation in C for when they are evaluated at the top level. So,

(define foo 3)

and

((lambda () (define foo 3)))

would follow two separate implementations. This stemmed from the early roots of "let's see if this will actually work", before there was a compiler.

But I was feeling lazy. I could have implemented begin for both the C code and the VM, but I decided to refactor the evaluation engine so that in the future I only need to implement functionality once. This was done by wrapping every top-level form in an anonymous lambda and ensuring that it is run by the VM. The example of (define foo 3) above would be translated into ((lambda () (define foo 3))) and evaluated.

There were a few other changes that came out of this refactoring. First, procedure object storage is now broken into two parts: a record that stores meta information for the function and the environment in which the function was created, plus some function-local variable slots. These slots are not local variables in the sense of temporaries during execution, but rather storage for objects the function itself might need to refer to, such as other procedure objects (for closures), raw objects from the parser (ie: from (quote ...)), etc. I think this will make closures much easier to tackle.

Second, I had to recognize the eventuality that I'd have to call C functions from the VM and that this new execution model would require it for things like (lambda ...).

Third, I've introduced an object handle. This is meant to be a container tracked by the garbage collector, so that a reference stays valid if a collection is triggered between two successive calls. Say, for example, you have this code:

struct object_t *foo = gc_alloc(...);
struct object_t *bar = gc_alloc(...);

And the second call to gc_alloc() triggers a collection because the heap was full. If the collector doesn't know about foo and the storage behind that object moves, we're going to be spending weeks finding out what has happened. The handle type is meant to make this more robust by basically placing a read barrier on that reference.

Next up is garbage collection.

(The project is now officially over 6000 lines of content, though only 4200 lines of that is code. That still counts, right?)

Follow on to part 5: Scheme in 5000 lines of C part 5: Collecting garbage
View the source on GitHub: https://github.com/maxburke/evilscheme

Monday, March 11, 2013

Scheme in 5000 lines of C part 3: Testing, random thoughts on closures.

I haven't made as much progress on the Scheme recently as I would have liked. Work's been busy, the house has needed some fixing, and I'm still chewing over how I want to implement closures.

Since I want to make *some* progress I rewrote the testing framework. Before I just had some sample strings embedded in main that I read-evaled-printed, and looked at the output to see if the changes I made had the desired effect.

I felt some (a lot of) shame because (semi-)robust testing is supposed to be my day job. So I moved most of that away into a test harness that's pretty stupid/simple but actually performs validation. Funny that.

There's a directory, tests/, that contains the test cases. One per file. Each file has the input that is fed to read-eval-print, the output of which is captured, and then compared to the expected value.

I'm still not quite sure how I'm going to handle closures though. Variable capture I think is going to be fairly easy. I'm not quite sure how I'm going to store references to inner functions though. For example, this function:


(define closure-test
  (lambda (foo)
    (lambda ()
      (set! foo (+ 1 foo))
      foo)))

The outer function (closure-test) will need to hold some sort of managed reference to the inner function so that when closure-test is called it can create a new function object and a new environment, but I'm not quite sure if I want to bake the reference into the bytecode, or do something else. Maybe re-write function objects to include a table of contained functions? That might work.

Follow on to part 4: Scheme in 5000 lines of C part 4: (Un)necessary refactorings
View the source on GitHub: https://github.com/maxburke/evilscheme

Thursday, February 7, 2013

Scheme in 5000 lines of C part 2: (let ((them (eat 'cake))))

I've spent the last week implementing the Scheme binding constructs let and let* for my little toy Scheme (https://github.com/maxburke/evilscheme). In Scheme, as with Common Lisp, the difference between let and let* is that let* ensures the bindings are evaluated left-to-right, whereas for let the evaluation order is unspecified. So if you have let*, you get let for free. Easy peasy! (If I'm wrong please let me know!)

The binding constructs were surprisingly easy to implement. The VM, which started out as purely stack-based (ie, (+ 1 1) became PUSH 1, PUSH 1, ADD), now has instructions for loading from, and storing to, basically arbitrary locations on the stack. The compiler now creates a spot for the active local variables and loads from/stores to those stack slots as required. The slots are then recycled as scopes are entered and exited. On my list for future refactoring is merging the opcodes used to load from/store to stack slots with the opcodes used to load from/store to argument slots, because it's the same thing now.

The compiler -- so far -- is sort-of-multi-pass. It does a single pass over the source AST and does a rough compilation. Afterwards, it performs some clean-up passes on the generated bytecode.

One pass removes the nops that are inserted into the bytecode stream as a result of branching. One difficulty I ran into when compiling conditional branches is that in many cases the branch targets weren't yet known, and when they were known it was a lot of messy code to find where the actual branch target was. The workaround I picked was to insert nops as branch targets, and then use these nops as the first instruction of the resulting basic blocks. It sounds kinda convoluted but it was much cleaner and more straightforward to implement. This nop-removal pass, well, removes the nops and updates the branch instruction targets accordingly.

Another pass inspects all branch targets and if the instruction is a return, it replaces the branch with a return instruction. This is the only optimization pass. So far.

The last pass promotes all tail calls to use the VM's tailcall opcode. Scheme has a requirement that implementations must be tail recursive and, although I'm not gunning for 100% standards adherence, I wondered how difficult it would be to add. This pass takes calls that could be tail calls (basically any instruction sequence of a call followed immediately by a return) and replaces the pair with a tail call. The VM then ensures that these tail calls are executed in constant space.

I'm going to tackle closures next because it's been something on my want-to-do list for a while. After that I'll probably tackle the garbage collector, alternative lambda syntax (ie, (define (foo a b c)) in addition to (define foo (lambda (a b c)))), interactive REPL + error callbacks instead of just assertions, benchmarking, and proper quasi-quotation. I'm not sure what I'll do with continuations, that part of the spec I may skip entirely because I'm too dumb to understand how call/cc actually works, but some sort of co-routine mechanism will probably end up going in.

Follow on to part 3: Scheme in 5000 lines of C part 3: Testing, random thoughts on closures.
View the source on GitHub: https://github.com/maxburke/evilscheme

Monday, January 28, 2013

Scheme in 5000 lines of C

Every time I see a blog post spring up talking about a new (basic) implementation of Scheme that is written in 30 lines of Ruby, Javascript, or Python I get the sense that there's something missing.

In fact, there's a lot that's missing.

These environments already come with a garbage collector, so the implementation of one is never described. The intricacies and tradeoffs of object storage? Also skipped over. Parsing? Nope. What about compilation? Never done.

I appreciate code golf, and it's fun to see these new, super small implementations show up, and I don't mean to belittle the effort their developers invested; however, I haven't been able to learn much from them. So I thought I'd give it a go on my own.

The hard way.

In C.

Not that cutesy C with C++ comments, variable-declaration-wherever-you-want-it, or inline functions, but the super crusty C89 supported by Microsoft Visual C++, the one where Dave Cutler personally wrote the front end himself while fighting off the red menace and arm wrestling bears. (And GCC/clang too... I guess. Twist my rubber arm.)

I've got a bunch of stuff cobbled together now where it sorta kinda works. It parses, it generates an AST, it compiles the AST into a bytecode, and the bytecode runs. Hey, that almost makes it more functional than some video game console toolchains I've used.

One decision I made recently which simplified the VM conceptually was separating value types (fixnum/flonum/bool/char/reference/inner reference) from reference types (cons cell/vector/string/function). This allows the VM's execution stack to contain either a raw value, like a fixnum, or a reference to a heap-allocated object, making stack operations simple.

Originally I had it so that the storage of conses was handled by holding a pair of pointers to other objects, but I recently switched this so that a cons is now a vector of length two. This means that value types do not need to be boxed in order to be stored in a cons, and it also simplifies the operations on it.

I've got a few compiler-related tasks I'm going to work on now that it's executing, such as handling specific essential syntax forms. I've got a garbage collector to do as well, but I'm not worried about that one as I've already written several, including one that shipped in a console video game.

If you want to follow along, I've got my code stashed on GitHub. It's still pretty rough around the edges though, so beware at this point!

Maybe when (or if!) I finish, it'll actually be under 5000 lines of C :-)

Follow on to part 2: Scheme in 5000 lines of C part 2: (let ((them (eat 'cake))))