
How Does Bolin Compile and How Fast Is It?

Numbers are at the end.

While designing Bolin, a goal has been to create a language an expert would want to use. Experts tend to work on complex projects with millions of lines of code. It's no fun changing a header and waiting 15 minutes for all the affected files to rebuild, and a full rebuild is even less fun. Because of this we focused on fast compilation. Compiling in less than 200ms feels instantaneous, and compiling in less than 1 second should keep users from checking messages and losing their train of thought.

We had no hard numbers in mind. All we knew is we'd need multiple cores, meaning multiple threads. Here's a very high-level overview of how Bolin compiles.

At launch, we check the command line and launch a thread for every file. Limiting the thread count to the number of hardware threads would be a good idea, but we haven't done that yet since we haven't optimized much.

Every thread does the following:

1. Open/read a file, then lex it. The lexing process converts ">>" into TOKEN_RIGHT_SHIFT so the next step doesn't need to figure out whether two '>' characters are two greater-than signs or a right shift.
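A minimal sketch of the ">>" disambiguation step 1 describes, in Python. The lexer applies "maximal munch": ">>" becomes one TOKEN_RIGHT_SHIFT up front, so later stages never have to merge two '>' tokens. The token names and this toy structure are illustrative, not Bolin's internals.

```python
def lex(src: str) -> list[str]:
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif src.startswith(">>", i):        # longest match wins
            tokens.append("TOKEN_RIGHT_SHIFT")
            i += 2
        elif ch == ">":
            tokens.append("TOKEN_GREATER")
            i += 1
        else:  # everything else is a one-character token in this toy lexer
            tokens.append(ch)
            i += 1
    return tokens

print(lex("a >> 2"))   # ['a', 'TOKEN_RIGHT_SHIFT', '2']
print(lex("a > > 2"))  # ['a', 'TOKEN_GREATER', 'TOKEN_GREATER', '2']
```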

2. Build an abstract syntax tree. For those who are unfamiliar, this step is what gives code its shape. For example, "main() { if false {} nine=9 }" creates a function node for "main", marks the parameter list as empty, and puts an if node in the body with a variable node after it. The false and curly braces are part of the if node, and the 9 is part of the variable node.
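A hand-built sketch of the tree step 2 would produce for that example. The node classes (Function, If, Var) are hypothetical stand-ins, not Bolin's actual internal types.

```python
from dataclasses import dataclass, field

@dataclass
class If:
    cond: object
    body: list = field(default_factory=list)

@dataclass
class Var:
    name: str
    value: object

@dataclass
class Function:
    name: str
    params: list
    body: list

# "main() { if false {} nine=9 }"
main = Function(
    name="main",
    params=[],                    # "()" -> empty parameter list
    body=[
        If(cond=False, body=[]),  # "if false {}" -> condition plus empty block
        Var("nine", 9),           # "nine=9" -> variable node holding the 9
    ],
)
print(main.name, len(main.body))  # main 2
```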

3. The naming pass. Some compilers do this during the previous step, but we don't. In the naming pass we collect the names of everything: functions, types, global constants, etc.

4. The typing pass. It figures out the types of functions, the types of variables (globals and struct fields), the size of each struct, the values in enums, etc. For the most part this handles everything outside of function bodies.
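A rough sketch of one job of the typing pass: computing a struct's size and alignment from its fields. The layout rules below are the typical C ones (each field padded to its own alignment, total size rounded up to the largest alignment); Bolin's actual rules may differ.

```python
PRIMITIVES = {"i8": (1, 1), "i32": (4, 4), "i64": (8, 8)}  # name -> (size, align)

def layout(fields: list[tuple[str, str]]) -> tuple[int, int]:
    offset, max_align = 0, 1
    for _name, ty in fields:
        size, align = PRIMITIVES[ty]
        offset = (offset + align - 1) // align * align  # pad to field alignment
        offset += size
        max_align = max(max_align, align)
    # round total size up so arrays of the struct stay aligned
    return ((offset + max_align - 1) // max_align * max_align, max_align)

# struct { flag i8; count i32; total i64 }
print(layout([("flag", "i8"), ("count", "i32"), ("total", "i64")]))  # (16, 8)
```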

At the end of this pass the thread joins the main thread. The main thread repeats the pass if any type is incomplete. This is usually the case when file A has a type that depends on a type in file B, which in turn depends on a type in file A.

There's more that could be multithreaded, but a profiler showed 79% of the work (not counting LLVM's part) is already done in threads, so for now we stopped here.

The next pass is the body pass. At this point we know the size, alignment and methods of each type. The compiler has enough knowledge to figure out if variables should be passed by value or by reference, if it should put a variable (or array) on the stack or on the heap. Here is where the compiler starts processing the function body.
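A hypothetical sketch of the by-value versus by-reference decision the body pass can now make: once a type's size is known, small values get copied and large ones get passed by pointer. The two-word threshold below is a common ABI convention, not necessarily Bolin's rule.

```python
WORD = 8  # bytes, on a 64-bit target

def pass_by_value(type_size: int) -> bool:
    # fits in two registers -> cheap to copy; otherwise pass a pointer
    return type_size <= 2 * WORD

print(pass_by_value(8))   # True: an 8-byte value is copied
print(pass_by_value(64))  # False: a 64-byte struct goes by reference
```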

This pass generates a very simple intermediate representation for each function. The IR isn't bytecode: it contains many pointers, and it's neither flat nor a graph. It also isn't language-neutral in the classic sense. Here's how some of its instructions get used:

When the pass sees a while node, it emits a loop instruction that takes the loop conditional and the loop body. When it sees a for-to node, it uses the create-variable instruction to make a loop counter, generates the conditional, generates the body, appends an increment to the end of the body, then emits a loop instruction with the conditional and body it just built. Because the syntax is broken down this way, later stages have fewer cases to deal with. During this pass strings, arrays and other values may be added to a list of constants.
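The for-to lowering above can be sketched with strings standing in for IR instructions. The instruction names (create_var, loop, add, lt) are illustrative, not Bolin's actual IR.

```python
def lower_for_to(counter: str, limit: int, body: list[str]) -> list[str]:
    ir = [f"create_var {counter} = 0"]           # the loop counter
    cond = f"lt {counter}, {limit}"              # the loop conditional
    lowered_body = body + [f"add {counter}, 1"]  # increment at the end of the body
    ir.append(f"loop ({cond}) {{ {'; '.join(lowered_body)} }}")
    return ir

# a "for i to 10" loop around a single print becomes:
for inst in lower_for_to("i", 10, ["print i"]):
    print(inst)
```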

Finally, the code generation pass. This is where we convert our IR into LLVM IR. We plan on having multiple backends and didn't want to require LLVM or clang to be installed, so we don't use LLVM's API directly. Because we didn't want to dynamically link LLVM or deal with its API, we generate the LLVM IR in text form. We then launch clang (not a typo) and ask it to compile the LLVM IR and link in our runtime object file. clang compiles this into an executable.
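The shape of that final step: emit LLVM IR as plain text, then hand the .ll file to clang, which accepts it directly with no LLVM API linkage. The IR below is a valid "exit with 42" program; the runtime-object link step is omitted, and this is a sketch of the approach, not Bolin's code.

```python
import os
import shutil
import subprocess
import tempfile

def emit_ir() -> str:
    # textual LLVM IR for a program whose exit code is 42
    return "define i32 @main() {\n  ret i32 42\n}\n"

ir = emit_ir()
if shutil.which("clang"):  # only try to compile when clang is on PATH
    with tempfile.TemporaryDirectory() as tmp:
        ll = os.path.join(tmp, "out.ll")
        with open(ll, "w") as f:
            f.write(ir)
        # clang treats .ll input as LLVM IR and produces an executable
        subprocess.run(["clang", ll, "-o", os.path.join(tmp, "a.out")], check=True)
```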

The Numbers

We tried C, Go and Bolin. We don't like using lines of code as a measurement: people copy/paste code until it reaches the line count they want, and many lines are empty or contain only whitespace and curly braces. Our measurement is 1 million array assigns. People have questioned whether array bounds checking would be slow, so array assigns are what we measure. We use this file to generate the code in C, Go and Bolin. These numbers are from a Ryzen 5 3600, which has 6 cores and 12 threads.
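A hypothetical stand-in for the generator, to show the shape of the benchmark: spread N array assigns across M files. We don't know the real gen.py's output; only the C flavor is sketched here, and the names are made up.

```python
def gen_c(total_assigns: int, files: int) -> list[str]:
    """Return one C source string per file, each full of array assigns."""
    per_file = total_assigns // files
    sources = []
    for n in range(files):
        lines = [f"int arr{n}[{per_file}];", f"void fill{n}(void) {{"]
        lines += [f"    arr{n}[{i}] = {i};" for i in range(per_file)]
        lines.append("}")
        sources.append("\n".join(lines))
    return sources

# 24 files of ~41,666 assigns each, matching the runs below
sources = gen_c(1_000_000, 24)
print(len(sources), sources[0].count("] ="))  # 24 41666
```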

First up, C using clang 14.0.6

$ python ./gen.py c 1000000 24
$ time clang file*.c

real	0m12.217s
user	0m11.800s
sys 	0m0.394s
That's much faster than I was expecting. 12 seconds for 1 million array assigns. Perhaps long compile times are due to templates.

People say Go has the fastest compiler. Go version 1.19

$ python ./gen.py go 1000000 24
$ time go build file*.go

real	0m11.537s
user	0m27.438s
sys 	0m1.093s
Strange, it's nearly identical: only 1 second faster despite being multithreaded, while clang uses one thread. Watching htop, all of Go's threads appeared idle for a few seconds, then suddenly used all cores. A million assigns in under 12 seconds is quite good.

Finally, Bolin. As a quick recap: Bolin does ~80% of its work in threads, generates LLVM IR in text format, then launches clang, which has to parse that text and compile it into an executable.

$ python ./gen.py bolin 1000000 24
$ time bolin file*.bolin

real	0m8.645s
user	0m9.892s
sys 	0m0.507s
Well, that's surprising. We have no idea why it's faster. We compiled the C source into LLVM IR using `clang -S -emit-llvm`, then compiled that, wondering if it would get a speed boost. The boost appears to be < .5 seconds over using the C files directly. Since -g isn't specified there's no debug data in that IR; Bolin always includes it, so Bolin is generating more text data. `ls -lhr` shows clang's and Bolin's binaries are both 9.7M (the Go binary is 15M). No one on the team knows clang's internals, so we can't speculate on why Bolin's IR managed to compile a few seconds faster.

A thought...

How fast would compilation be if we went straight to binary? That would be a lot of work. We had planned on generating C headers so people can easily call our functions from C code. What if we got started on that and built enough of a C backend to run this test? If you downloaded the compiler and noticed tcc_static.o, you may have guessed that's exactly what we did.

$ time bolin file*.bolin -tcc

real	0m1.192s
user	0m2.504s
sys 	0m0.383s
That's 838K array assigns per second.

I wondered what would happen on a machine with faster RAM and a larger cache. I asked around and a friend said he has an AMD Ryzen 9 5950X, which has 16 cores and 32 threads. His numbers are

real	0m0.986s
user	0m1.915s
sys 	0m0.371s
That's over a million per second.
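The arithmetic behind both rates is just the 1,000,000 assigns divided by wall-clock (real) time:

```python
assigns = 1_000_000
print(round(assigns / 1.192))  # Ryzen 5 3600 with -tcc: ~838K per second
print(round(assigns / 0.986))  # Ryzen 9 5950X: just over a million per second
```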