More Optimal Standard Library
This article expect you have some knowledge of C and know little to nothing about Bolin.
This example is simple. We want to read a file line by line and convert each to an integer. To prevent the optimizer from removing all of our code we'll print the last digit of every number
Here's a one liner to create our input file
seq 10000000 > /tmp/numbersHere's the c source. Bolin is twice as fast for this example. Why might this be?
#include <stdio.h> #include <stdlib.h> int main(int argc, char *argv[]) { int arraySize=4096; char*array = (char*)realloc(0, arraySize); FILE*f = fopen("/tmp/numbers", "rb"); if (f == 0) { return EXIT_FAILURE; } char buf[32]; int i=0; for(; ; i++) { if (fgets(buf, sizeof(buf), f) == 0) { break; } if (i+1 == arraySize) { arraySize *= 2; array = (char*)realloc(array, arraySize); } array[i] = (atoi(buf) & 7) + '0'; } array[i++] = '\n'; fwrite(array, 1, i, stdout); }Here's the Bolin source to compare with.
main() { array := $u8[] r := OpenFile("/tmp/numbers") index := 0 while !r.IsEOF() { if r.BufferLine() continue array.Push(r.Int() & 7 + '0') r.ExpectEOL() index++ } if r.HasError { print("Invalid file") return } print(array) }Here's a C++ version. It's significantly slower so we'll compare the C code.
Here's how long they take using -O3, -O2 preformed nearly identical. 10 million is a very large loop. 1 cycle slower will show up as >1ms. Divide the time by 10 or 100 for a more realistic number.
$ gcc -O3 atoi_array_v1.c $ time ./a.out | sha1sum 7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306 - real 0m0.298s user 0m0.279s sys 0m0.022s $ clang -O3 atoi_array_v1.c $ time ./a.out | sha1sum 7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306 - real 0m0.294s user 0m0.283s sys 0m0.014s $ bolin -O3 atoi_array_v1.bolin $ time ./a.out | sha1sum 7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306 - real 0m0.143s user 0m0.136s sys 0m0.010s
The first problem is atoi. It has a body in the headers so it is inlinable however it's a wrapper function that calls strtol, which does not have a body and isn't inlinable. Using objdump you can see strtol being called in main. Compiling with -flto doesn't inline it.
Let's try using our own atoi implementation.
#include <stdio.h> #include <stdlib.h> typedef unsigned long long u64; //returns the amount of characters or -1 for error. //Digits only, this does not handle - or + at the start int ParseInt(const char*sz, u64*out) { if (!sz) return -1; u64 val=0; //overflow is legal with unsigned int i=0; while (1) { char c = sz[i]; if ((unsigned)(c-'0') > 9u) break; u64 prevVal = val; val = val*10+c-'0'; if (val <= prevVal && val != 0) { return -1; } i++; } if (i == 0) return -1; *out = val; return i; } int main(int argc, char *argv[]) { int arraySize=4096; char*array = (char*)realloc(0, arraySize); FILE*f = fopen("/tmp/numbers", "rb"); if (f == 0) { return EXIT_FAILURE; } char buf[32]; int i=0; for(; ; i++) { if (fgets(buf, sizeof(buf), f) == 0) { break; } if (i+1 == arraySize) { arraySize *= 2; array = (char*)realloc(array, arraySize); } //array[i] = (atoi(buf) & 7) + '0'; u64 value; ParseInt(buf, &value); array[i] = (value & 7) + '0'; } array[i++] = '\n'; fwrite(array, 1, i, stdout); }
How much faster is this?
$ time ./a.out | sha1sum 7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306 - real 0m0.190s user 0m0.182s sys 0m0.010sThat's a bit over 100ms faster. What's causing the 50ms difference between C and Bolin? The second problem is fgets. fgets and Bolin's BufferLine both have an internal buffer and both call the OS read function many times to fill it. The difference is the fgets function copies the line into your buffer while Bolin stream reader doesn't. You can ask the reader to return a mutable slice, or you can like in this example; ask for an integer which the stream reader will return and increment the internal position. You can see how much time fgets and BufferLine take in the linked examples
As mentioned earlier 10 million is not a realistic number and being 1 cycle slower will show up as 1 or more milliseconds. Most of the time these two problems won't make a difference. However, if you are in a hot loop the above is likely to improve preformance.