More Optimal Standard Library

This article expect you have some knowledge of C and know little to nothing about Bolin.

This example is simple. We want to read a file line by line and convert each to an integer. To prevent the optimizer from removing all of our code we'll print the last digit of every number

Here's a one liner to create our input file

seq 10000000 > /tmp/numbers

Here's the c source. Bolin is twice as fast for this example. Why might this be?

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int arraySize=4096;
	char*array = (char*)realloc(0, arraySize);

	FILE*f = fopen("/tmp/numbers", "rb");
	if (f == 0) {
		return EXIT_FAILURE;
	}

	char buf[32];
	int i=0;
	for(; ; i++)
	{
		if (fgets(buf, sizeof(buf), f) == 0) {
			break;
		}
		if (i+1 == arraySize) {
			arraySize *= 2;
			array = (char*)realloc(array, arraySize);
		}
		array[i] = (atoi(buf) & 7) + '0';
	}
	array[i++] = '\n';
	fwrite(array, 1, i, stdout);
}

Here's the Bolin source to compare with.

main()
{
	array := $u8[]
	r := OpenFile("/tmp/numbers")
	index := 0
	while !r.IsEOF()
	{
		if r.BufferLine()
			continue
		array.Push(r.Int() & 7 + '0')
		r.ExpectEOL()
		index++
	}
	if r.HasError {
		print("Invalid file")
		return
	}
	print(array)
}

Here's a C++ version. It's significantly slower so we'll compare the C code.

Here's how long they take using -O3, -O2 preformed nearly identical. 10 million is a very large loop. 1 cycle slower will show up as >1ms. Divide the time by 10 or 100 for a more realistic number.

$ gcc -O3 atoi_array_v1.c
$ time ./a.out | sha1sum 
7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306  -

real	0m0.298s
user	0m0.279s
sys 	0m0.022s

$ clang -O3 atoi_array_v1.c
$ time ./a.out | sha1sum 
7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306  -

real	0m0.294s
user	0m0.283s
sys 	0m0.014s

$ bolin -O3 atoi_array_v1.bolin
$ time ./a.out | sha1sum 
7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306  -

real	0m0.143s
user	0m0.136s
sys 	0m0.010s

The first problem is atoi. It has a body in the headers so it is inlinable however it's a wrapper function that calls strtol, which does not have a body and isn't inlinable. Using objdump you can see strtol being called in main. Compiling with -flto doesn't inline it.

Let's try using our own atoi implementation.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

//returns the amount of characters or -1 for error. 
//Digits only, this does not handle - or + at the start
int ParseInt(const char*sz, u64*out)
{
	if (!sz) return -1;
	u64 val=0; //overflow is legal with unsigned
	int i=0;
	while (1)
	{
		char c = sz[i];
		if ((unsigned)(c-'0') > 9u) break;
		u64 prevVal = val;
		val = val*10+c-'0';
		if (val <= prevVal && val != 0) {
			return -1;
		}
		i++;
	}
	if (i == 0)
		return -1;
	*out = val;
	return i;
}


int main(int argc, char *argv[])
{
	int arraySize=4096;
	char*array = (char*)realloc(0, arraySize);

	FILE*f = fopen("/tmp/numbers", "rb");
	if (f == 0) {
		return EXIT_FAILURE;
	}

	char buf[32];
	int i=0;
	for(; ; i++)
	{
		if (fgets(buf, sizeof(buf), f) == 0) {
			break;
		}
		if (i+1 == arraySize) {
			arraySize *= 2;
			array = (char*)realloc(array, arraySize);
		}
		//array[i] = (atoi(buf) & 7) + '0';
		u64 value;
		ParseInt(buf, &value);
		array[i] = (value & 7) + '0';
	}
	array[i++] = '\n';
	fwrite(array, 1, i, stdout);
}

How much faster is this?

$ time ./a.out | sha1sum 
7bdc44e9fb6f7ec60b3b9d5117ac6052b5eb9306  -

real	0m0.190s
user	0m0.182s
sys 	0m0.010s

That's a bit over 100ms faster. What's causing the 50ms difference between C and Bolin? The second problem is fgets. fgets and Bolin's BufferLine both have an internal buffer and both call the OS read function many times to fill it. The difference is the fgets function copies the line into your buffer while Bolin stream reader doesn't. You can ask the reader to return a mutable slice, or you can like in this example; ask for an integer which the stream reader will return and increment the internal position. You can see how much time fgets and BufferLine take in the linked examples

As mentioned earlier 10 million is not a realistic number and being 1 cycle slower will show up as 1 or more milliseconds. Most of the time these two problems won't make a difference. However, if you are in a hot loop the above is likely to improve preformance.