Home / Highlights / FAQ / Examples / Quick Start / Download Forum / Register / Login

More Optimal Standard Library

This article expect you have some knowledge of C and know little to nothing about Bolin.

This example is simple. We want to read a file line by line and convert each to an integer. To prevent the optimizer from removing all of our code we'll print the last digit of every number

Here's a one liner to create our input file

echo 'int main() { for (int i=0; i<10000000; i++) { printf("%d\n", i); } }' | clang -x c - && ./a.out > /tmp/numbers
Here's the c source. Bolin is twice as fast for this example. Why might this be?
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int arraySize=4096;
	char*array = (char*)realloc(0, arraySize);

	FILE*f = fopen("/tmp/numbers", "rb");
	if (f == 0) {
		return EXIT_FAILURE;
	}

	char buf[32];
	int i=0;
	for(; ; i++)
	{
		if (fgets(buf, sizeof(buf), f) == 0) {
			break;
		}
		if (i+1 == arraySize) {
			arraySize *= 2;
			array = (char*)realloc(array, arraySize);
		}
		array[i] = (atoi(buf) & 7) + '0';
	}
	array[i++] = '\n';
	fwrite(array, 1, i, stdout);
}
Here's the Bolin source to compare with.
main()
{
	array := $u8[]
	r := OpenFile("/tmp/numbers")
	index := 0
	while !r.IsEOF()
	{
		if r.BufferLine()
			continue
		array.Push(r.Int() & 7 + '0')
		r.ExpectEOL()
		index++
	}
	if r.HasError {
		print("Invalid file")
		return
	}
	print(array)
}
Here's a C++ version. It's significantly slower so we'll compare the C code.

Here's how long they take using -O3, -O2 preformed nearly identical. 10 million is a very large loop. 1 cycle slower will show up as >1ms. Divide the time by 10 or 100 for a more realistic number.

$ gcc -O3 atoi_array_v1.c
$ time ./a.out | sha1sum 
9fb65214d11697bf66753db443edb017b7f08fdf  -

real	0m0.298s
user	0m0.279s
sys 	0m0.022s

$ clang -O3 atoi_array_v1.c
$ time ./a.out | sha1sum 
9fb65214d11697bf66753db443edb017b7f08fdf  -

real	0m0.294s
user	0m0.283s
sys 	0m0.014s

$ bolin -O3 atoi_array_v1.bolin
$ time ./a.out | sha1sum 
9fb65214d11697bf66753db443edb017b7f08fdf  -

real	0m0.143s
user	0m0.136s
sys 	0m0.010s

The first problem is atoi. It has a body in the headers so it is inlinable however it's a wrapper function that calls strtol, which does not have a body and isn't inlinable. Using objdump you can see strtol being called in main. Compiling with -flto doesn't inline it.

Let's try using our own atoi implementation.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

//returns the amount of characters or -1 for error. 
//Digits only, this does not handle - or + at the start
int ParseInt(const char*sz, u64*out)
{
	if (!sz) return -1;
	u64 val=0; //overflow is legal with unsigned
	int i=0;
	while (1)
	{
		char c = sz[i];
		if ((unsigned)(c-'0') > 9u) break;
		u64 prevVal = val;
		val = val*10+c-'0';
		if (val <= prevVal && val != 0) {
			return -1;
		}
		i++;
	}
	if (i == 0)
		return -1;
	*out = val;
	return i;
}


int main(int argc, char *argv[])
{
	int arraySize=4096;
	char*array = (char*)realloc(0, arraySize);

	FILE*f = fopen("/tmp/numbers", "rb");
	if (f == 0) {
		return EXIT_FAILURE;
	}

	char buf[32];
	int i=0;
	for(; ; i++)
	{
		if (fgets(buf, sizeof(buf), f) == 0) {
			break;
		}
		if (i+1 == arraySize) {
			arraySize *= 2;
			array = (char*)realloc(array, arraySize);
		}
		//array[i] = (atoi(buf) & 7) + '0';
		u64 value;
		ParseInt(buf, &value);
		array[i] = (value & 7) + '0';
	}
	array[i++] = '\n';
	fwrite(array, 1, i, stdout);
}

How much faster is this?

$ time ./a.out | sha1sum 
9fb65214d11697bf66753db443edb017b7f08fdf  -

real	0m0.190s
user	0m0.182s
sys 	0m0.010s
That's a bit over 100ms faster. What's causing the 50ms difference between C and Bolin? The second problem is fgets. fgets and Bolin's BufferLine both have an internal buffer and both call the OS read function many times to fill it. The difference is the fgets function copies the line into your buffer while Bolin stream reader doesn't. You can ask the reader to return a mutable slice, or you can like in this example; ask for an integer which the stream reader will return and increment the internal position. You can see how much time fgets and BufferLine take in the linked examples

As mentioned earlier 10 million is not a realistic number and being 1 cycle slower will show up as 1 or more milliseconds. Most of the time these two problems won't make a difference. However, if you are in a hot loop the above is likely to improve preformance.