So I timed a short program with /Gd and /Gr

StopWatch timings
Image by Michal Jarmoluk from Pixabay

This is the follow-up to yesterday’s post, which asked whether changing the function calling convention, from passing parameters on the stack to passing them in registers, makes a difference in execution time.

This was the program I used.

#include <stdio.h>
#include "hr_time.h"

int add(int a, int b, int c,int d,int e) {
	return a - b * 2 + c * 3 + d * 3 + e * 5;
}

int __cdecl main() {
	int total=0;
	stopWatch s;
	startTimer(&s);
	for (int i = 0; i < 10000000; i++) {
		total += add(i, 5, 6, i, 8);
	}
	stopTimer(&s);
	printf("Value = %d Time = %7.5f\n", total, getElapsedTime(&s));
}

It’s pretty similar to the one I did yesterday, except with two more parameters in the add function and my Windows high-res timing code. I’ve extracted the two timing files (hr_time.h/.c) from the asteroids code and they’re in the LearnC folder on GitHub.

As before, this was compiled as x86. I tried it first as a release build. This means the optimizing compiler has its way, and I got virtually identical times for cdecl (/Gd), fastcall (/Gr) and even stdcall (/Gz).

Disassembly of the machine code revealed that the optimizer had inlined the function body into the for loop, which removed the call overhead altogether. So I did it again in debug mode. Here there was a clear difference. The time for fastcall was 0.259 seconds while cdecl (the default) was 0.239 seconds, so fastcall was about 8% slower. Stdcall took roughly the same time as cdecl. So the lesson seems to be don’t use fastcall, at least not in a debug build.

I think I need a more complicated program, one that can be compiled in release mode without the optimizer turning the function into inline code. Perhaps making the function longer would do it, so that it is too big for the compiler to consider inlining.

Interestingly, the release build ran in 0.005557 seconds, roughly 43 to 47 times faster than the debug times.