Hi JP, Igor,
This is an interesting thread, and great to see it working and some assembly optimisations! It also seems like a good opportunity to bang a little drum to highlight the fact that often writing code in C can give equally good results.
My argument would generally be, if you can get nearly as good results in C, you should write it in C. It is more understandable, maintainable and portable, and much less prone to bugs. The compiler can do all sorts of admin on our behalf (register allocation, function entry, exit, type checking etc), but also apply lots of optimisations. Writing in assembly can often be slower and make you focus on a very small window, so you don't step back nor want to rework it at a higher level. So here is a by-example walkthrough, based on your code.
Here is the C equivalent of the asm routine, but actually written in a more high-level expressive way. Note for example, every loop is expressing the bit extraction as a function of i:
#define L1 0x040000
#define L2 0x100000
#define L3 0x200000
#define L4 0x800000
#define ALLLEDS (L1 | L2 | L3 | L4)

const uint32_t masks[] = {L1, L2, L3, L4};

void binc(int value) {
    LPC_GPIO1->FIOCLR = ALLLEDS;
    value += 1;
    for(int i=0; i<4; i++) {
        if((value >> i) & 1) {
            LPC_GPIO1->FIOSET = masks[i];
        } else {
            LPC_GPIO1->FIOCLR = masks[i];
        }
    }
}
Now, how much do you pay for writing it in C?
Timing Assembly
- 59377 us
Timing C translation
- 56252 us
You get faster code!
But the real advantage of C is how quickly you can optimise. And by this I don't mean doing the compiler's job, I mean expressing the problem in a better algorithmic or functional way.
For example, writing to memory-mapped peripheral registers will be slower than writing to internal core registers. And because those registers are declared volatile, the compiler has to honour every peripheral read and write you express in your code; it can't optimise them away. So why not minimise the writes to the peripheral registers by packaging up the things you want to write:
// creating the mask values before writing them to registers
void binc2(int value) {
    value += 1;
    uint32_t set = 0;
    uint32_t clr = 0;
    for(int i=0; i<4; i++) {
        if((value >> i) & 1) {
            set |= masks[i];   // accumulate the LEDs to switch on
        } else {
            clr |= masks[i];   // accumulate the LEDs to switch off
        }
    }
    LPC_GPIO1->FIOSET = set;
    LPC_GPIO1->FIOCLR = clr;
}
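To make the saving concrete, here is a small host-side sketch (not mbed code) that counts simulated register writes for the two approaches. Everything in it is a stand-in: `fake_pins`, `fake_writes`, `fio_set`, `fio_clr`, `sim_binc`, `sim_binc2` and `writes_for` are hypothetical names, and the batched version assumes the masks are accumulated with `|=`.

```c
#include <assert.h>
#include <stdint.h>

#define L1 0x040000u
#define L2 0x100000u
#define L3 0x200000u
#define L4 0x800000u
#define ALLLEDS (L1 | L2 | L3 | L4)

const uint32_t masks[4] = {L1, L2, L3, L4};

/* Hypothetical stand-ins for the GPIO port, for use on a PC only. */
uint32_t fake_pins;    /* simulated output latch */
unsigned fake_writes;  /* number of simulated register writes */

void fio_set(uint32_t m) { fake_pins |= m;  fake_writes++; }
void fio_clr(uint32_t m) { fake_pins &= ~m; fake_writes++; }

/* One register write per bit, inside the loop. */
void sim_binc(int value) {
    fio_clr(ALLLEDS);
    value += 1;
    for (int i = 0; i < 4; i++) {
        if ((value >> i) & 1) {
            fio_set(masks[i]);
        } else {
            fio_clr(masks[i]);
        }
    }
}

/* Build both masks in ordinary variables, then write each register once. */
void sim_binc2(int value) {
    uint32_t set = 0;
    uint32_t clr = 0;
    value += 1;
    for (int i = 0; i < 4; i++) {
        if ((value >> i) & 1) {
            set |= masks[i];
        } else {
            clr |= masks[i];
        }
    }
    assert((set & clr) == 0);  /* no LED is both set and cleared */
    fio_set(set);
    fio_clr(clr);
}

/* Number of simulated register writes made by one call of f(value). */
unsigned writes_for(void (*f)(int), int value) {
    fake_writes = 0;
    f(value);
    return fake_writes;
}
```

Under these assumptions, `writes_for(sim_binc, v)` is 5 and `writes_for(sim_binc2, v)` is 2 for any `v`, and both leave `fake_pins` in the same state.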
Now we get:
Timing Assembly
- 59377 us
Timing C translation
- 56252 us
Timing C reg-writes outside the loop
- 53126 us
Faster still! Now, a standard trade-off you might make is a space-time trade-off, i.e. I spend code/data space in return for faster execution. In this example, I'll use data space by creating a lookup of my LED translations:
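// masks2[value] is the LED pattern for value + 1, for value = 0..15
const uint32_t masks2[] = {
    L1,
    L2,
    L2 | L1,
    L3,
    L3 | L1,
    L3 | L2,
    L3 | L2 | L1,
    L4,
    L4 | L1,
    L4 | L2,
    L4 | L2 | L1,
    L4 | L3,
    L4 | L3 | L1,
    L4 | L3 | L2,
    L4 | L3 | L2 | L1,
    0                   // value 15: value + 1 is 16, so the low four bits are all clear
};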
// using a full lookup for the masks (space vs time tradeoff)
void binc3(int value) {
    uint32_t set = masks2[value];
    uint32_t clr = ~masks2[value] & ALLLEDS;
    LPC_GPIO1->FIOSET = set;
    LPC_GPIO1->FIOCLR = clr;
}
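A hand-typed table is easy to get subtly wrong (a missing or swapped entry), so one option is to generate it from the per-bit rule and guard the index. This is only a sketch of that idea: `led_table`, `led_mask_for`, `build_led_table`, `lookup_set_mask` and `lookup_clr_mask` are hypothetical names, and it assumes the table should cover every input from 0 to 15, showing value + 1 as in the loop versions.

```c
#include <assert.h>
#include <stdint.h>

#define L1 0x040000u
#define L2 0x100000u
#define L3 0x200000u
#define L4 0x800000u
#define ALLLEDS (L1 | L2 | L3 | L4)

uint32_t led_table[16];

/* LED mask that shows the low four bits of value + 1, computed bit by bit. */
uint32_t led_mask_for(int value) {
    static const uint32_t bit_masks[4] = {L1, L2, L3, L4};
    uint32_t m = 0;
    for (int i = 0; i < 4; i++) {
        if (((value + 1) >> i) & 1) {
            m |= bit_masks[i];
        }
    }
    return m;
}

/* Fill the table once at start-up: costs RAM and a little boot time. */
void build_led_table(void) {
    for (int v = 0; v < 16; v++) {
        led_table[v] = led_mask_for(v);
    }
}

/* Bits to write to the "set" register; the assert catches an out-of-range index. */
uint32_t lookup_set_mask(int value) {
    assert(value >= 0 && value < 16);
    return led_table[value];
}

/* Bits to write to the "clear" register: every LED not in the set mask. */
uint32_t lookup_clr_mask(int value) {
    return ~lookup_set_mask(value) & ALLLEDS;
}
```

The cost is that the table now lives in RAM rather than flash and must be built before first use; a `const` table typed by hand avoids both, at the price of checking it carefully.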
And the results from this?
Timing Assembly
- 59377 us
Timing C translation
- 56252 us
Timing C reg-writes outside the loop
- 53126 us
Timing C with mask lookup table
- 15626 us
So this algorithmic change just bought us a big speedup. There are probably others too, or better ways to do it. In some ways this is moving away from expressing the problem to expressing how to solve it, but it is a classic space-time trade-off, so the code is actually pretty clear.
Finally, one last thing. In the test code, I've been calling it as you have, using i % 16. Whilst I realise this is just a test, it highlights that efficiencies can be gained in lots of places. By changing it to i & 0xF, the compiler doesn't have to do a full modulo calculation:
START("C with mask lookup table");
for(int i=0; i<LOOPS; i++) {
    binc3(i % 16);
}
STOP();

START("C with mask lookup table, but caller loop optimised");
for(int i=0; i<LOOPS; i++) {
    binc3(i & 0xF);
}
STOP();
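One likely reason the two forms compile differently is that `i` is a signed int. In C99 and later, `%` truncates toward zero, so `i % 16` can be negative, while `i & 0xF` is always in the range 0 to 15. The compiler cannot simply swap one for the other unless it can prove `i` is never negative. The sketch below, with hypothetical helper names, shows where the two agree and where they differ (assuming a two's complement target, which covers ARM).

```c
#include <assert.h>

/* Signed remainder: the result takes the sign of i, e.g. -1 % 16 is -1. */
int wrap_mod(int i) { return i % 16; }

/* Bit mask: keeps the low four bits, so the result is always 0..15. */
int wrap_and(int i) { return i & 0xF; }

/* Unsigned remainder by a power of two: equivalent to a plain mask. */
unsigned wrap_mod_unsigned(unsigned i) { return i % 16u; }

/* Checks that the two signed forms agree for every i in 0..limit-1. */
void check_nonnegative(int limit) {
    for (int i = 0; i < limit; i++) {
        assert(wrap_mod(i) == wrap_and(i));
    }
}
```

For a loop counter that only counts up from zero the two give the same index, so the swap is safe here; declaring the counter as `unsigned` is another way to let the compiler make the same simplification itself.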
So here are the final results:
Timing Assembly
- 59377 us
Timing C translation
- 56252 us
Timing C reg-writes outside the loop
- 53126 us
Timing C with mask lookup table
- 15626 us
Timing C with mask lookup table, but caller loop optimised
- 12501 us
So that shows roughly a 4.7x speedup over the assembly version (59377 us down to 12501 us). And in C :) For reference, the whole code is here:
Examples of C vs ASM optimisation
This is not really meant to be specific to your example (some of what I've done may not be applicable to your system/constraints), and in fact the code may not even be quite equivalent or bug-free, but it hopefully highlights that a compiler does a very good job. It takes care of a load of the housework as well as the optimisations, which means you can spend more time thinking about how to map a problem.
Hope that was useful!
Simon
Hello everyone! I'm currently working on an assembly procedure that changes the LEDs on the mbed according to the r0 parameter, and I almost have it working. The only hiccup is that if I call the procedure from my main.cpp inside a loop, it doesn't work: it calls the procedure once and then escapes the loop.
Here is the code...
fb_ch_color.s
main.cpp
Any help would be greatly appreciated!