Assembly Procedure that changes LEDs according to int value. Almost works.

# 27 Jan 2011

Hello everyone! I'm currently working on an assembly procedure that changes the LEDs on the mbed according to the r0 parameter and I almost have it working. The only hiccup is... If i call the procedure from my main.cpp inside a loop it doesn't work. It calls the procedure once and it escapes the loop.

Here is the code...

fb_ch_color.s

 AREA    |.text|, CODE, READONLY
 EXPORT  fb_ch_color

fb_ch_color PROC
    
    MOV R1, #0
    MOV R2, #0
    MOV R3, #0
    MOV R4, #0
    
    ; Load GPIO Port 1 base address in register R1 
    LDR     R1, =0x2009C020 ; 0x2009C020 = GPIO port 1 base address
    LDR     R4, =leds

    ;CLEAR LEDS
    MOV.W   R2, #0xB40000
    STR   R2, [R1,#0x1C]
    
    ADD R3, R3, #12 ; COUNTER FOR LOOP
    ADD R0, R0, #1  ; INCREMENT PARAMETER BY 1
    
loop
    ; CLEAR THE CARRY FLAG
    ADC R0, R0, #0

    LDR.W   R2, [R4, R3]
    LSRS R0, R0, #1  

    STRCC   R2, [R1,#0x1C]  ; if==0, clear LED bit
    STRCS   R2, [R1,#0x18]  ; if==1, set LED bit    

    CMP R3, #0
    SUB R3, R3, #4
    BNE loop
    
    BX      LR
    ENDP

 AREA |.data|, DATA, READWRITE
 EXPORT leds
         
;BIT MASK FOR LED1...LED4

leds DCD 0x040000 
     DCD 0x100000
     DCD 0x200000
     DCD 0x800000
    
    ALIGN
    END

main.cpp

#include "mbed.h"
extern "C" int fb_ch_color(int value);

DigitalOut myled1(LED1);
DigitalOut myled2(LED2);
DigitalOut myled3(LED3);
DigitalOut myled4(LED4);

Timer t;
Serial pc(USBTX, USBRX); // tx, rx

int main() {
   
 int i = 1; 
 // THE FOLLOWING 3 FUNCTION CALLS WORK PERFECTLY
 fb_ch_color(15);
 wait(1);
  
 fb_ch_color(10);
 wait(1);
  
 fb_ch_color(5);
 wait(1);

 // THE FUNCTION WOULD ONLY BE CALLED ONCE
 for (int i = 1; i < 16; i++) {
     fb_ch_color(i);
     wait(1);
 }
 
}

Any help would be greatly appreciated!

Andy K

# 27 Jan 2011

Just a guess but what about making i in for (int i = 1; ... volatile to make sure the compiler doesn't assign it to a register?

[Edit: always worth publishing a program and referencing it in a forum posts for others to try. I tried cut'n'paste your code from this post and it doesn't compile.]]

Simon Ford

# 27 Jan 2011

Hi J.P,

I think your problem is actually just you are smashing R4. The call standard for ARM says that R0-R3 are the argument registers (i.e. it is up to the caller to save them if it wants to preserve the values), whilst R4 is a variable register, so it is up to the callee (your function) to maintain the value across the call.

So, as a simple proof point, adding PUSH {R4} just after entering your function, and POP {R4} just before returning will likely solve your problem.

Of course, there are other things you could think about too:

Could the implementation be more optimal
You are really meant to keep the stack 8-byte aligned, but as you are a leaf function (you don't call any further functions) you can get away with just pushing 4-bytes to the stack. As a note, one trick to do this would be to push/pop a register you don't care about, and is a trick a compiler would often use (but can be confusing!)

Lots more in the ARM Procedure Call Standard if you get really interested, although there are probably better tutorials out there.

Hope that helps,

Simon

Igor Skochinsky

# 27 Jan 2011

You can kill two birds with one stone: in prologue add PUSH {R4, LR} and in epilogue POP {R4, PC} (instead of BX LR). This way you can save one instruction and keep the stack aligned :)

Another thing you could do is use one base register and load with auto-decrement:

LDR R4, =(leds+12) ; point to the last entry in array
...
LDR R2, [R4], #-4 ; load from address in R4 and decrement it by 4
...

J.P. Armstrong

# 28 Jan 2011

Thank you for the help! I was able to get it working using the PUSH POP instructions you all suggested.

Igor, see where your going with the LDR R2, [R4], #-4. My question is.. I'm currently using CMP R3, #0 to exit the loop. Is there an equivalent to CMP R4, =leds ? Or is there another way of exiting it.

Here is the published code: http://mbed.org/users/jp/programs/IntToBinaryAssembly/llpoor

Again thank you for your help!!

this will be a implemented in a video game console based on the mbed. http://www.mbedgc.com

Igor Skochinsky

# 27 Jan 2011

You could load leds into another register and compare... but I think just counting down to 0 is just as fine. Here's a listing with some other changes:

 AREA |.data|, DATA, READWRITE
 EXPORT leds
         
;BIT MASKS FOR LED1...LED4

LED1 EQU 0x040000 
LED2 EQU 0x100000
LED3 EQU 0x200000
LED4 EQU 0x800000
ALLLEDS EQU LED1 :OR: LED2 :OR: LED3 :OR: LED4

leds DCD LED1, LED2, LED3, LED4
    
    ALIGN

 AREA    |.text|, CODE, READONLY
 EXPORT  binasm

GPIO1BASE  EQU 0x2009C020; GPIO port 1 base address
GPIOSETOFF EQU 0x18
GPIOCLROFF EQU 0x1C

binasm PROC
    PUSH {R4, LR}
    
    ; Load GPIO Port 1 base address in register R1 
    LDR     R1, =GPIO1BASE
    LDR     R4, =(leds+12) ; point to the last element of array

    ;CLEAR LEDS
    MOV   R2, #ALLLEDS
    STR   R2, [R1,#GPIOCLROFF]
    
    MOV R3, #4      ; COUNTER FOR LOOP
    ADD R0, R0, #1  ; INCREMENT PARAMETER BY 1
    
loop
    LDR  R2, [R4], #-4 ; load value from the address in R4 and decrement it by 4
    LSRS R0, R0, #1    ; shift R0 right and set carry to the shifted out bit

    STRCC   R2, [R1,#GPIOCLROFF]  ; if==0, clear LED bit
    STRCS   R2, [R1,#GPIOSETOFF]  ; if==1, set LED bit    

    SUBS R3, R3, #1
    BNE loop
    
    POP {R4, PC}
    ENDP
    
    ALIGN

    END

J.P. Armstrong

# 28 Jan 2011

Thank you Igor! Your changes have made it more efficient.

I ran the both procedures 10000 times in a loop.

Original took 6876 usec
Igor's changes 5626 usec

With what we're doing every nanosec counts.

Updated: http://mbed.org/users/jp/programs/IntToBinaryAssembly/llpoor

Simon Ford

# 28 Jan 2011

Hi JP, Igor,

This is an interesting thread, and great to see it working and some assembly optimisations! It also seems like a good opportunity to bang a little drum to highlight the fact that often writing code in C can give equally good results.

My argument would generally be, if you can get nearly as good results in C, you should write it in C. It is more understandable, maintainable and portable, and much less prone to bugs. The compiler can do all sorts of admin on our behalf (register allocation, function entry, exit, type checking etc), but also apply lots of optimisations. Writing in assembly can often be slower and make you focus on a very small window, so you don't step back nor want to rework it at a higher level. So here is a by-example walkthrough, based on your code.

Here is the C equivalent of the asm routine, but actually written in a more high-level expressive way. Note for example, every loop is expressing the bit extraction as a function of i:

#define L1 0x040000 
#define L2 0x100000
#define L3 0x200000
#define L4 0x800000
#define ALLLEDS (L1 | L2 | L3 | L4)

const uint32_t masks[] = {L1, L2, L3, L4};

void binc(int value) {
    LPC_GPIO1->FIOCLR = ALLLEDS;
    value += 1;
    for(int i=0; i<4; i++) {
        if((value >> i) & 1) {
            LPC_GPIO1->FIOSET = masks[i];
        } else {
            LPC_GPIO1->FIOCLR = masks[i];
        }
    }         
}

Now, how much do you pay for writing it in C?

Timing Assembly
 - 59377 us
Timing C translation
 - 56252 us

You get faster code!

But the real advantage of C is how quickly you can optimise. And by this I don't mean doing the compilers job, I mean expressing the problem in a better algorithmic or functional way.

For example, writing to peripheral memory-map registers will be slower than internal core registers. But the compiler has to honour every peripheral read and write you express in your code; it can't optimise them away. So why not minimise the writes to the peripheral registers by packaging up the things you want to write:

// creating the mask values before writing them to registers
void binc2(int value) {
    value += 1;
    uint32_t set = 0;
    uint32_t clr = 0;
    
    for(int i=0; i<4; i++) {
        if((value >> i) & 1) {
            set = masks[i];
        } else {
            clr = masks[i];
        }
    }         
    LPC_GPIO1->FIOSET = set;
    LPC_GPIO1->FIOCLR = clr;
}

Now we get:

Timing Assembly
 - 59377 us
Timing C translation
 - 56252 us
Timing C reg-writes outside the loop
 - 53126 us

Faster still! Now, a standard trade-off you might make is a space-time trade-off. i.e. I spend code/data space in return for faster execution. In this example, i'll use data space by creating a lookup of my LED translations:

const uint32_t masks2[] = {
    L1,
    L2,
    L2 | L1,
    L3,
    L3 | L1,
    L3 | L2,
    L3 | L2 | L1,
    L4,
    L4 | L1,
    L4 | L2,
    L4 | L2 | L1,
    L4 | L3,
    L4 | L3 | L1,
    L4 | L3 | L2,
    L4 | L3 | L2 | L1
};

// using a full lookup for the masks (space vs time tradeoff)
void binc3(int value) {
    uint32_t set = masks2[value];
    uint32_t clr = ~masks2[value] & ALLLEDS;
    LPC_GPIO1->FIOSET = set;
    LPC_GPIO1->FIOCLR = clr;
}

And the results from this?

Timing Assembly
 - 59377 us
Timing C translation
 - 56252 us
Timing C reg-writes outside the loop
 - 53126 us
Timing C with mask lookup table
 - 15626 us

So this algorithmic change just bought us a big speedup. There are probably others too, or better ways to do it. In some ways this is moving away from expressing the problem to expressing how to solve it, but is a classic space-time translation so the code is actually pretty clear.

Finally, I then just want to show a last thing. In the code, i've been calling it as you have using i % 16. Whilst I realise this is just a test, it highlights that efficiencies can be gained in lots of places. By changing it to i & 0xF, the compiler doesn't have to do a full modulo calculation:

    START("C with mask lookup table");
    for(int i=0; i<LOOPS; i++) {
         binc3(i % 16);
    }
    STOP();

    START("C with mask lookup table, but caller loop optimised");
    for(int i=0; i<LOOPS; i++) {
         binc3(i & 0xF);
    }
    STOP();

So here are the final results:

Timing Assembly
 - 59377 us
Timing C translation
 - 56252 us
Timing C reg-writes outside the loop
 - 53126 us
Timing C with mask lookup table
 - 15626 us
Timing C with mask lookup table, but caller loop optimised
 - 12501 us

So that shows around 4-5x speedup. And in C :) For reference, the whole code is here:

Import program IntToBinaryAssembly

Examples of C vs ASM optimisation

This is not really meant to be specific to your example (some of what I've done may not be applicable to your system/constraints), and in fact the code may not even be quite equivalent/bugfree, but hopefully highlights that a compiler does a very good job. It worries about a load of the housework as well as optimisations, which means you can spend more time thinking about how to map a problem.

Hope that was useful!

Simon

Hi JP, Igor, This is an interesting thread, and great to see it working and some assembly optimisations! It also seems like a good opportunity to bang a little drum to highlight the fact that often writing code in C can give equally good results. My argument would generally be, if you can get nearly as good results in C, you should write it in C. It is more understandable, maintainable and portable, and much less prone to bugs. The compiler can do all sorts of admin on our behalf (register allocation, function entry, exit, type checking etc), but also apply lots of optimisations. Writing in assembly can often be slower and make you focus on a very small window, so you don't step back nor want to rework it at a higher level. So here is a by-example walkthrough, based on your code. Here is the C equivalent of the asm routine, but actually written in a more high-level expressive way. Note for example, every loop is expressing the bit extraction as a function of i: <<code>> #define L1 0x040000 #define L2 0x100000 #define L3 0x200000 #define L4 0x800000 #define ALLLEDS (L1 | L2 | L3 | L4) const uint32_t masks[] = {L1, L2, L3, L4}; void binc(int value) { LPC_GPIO1->FIOCLR = ALLLEDS; value += 1; for(int i=0; i<4; i++) { if((value >> i) & 1) { LPC_GPIO1->FIOSET = masks[i]; } else { LPC_GPIO1->FIOCLR = masks[i]; } } } <</code>> Now, how much do you pay for writing it in C? {{{ Timing Assembly - 59377 us Timing C translation - 56252 us }}} You get faster code! But the real advantage of C is how quickly you can optimise. And by this I don't mean doing the compilers job, I mean expressing the problem in a better algorithmic or functional way. For example, writing to peripheral memory-map registers will be slower than internal core registers. But the compiler has to honour every peripheral read and write you express in your code; it can't optimise them away. So why not minimise the writes to the peripheral registers by packaging up the things you want to write: <<code>> // creating the mask values before writing them to registers void binc2(int value) { value += 1; uint32_t set = 0; uint32_t clr = 0; for(int i=0; i<4; i++) { if((value >> i) & 1) { set = masks[i]; } else { clr = masks[i]; } } LPC_GPIO1->FIOSET = set; LPC_GPIO1->FIOCLR = clr; } <</code>> Now we get: {{{ Timing Assembly - 59377 us Timing C translation - 56252 us Timing C reg-writes outside the loop - 53126 us }}} Faster still! Now, a standard trade-off you might make is a space-time trade-off. i.e. I spend code/data space in return for faster execution. In this example, i'll use data space by creating a lookup of my LED translations: <<code>> const uint32_t masks2[] = { L1, L2, L2 | L1, L3, L3 | L1, L3 | L2, L3 | L2 | L1, L4, L4 | L1, L4 | L2, L4 | L2 | L1, L4 | L3, L4 | L3 | L1, L4 | L3 | L2, L4 | L3 | L2 | L1 }; // using a full lookup for the masks (space vs time tradeoff) void binc3(int value) { uint32_t set = masks2[value]; uint32_t clr = ~masks2[value] & ALLLEDS; LPC_GPIO1->FIOSET = set; LPC_GPIO1->FIOCLR = clr; } <</code>> And the results from this? {{{ Timing Assembly - 59377 us Timing C translation - 56252 us Timing C reg-writes outside the loop - 53126 us Timing C with mask lookup table - 15626 us }}} So this algorithmic change just bought us a big speedup. There are probably others too, or better ways to do it. In some ways this is moving away from expressing the problem to expressing how to solve it, but is a classic space-time translation so the code is actually pretty clear. Finally, I then just want to show a last thing. In the code, i've been calling it as you have using {{{i % 16}}}. Whilst I realise this is just a test, it highlights that efficiencies can be gained in lots of places. By changing it to {{{i & 0xF}}}, the compiler doesn't have to do a full modulo calculation: <<code>> START("C with mask lookup table"); for(int i=0; i<LOOPS; i++) { binc3(i % 16); } STOP(); START("C with mask lookup table, but caller loop optimised"); for(int i=0; i<LOOPS; i++) { binc3(i & 0xF); } STOP(); <</code>> So here are the final results: {{{ Timing Assembly - 59377 us Timing C translation - 56252 us Timing C reg-writes outside the loop - 53126 us Timing C with mask lookup table - 15626 us Timing C with mask lookup table, but caller loop optimised - 12501 us }}} So that shows around 4-5x speedup. And in C :) For reference, the whole code is here: <<program /users/simon/programs/IntToBinaryAssembly/latest>> This is not really meant to be specific to your example (some of what I've done may not be applicable to your system/constraints), and in fact the code may not even be quite equivalent/bugfree, but hopefully highlights that a compiler does a very good job. It worries about a load of the housework as well as optimisations, which means you can spend more time thinking about how to map a problem. Hope that was useful! Simon

Igor Skochinsky

# 28 Jan 2011

Ah, I wanted to do that too but you were quicker :) BTW, The last two will be the same if you switch to unsigned int for the loop counter.

Simon Ford

# 28 Jan 2011

And just a final example, to stick up for using abstraction :) Here are the same examples using mbed libraries!

PortOut ledport(Port1, ALLLEDS);

void binc6(int value) {
    uint32_t set = 0;    
    for(int i=0; i<4; i++) {
        if((value >> i) & 1) {
            set |= masks[i];
        } 
    }
    ledport = set;
}

void binc7(int value) {
    ledport = masks2[value];
}

With the following results respectively:

Timing PortOut
 - 68289 us
Timing PortOut lookup
 - 15626 us

So you can see, it is still almost as good! The last example turns the whole ASM routine in to a lookup table and 1 line of code! And is 4x faster!

Igor: Exactly. It is often things like a compiler having to assume a number could be anything, when you know its limits, that limits further optimisation. In this case, it would actually be possible for a compiler to optimise this as the loop limits are explicit, but in this case it obviously doesn't as I suspect it is not too common.

Simon

Important changes to forums and questions