Why is the mbed so slow?

10 Mar 2010 . Edited: 20 Mar 2010

Hey guys,

I got my mbed about 10 hours ago and have been playing with it. I decided to test the power of the 100MHz Cortex-M3 behind the mbed by seeing how smoothly it can drive a 320 x 240px QVGA LCD I got from ThaiEasyElec (TEE). I previously ran it off an 8MHz ATMega32, and earlier today managed to get it working on my XMega128a1 that runs at 32MHz. I am using the library provided by TEE for the ELT240320TP LCD. The LCD is controlled via an 8-bit parallel bus for data, and 6 control pins for select, read, write, backlight and reset.

The 32MHz XMega ran their demo code (filling the screen with a color, drawing six rectangles and four squares, as well as rendering a line of text) in roughly 800 ms. I timed it with my iPhone as a stopwatch, so treat that figure as approximate.

The exact same library was ported to the mbed using the APIs documented on the site and in the compiler, such as DigitalOut, BusOut, etc. This same exact test took over TEN seconds to complete on the mbed. I'm sorry, but as far as I know a 100MHz ARM Cortex-M3 is supposed to be faster than a 32MHz XMega, never mind an 8MHz ATMega32, or else NXP is in trouble.

I even tried using the ClockControl library to bump the speed up to 128MHz -- which worked, but barely made it any faster.

Are the APIs so slow? I know Arduino's built-in digitalWrite functions are pretty slow but wow - I'd expect an LPC1768 to handle it.

-robodude666

10 Mar 2010

Hi Robodude,

The LPC1768 is a pretty nippy MCU, and the mbed libraries are pretty efficient, so I am surprised you are seeing sub 8MHz Arduino performance. I'd suggest it is something about your configuration or software, rather than the mbed itself.

Let's step through this and see where the problem is likely to be.

First off, it sounds like cranking the CPU clock up from 100MHz to 128MHz made little difference, so I'd suggest that the CPU isn't the bottleneck here. If the program was CPU bound, you'd have seen close to a 30% increase in speed.

The IO libraries are pretty efficient. If you want to do some testing and benchmarking, that would be great (even better, publish the results back here!). You could do something like :

#include "mbed.h"

DigitalOut foo(p20);

int main() {
    while(1) {
        foo = !foo;   // toggle the pin as fast as the API allows
    }
}

This would hammer the IO as fast as possible, and if you have access to an oscilloscope, you could measure just how fast you can get around this IO loop.

However, the SPI interface isn't bit-bashed; there is dedicated hardware, so as long as data is being fed to the SPI block fast enough, the transfer rate to the display is only bounded by the frequency that you've set the SPI interface to.

http://mbed.org/handbook/SPI

You'll notice that the SPI interface defaults to 1MHz, so that just about anything that you connect will work, which is great for a newbie. The more advanced users will appreciate the consequences of ramping this up to 10MHz, so the high speed stuff is left as a choice.

If you are happy to publish your code, myself and other mbed'ers will be able to take a look and maybe make some suggestions. Simply right click your project in the compiler and select publish, then post the generated URL to this thread.

Hope that helps,

Chris

 

 

10 Mar 2010 . Edited: 10 Mar 2010

Chris,

Thank you for your reply. Sorry if I sounded rude, as I didn't intend to. I was expecting to have my pants blown off after jumping from an ATMega to an ARM Cortex-M3.

I wrote the following Parallel IO Benchmark, which counts from 0 to 255 and outputs it to a bus consisting of p5-p12, the pins I use for the data bus for the LCD. I then connected a USB Logic Analyzer and sampled it @ 24MHz for 250 million samples. The results were quite good, as it was able to count from 0 to 255 in under 1ms, or about 3us per write.

#include "mbed.h"

BusInOut lcd_db(p5, p6, p7, p8, p9, p10, p11, p12);

int main() {
    lcd_db.output();

    while(1) {
        // count 0..255 on the bus, one write per value
        for(int i = 0; i < 256; i++) {
            lcd_db = i;
        }
    }
}

I then loaded my QVGA program that I tested with last night and monitored the I/O with the logic analyzer again. The data output looked fine, but the control pins were most interesting. It took 180 nanoseconds to toggle CS, which is great as that's comfortably above the minimum of 50 nanoseconds the datasheet specifies. The interesting thing is there was over 41us in between writes. A very long delay, considering the XMega @ 32MHz takes only 1.8us.
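To put that gap in perspective, here is a rough back-of-the-envelope sketch. It assumes one gap per pixel and a single full-screen fill of the 320 x 240 panel, and it ignores everything else the demo does; `fill_time_ms` is a hypothetical helper name, not part of any library.

```cpp
#include <cstdint>

// Screen geometry from the first post: 320 x 240 QVGA.
constexpr std::uint32_t kWidth  = 320;
constexpr std::uint32_t kHeight = 240;

// Hypothetical helper: time in milliseconds to fill the whole screen when
// consecutive pixel writes are separated by gap_us microseconds.
constexpr double fill_time_ms(double gap_us) {
    return kWidth * kHeight * gap_us / 1000.0;
}

// 76,800 pixels at 41 us each come to a little over 3.1 s for one fill.
static_assert(fill_time_ms(41.0) > 3100.0 && fill_time_ms(41.0) < 3200.0,
              "estimate for the 41 us gap");
// The same fill at 1.8 us per pixel comes to roughly 138 ms.
static_assert(fill_time_ms(1.8) > 130.0 && fill_time_ms(1.8) < 145.0,
              "estimate for the 1.8 us gap");
```

If the gap applies per byte rather than per pixel, the numbers double, so a demo with a fill plus several shapes landing above ten seconds is at least plausible from the gap alone.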

This seems to hint that the problem is actually in the code, somewhere, even though it's a direct port from my working XMega project. The only thing I did was rewrite the macros that handle clearing/setting a bit to support the mbed's API, as well as defining all of the I/O variables I need. I will now go back and rewrite the project from scratch -- hopefully that helps things.

Cheers,

-robodude666

10 Mar 2010

Chris,

Sorry for the double post but I've rewritten a few things. I've published my application: QVGATest and would appreciate it if it can be looked over. Everything is stored in main.cpp because I had problems including a header file in multiple cpp files without getting a "multiple definition" error -- I'll separate it into multiple files when performance goes up.

The application is nearly identical to the one I used for my ATMega32 or XMega128 except that the write/clear lines have been replaced with mbed's I/O API. A BusInOut is also being used to represent what would be called an 8-bit PORT.

Both the mbed and XMega128 toggle CS in about 180 nanoseconds. Only the mbed has a 40us delay between function calls, i.e. between pushing out a byte and the next one being pushed out to the I/O.

-robodude666

10 Mar 2010

Hi Robodude,

Sorry if my reply sounded defensive, I was going for "jaunty, welcoming and co-operative" :-)

On first glance at your code in the forum and the example you posted, I'd suspect that the use of the BusIn, BusOut, BusInOut would be the first thing to take a look at. The bus object is basically a composition of DigitalIn/Out objects, as the pinout of the LPC1768 meant there was not an entire 8 bit bus brought out. When you access a bus object, the required reads/writes are carried out on the appropriate slices of the appropriate GPIO blocks and the results composited. This clearly has overhead compared to a single bit GPIO access.
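The difference can be illustrated with a small host-side model. This is only a sketch, not the actual mbed library implementation: `FakePort`, `composite_write` and `contiguous_write` are hypothetical names, and the model simply counts register accesses for a bus built from eight independent pins versus eight consecutive bits of one port.

```cpp
#include <cstdint>

// Host-side stand-in for a GPIO port register; counts how often it is written.
struct FakePort {
    std::uint32_t pins = 0;
    unsigned writes = 0;

    // Change a single bit: one register access.
    void set_bit(unsigned bit, bool level) {
        pins = level ? (pins | (1u << bit)) : (pins & ~(1u << bit));
        ++writes;
    }

    // Change a whole field of bits: still one register access.
    void write_field(unsigned shift, std::uint32_t mask, std::uint32_t value) {
        pins = (pins & ~(mask << shift)) | ((value & mask) << shift);
        ++writes;
    }
};

// Bus built from eight independent pins, each with its own bit position:
// one register access per pin.
inline void composite_write(FakePort& port, const unsigned (&bits)[8],
                            std::uint8_t value) {
    for (unsigned i = 0; i < 8; ++i) {
        port.set_bit(bits[i], ((value >> i) & 1u) != 0);
    }
}

// Bus on eight consecutive bits: one register access for the whole byte.
inline void contiguous_write(FakePort& port, unsigned shift, std::uint8_t value) {
    port.write_field(shift, 0xFFu, value);
}
```

Both functions leave the port in the same state; the composite version just needs eight accesses (plus call overhead on real hardware) where the contiguous one needs a single access.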

The QVGA touchscreen LCD we've used here was the one from Embedded Artists, and we found that it has a SPI interface (as well as a parallel one). Running this SPI interface at 12MHz gave pretty good performance, and might even outperform a parallel interface using the Bus objects.

If you only have the option to use the parallel interface, you might get better throughput using a SPI or I2C Serial->Parallel shift register, clocked as high as it will go.

I think I need to get the scope out and do some tests with different bus configurations...

Hope that helps...

Chris

 

10 Mar 2010

Chris,

Not at all -- I apologized just in case I did come off rude. You were indeed jaunty, welcoming and cooperative.

I considered the Embedded Artists' OLED QVGA but at $150 shipped from Europe, without any documentation before purchasing it, it didn't feel like a worthwhile buy.

I have a couple of 74HC595 ICs that I can pull out and test. I haven't played with them much but I know NXP does a good job on their datasheets, so it shouldn't be hard to figure out.

The Driver IC used by my LCD does support SPI but the LCD itself only supports 8-bit or 16-bit parallel (SPI is not brought out in the ribbon cable). Even worse, the breakout board only brings out 8-bit parallel.

-robodude666

10 Mar 2010

If you can't get around the parallel interface, one approach that should help is to choose pins that sit on the same port (e.g. P0) and then use "Fast GPIO" to set their values in one go. For mapping between mbed's pin numbers and LPC's ports see PinNames.h. There is even one chunk of 8 pins in a row, so you could write the value with just one shift. So, if you use the following pins:

P0_4 p30
P0_5 p29
P0_6 p8
P0_7 p7
P0_8 p6
P0_9 p5
P0_10 p28
P0_11 p27

you should be able to do something like this:

LPC_PINCON->PINSEL0 &= 0xFF0000FF; // set P0.4 to P0.11 as GPIOs (function = 0b00)
LPC_GPIO0->FIODIR |= 0xFF<<4; // set P0.4 to P0.11 as outputs (dir = 1)
LPC_GPIO0->FIOMASK = ~(0xFF<<4); // only change P0.4 to P0.11 (mask bits = 0)

// later

LPC_GPIO0->FIOPIN = byte_value << 4; // set values of P0.4 to P0.11 from byte_value


See the LPC17xx User Manual for the description of the registers (chapters 8 and 9) and LPC17xx.h for their definitions.
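The three constants used above can be derived from the pin numbers rather than written by hand. The following is a minimal compile-time sketch; `pin_mask`, `pinsel_clear_mask` and `to_port` are hypothetical helper names, and the only hardware fact assumed is the one stated in the comments above (one direction/mask bit per pin, two PINSEL bits per pin).

```cpp
#include <cstdint>

// Bit mask covering `count` consecutive port pins starting at pin `first`.
// Precondition: 1 <= count <= 31 and first + count <= 32.
constexpr std::uint32_t pin_mask(unsigned first, unsigned count) {
    return ((1u << count) - 1u) << first;
}

// PINSEL registers use two bits per pin. This is the AND-mask that clears
// both function bits (selecting GPIO, 0b00) for the given pins.
// Precondition: the pins lie within 0..15 (the range PINSEL0 covers)
// and count <= 15.
constexpr std::uint32_t pinsel_clear_mask(unsigned first, unsigned count) {
    return ~(((1u << (2 * count)) - 1u) << (2 * first));
}

// Position a byte on the bus so that bit 0 lands on pin `first`.
constexpr std::uint32_t to_port(std::uint8_t byte_value, unsigned first) {
    return static_cast<std::uint32_t>(byte_value) << first;
}

// The values for P0.4 to P0.11 match the constants in the snippet above.
static_assert(pinsel_clear_mask(4, 8) == 0xFF0000FFu, "PINSEL0 AND-mask");
static_assert(pin_mask(4, 8) == (0xFFu << 4), "FIODIR OR-mask");
static_assert(~pin_mask(4, 8) == 0xFFFFF00Fu, "FIOMASK value");
```

Because everything is `constexpr`, a mistake in the pin range shows up as a failed `static_assert` at compile time instead of a misbehaving pin.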

10 Mar 2010

Hi Igor,

Thanks for that... I recalled (wrongly), having pored over the exact pinout for literally weeks, that there were not 8 consecutive bits from a port available at the DIP pins.  I guess the 8 bit bus did make it in after all!

I guess PinMap.h is more reliable than my memory!

Thanks!
Chris

10 Mar 2010

Igor,

Thanks! I noticed the available pins on P0.4-P0.11 on the schematic but had no idea how to write directly to a port as I haven't used an LPC before.

I'll test out both the shift register and direct LPC_GPIO access and see which works best.

-robodude666

10 Mar 2010
Chris Styles wrote:

Hi Igor,

Thanks for that... I recalled (wrongly), having pored over the exact pinout for literally weeks, that there were not 8 consecutive bits from a port available at the DIP pins.  I guess the 8 bit bus did make it in after all!

I guess PinMap.h is more reliable than my memory!

I copied the list to a file and sorted by port number. After that the sequence was easy to spot :) Just in case someone else needs it, here's the complete sorted list grouped by consecutive chunks.

P0_0   p9   
P0_1   p10  

P0_2   USBTX
P0_3   USBRX

P0_4   p30  
P0_5   p29  
P0_6   p8   
P0_7   p7   
P0_8   p6   
P0_9   p5   
P0_10  p28  
P0_11  p27  

P0_15  p13  
P0_16  p14  
P0_17  p12  
P0_18  p11  

P0_23  p15  
P0_24  p16  
P0_25  p17  
P0_26  p18  

P1_18  LED1   

P1_20  LED2   
P1_21  LED3   

P1_23  LED4   

P1_30  p19  
P1_31  p20  

P2_0   p26  
P2_1   p25  
P2_2   p24  
P2_3   p23  
P2_4   p22  
P2_5   p21

 

I think it might be worth it to make a "FastBus" class that would cover those 8 pins and work with them as fast as possible. And maybe another one for the 6-pin group of P2.
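A minimal sketch of what such a class could look like is given below. It is an illustration under stated assumptions, not an mbed API: `FastBus` and `FakeGpio` are hypothetical, the template only requires a port type with `FIODIR` and `FIOPIN` members (the register names used earlier in the thread), and it uses a read-modify-write of `FIOPIN` instead of `FIOMASK` so that the same code can be exercised on a PC.

```cpp
#include <cstdint>

// Host-side stand-in with the two register names used above. On the mbed the
// template would be instantiated with the LPC_GPIO0 structure instead
// (an assumption; this sketch has not been run on hardware).
struct FakeGpio {
    volatile std::uint32_t FIODIR = 0;
    volatile std::uint32_t FIOPIN = 0;
};

// Hypothetical "FastBus": Width consecutive pins of one port, starting at Shift.
template <typename Port, unsigned Shift, unsigned Width>
class FastBus {
    static_assert(Width >= 1 && Width < 32 && Shift + Width <= 32,
                  "bus must fit inside one 32-bit port");

public:
    explicit FastBus(Port& port) : port_(port) {}

    // Mask of the port bits that belong to this bus.
    static constexpr std::uint32_t mask() {
        return ((1u << Width) - 1u) << Shift;
    }

    void output() { port_.FIODIR = port_.FIODIR | mask(); }   // pins as outputs
    void input()  { port_.FIODIR = port_.FIODIR & ~mask(); }  // pins as inputs

    // Update only the bus pins with one read-modify-write of the port.
    void write(std::uint32_t value) {
        port_.FIOPIN = (port_.FIOPIN & ~mask()) | ((value << Shift) & mask());
    }

    std::uint32_t read() const { return (port_.FIOPIN & mask()) >> Shift; }

private:
    Port& port_;
};
```

For the 8-pin group this would be used as `FastBus<FakeGpio, 4, 8>` and for the P2 group as `FastBus<FakeGpio, 0, 6>`. On the real part, setting `FIOMASK` once and then writing `FIOPIN` directly avoids the read in `write()`.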

11 Mar 2010 . Edited: 11 Mar 2010

The I/O is pretty slow,

the ugly routine below generates a square wave of about 1.55 MHz; if I use "myled = !myled;" the speed drops to about half. I tried using an I/O pin instead of the declared LED pin and it made no difference.

I'm using a digital oscilloscope that I trust.

I can't see the assembly code here but something is adding lots of delay.

I would expect to see a square wave at least ten times this speed if we are really running at 100MHz.

The weird thing is that the loop time is not measurable; both high and low times are about 320ns.

 

#include "mbed.h"

DigitalOut myled(p24);

int main() {
    while(1) {
        myled = 0x0;
        myled = 0x1;
      
    }
}

 

What gives?

 

I'm editing this to add that I just tried a similar toggle of I/O using the LPCXPRESSO on the LPC-LINK board with the LPC1343 module and it gave similar toggle times to the MBED module. Must be some kind of trick to getting faster I/O rates?

12 Mar 2010
Rocky Lavine wrote:
the ugly routine below generates a square wave of about 1.55 MHz; if I use "myled = !myled;" the speed drops to about half

Same results posted here.

12 Mar 2010

I've done some investigation and could get some improvement: Fast GPIO with C++ templates

12 Mar 2010

I ran the simple code on my mbed as follows

LPC_GPIO2->FIODIR |= TEST_1;

while(1){
    LPC_GPIO2->FIOSET = TEST_1;
    LPC_GPIO2->FIOCLR = TEST_1;
}

   

I measured the delay between the FIOSET and FIOCLR lines as 10 ns (corresponding closely to the 96 MHz clock) and the overall time for the ‘while’ loop as 40 ns.
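For comparison with the earlier scope readings, the measured times can be converted to clock cycles. This is a small arithmetic sketch; `cycles` and `freq_mhz` are hypothetical helper names, and a 96 MHz core clock is taken from the measurement above.

```cpp
// Number of CPU clock cycles that fit in a measured interval, given the
// interval in nanoseconds and the core clock in MHz.
constexpr double cycles(double interval_ns, double clock_mhz) {
    return interval_ns * clock_mhz / 1000.0;
}

// Frequency in MHz of a signal with the given period in nanoseconds.
constexpr double freq_mhz(double period_ns) {
    return 1000.0 / period_ns;
}

// 10 ns between FIOSET and FIOCLR is about one cycle at 96 MHz.
static_assert(cycles(10.0, 96.0) > 0.9 && cycles(10.0, 96.0) < 1.0, "one cycle");
// The 40 ns loop is about four cycles, i.e. a 25 MHz square wave.
static_assert(cycles(40.0, 96.0) > 3.8 && cycles(40.0, 96.0) < 3.9, "four cycles");
static_assert(freq_mhz(40.0) == 25.0, "loop frequency");
// The 320 ns per edge seen earlier with DigitalOut is roughly 31 cycles.
static_assert(cycles(320.0, 96.0) > 30.0 && cycles(320.0, 96.0) < 31.0, "DigitalOut");
```

Read this way, the direct register writes cost about one cycle each, while each DigitalOut assignment in the earlier test cost on the order of thirty.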

 

John.

15 Mar 2010
John Robbins wrote:

I ran the simple code on my mbed as follows

LPC_GPIO2->FIODIR |= TEST_1;

while(1){
    LPC_GPIO2->FIOSET = TEST_1;
    LPC_GPIO2->FIOCLR = TEST_1;
}

   

I measured the delay between the FIOSET and FIOCLR lines as 10 ns (corresponding closely to the 96 MHz clock) and the overall time for the ‘while’ loop as 40 ns.

 

John.

If you don't mind, what else do I need to run this loop in the mbed compiler?

I'm pretty new to this stuff.

16 Mar 2010

The complete code is

#include "mbed.h"

DigitalOut led4(LED4);
DigitalOut led3(LED3);
DigitalOut led2(LED2);
DigitalOut led1(LED1);

#define TEST_1 (1 << 5)
#define START_PULSE (1 << 24)
#define STOP_PULSE (1 << 25)

volatile int count;
volatile int loop_count;

int main() {
    led4 = 1;

    LPC_GPIO2->FIODIR |= TEST_1;
    LPC_GPIOINT->IO0IntEnR = START_PULSE | STOP_PULSE;    // 1<<24 | 1<<25
    LPC_SC->PCONP |= 1 << 1;    // Timer0 power on
    LPC_TIM0->PR = 0;           // no prescaler
    LPC_TIM0->CTCR = 0x0;       // default timer mode
    LPC_TIM0->TCR = 0x0;        // disable counter
    count = 0;
    loop_count = 0;

    while(1) {
        while(LPC_GPIOINT->IntStatus == 0);                      // look for either interrupt, start will be first
        LPC_TIM0->TCR = 0x1;                                     // start timer
        LPC_GPIO2->FIOSET = TEST_1;
        while((LPC_GPIOINT->IO0IntStatR & STOP_PULSE) == 0);     // wait for stop (1 << 25)
        LPC_TIM0->TCR = 0x0;                                     // stop timer
        LPC_GPIO2->FIOCLR = TEST_1;
        count += LPC_TIM0->TC;
        LPC_GPIOINT->IO0IntClr |= STOP_PULSE;                    // clear stop pulse (1 << 25)
        while((LPC_GPIOINT->IO0IntStatR & START_PULSE) == 0);    // check start pulse (1 << 24)
        LPC_TIM0->TCR = 0x2;                                     // reset timer
        LPC_GPIOINT->IO0IntClr |= START_PULSE;                   // clear start pulse (1 << 24)
        if(loop_count++ >= 10000) {
            loop_count = 0;
            led1 = 0;
            led2 = 0;
            led3 = 0;
            led4 = 0;
            if(count < 150000)
                led1 = 1;
            else if(count < 200000)
                led2 = 1;
            else if(count < 250000)
                led3 = 1;
            else
                led4 = 1;
            count = 0;
        }
    }
}


This was simply to test a way of measuring short time intervals between two input pulses. The result is shown on the LEDs - if the delay is less than 630 ns, LED1 lights, less than 850 LED2, less than 1050 LED3, else LED4.
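The LED thresholds can be related to the `count` comparisons in the code with a little arithmetic. The sketch below rests on two assumptions that are not stated in the post: the core runs at 96 MHz, and Timer0 is still on its reset-default peripheral clock of CCLK/4 (24 MHz). `average_ns` is a hypothetical helper name.

```cpp
// Assumptions (not stated in the post): 96 MHz core clock, and Timer0 clocked
// at the reset-default CCLK/4, i.e. 24 MHz, with no prescaler.
constexpr double kTimerClockMHz = 96.0 / 4.0;

// The loop accumulates roughly 10,000 measurements before checking `count`.
constexpr double kSamples = 10000.0;

// Average interval in nanoseconds represented by an accumulated tick count.
constexpr double average_ns(double accumulated_ticks) {
    return accumulated_ticks / kSamples / kTimerClockMHz * 1000.0;
}

// count < 150000 corresponds to an average below about 625 ns (LED1).
static_assert(average_ns(150000.0) > 624.0 && average_ns(150000.0) < 626.0, "LED1");
// count < 200000 corresponds to an average below about 833 ns (LED2).
static_assert(average_ns(200000.0) > 833.0 && average_ns(200000.0) < 834.0, "LED2");
// count < 250000 corresponds to an average below about 1042 ns (LED3).
static_assert(average_ns(250000.0) > 1041.0 && average_ns(250000.0) < 1042.0, "LED3");
```

Under those assumptions the results land close to the 630, 850 and 1050 ns figures quoted above, with a timer resolution of about 42 ns per tick.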

Hope this helps,

John.

16 Mar 2010

Thanks John I ran it.

 

18 Mar 2010 . Edited: 18 Mar 2010

I know I'm a little late to this thread but I'd like to at least mention what I saw when I was testing how fast I could toggle pins, etc.

When doing low level commands such as this:
LPC_GPIO0->FIOPIN = 0x.....
I could toggle an output as fast as 48MHz (measured with a scope, basically 1 instruction per inversion).

With a for loop such as:
for (int i =0;i<16;i++) {
LPC_GPIO0->FIOPIN = waveform[i];
}
I could toggle a signal at ~6.8MHz (probably about 7 instructions per inversion)

I did however see something unusual when experimenting with all of this...  Say I used 4 bits on Port0 and basically constructed a 4 bit counter.  The counting was not done at a given interval... even if I hardcoded the values like below:

LPC_GPIO0->FIOPIN = (0x00000000);    //00000000
LPC_GPIO0->FIOPIN = (0x00000040);    //01000000
LPC_GPIO0->FIOPIN = (0x00000080);    //10000000
LPC_GPIO0->FIOPIN = (0x000000C0);   //11000000
The least significant bit(or all 4 bits for that matter) was not toggling at a fixed frequency... some toggles would take more instructions than others... I hadn't really figured it out though, as I began working on another project.

18 Mar 2010

Interesting. Were all four pins configured the same way?

18 Mar 2010 . Edited: 18 Mar 2010

It's been a while, but I believe so.

Actually, here is the code I think I used:

//Set Direction for Port 0. 1's are outputs
LPC_GPIO0->FIODIR |= (0x000003C0);

//Set pins on Port 0 that are controlled. 0's are controlled
LPC_GPIO0->FIOMASK |= (0xFFFFFC3F);

//Disable pullup/pulldown/open drain.
LPC_PINCON->PINMODE0 |= (0x000AA0A0);

I was controlling bits 6,7,8,9.  PINMODE0 looks a little weird but I think it takes 2 bits to configure each output pin.
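The two-bits-per-pin layout can be checked at compile time. This is a sketch with hypothetical names (`pinmode_bits`, `kNoPull`); it assumes the LPC17xx encoding in which mode `0b10` means neither pull-up nor pull-down, and that pin n of the port occupies bits 2n+1:2n of PINMODE0.

```cpp
#include <cstdint>

// PINMODE registers hold two mode bits per pin; pin n of P0 (n = 0..15)
// occupies bits 2n+1:2n of PINMODE0.
constexpr std::uint32_t pinmode_bits(unsigned pin, std::uint32_t mode) {
    return (mode & 0x3u) << (2 * pin);
}

// Assumed encoding: 0b10 selects "neither pull-up nor pull-down".
constexpr std::uint32_t kNoPull = 0x2u;

// Mode bits for pins 6..9, the four pins used in the post.
constexpr std::uint32_t kPins6to9 =
    pinmode_bits(6, kNoPull) | pinmode_bits(7, kNoPull) |
    pinmode_bits(8, kNoPull) | pinmode_bits(9, kNoPull);

static_assert(kPins6to9 == 0x000AA000u, "two bits per pin, pins 6..9");

// The posted constant 0x000AA0A0 additionally carries the same mode bits
// for pins 2 and 3.
static_assert((0x000AA0A0u & ~kPins6to9) ==
              (pinmode_bits(2, kNoPull) | pinmode_bits(3, kNoPull)),
              "extra bits belong to pins 2 and 3");
```

So the value does follow the two-bits-per-pin pattern for pins 6 to 9; the trailing `A0` also touches P0.2 and P0.3, which the pin list earlier in the thread shows as USBTX and USBRX.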

19 Mar 2010 . Edited: 19 Mar 2010

Maybe an M3 guru can expand on this, but the M3 has a separate I/O pin processing section that I believe can even be clocked at different rates than the main processor, so slower than instruction times or variable latency I/O is to be expected.

I'd love to run Tyler W's program that can toggle at 48MHz.

I believe that is possible based on my limited understanding of the M3's instruction set as bit banded I/O is a single clock instruction but I don't have a clear understanding of how the I/O section works or what speeds it is capable of.

20 Mar 2010

Just got this device in the mail this afternoon, noticed this thread and remembered the slow I/O I got on ancient LPC2292 devices, so I just needed to check...

Just loaded the code snippet from Chris from the second post in this thread and I measure about 781kHz; that's 8051 performance we are talking about here. I am sure a 100MHz uC must be able to do better, especially with fast I/O.

Oh, and why would it take a 9618-byte binary to toggle one I/O pin? Software problem?

20 Mar 2010

Hey guys,

Thanks for all the feedback in this thread. I'm going to give some of it a try, though I've overall given up on 32-bit microcontrollers.

I'm sorry but a 32MHz XMega128 murders the mbed. I have an 8-bit QVGA LCD connected to my XMega128. I can reach upwards of 1.44 million pixels per second output. To make it clear, a single pixel is 2 bytes of information, so my little 8-bit 32MHz XMega is pumping data to the LCD at 23Mbps (megabits per second). In worst case scenarios, when I'm doing heavy calculations, the performance drops to 370 thousand pixels per second, or about 5.9Mbps. I could use a 16-bit parallel interface and increase my performance over 3 fold: 8-bit parallel requires 6 operations (set MSB, WR low, WR high, set LSB, WR low, WR high), whereas 16-bit can be done with 2 instructions to set MSB and LSB, then simply pumping WR. In addition, the ATXMega128A1 has an EBI port that can be used to interface SDRAM, which allows me to then use the PDCA/DMA, or even simple interrupts, for even higher performance.
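The bit rates quoted above follow directly from the pixel rates. A small arithmetic sketch, with hypothetical helper names `mbps` and `frame_ms`, and the 320 x 240 panel size taken from the first post:

```cpp
// Data rate in megabits per second for a given pixel rate and pixel size.
constexpr double mbps(double pixels_per_second, unsigned bytes_per_pixel) {
    return pixels_per_second * bytes_per_pixel * 8.0 / 1e6;
}

// Time in milliseconds to push one full 320 x 240 frame at a given pixel rate.
constexpr double frame_ms(double pixels_per_second) {
    return 320.0 * 240.0 / pixels_per_second * 1000.0;
}

// 1.44 Mpixel/s at 2 bytes per pixel is about 23 Mbps.
static_assert(mbps(1.44e6, 2) > 23.0 && mbps(1.44e6, 2) < 23.1, "best case");
// 370 kpixel/s at 2 bytes per pixel is about 5.9 Mbps.
static_assert(mbps(370e3, 2) > 5.9 && mbps(370e3, 2) < 6.0, "worst case");
// At 1.44 Mpixel/s a full frame takes a little over 53 ms.
static_assert(frame_ms(1.44e6) > 53.0 && frame_ms(1.44e6) < 54.0, "frame time");
```

The frame time at the best-case rate is in line with the roughly 50 ms single-color fill mentioned in the same post.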

It is extremely unfortunate that an ARM Cortex-M3 100MHz chip can be outperformed by a 32MHz XMega or even a 12MHz ATMega644 -- and yes, that is a project where a 12MHz Mega644 can drive the QVGA LCD with a 16-bit interface and reach very high performance. The mbed LPC1768 can't even update the frame with a single color in under 10 seconds -- something that the XMega does in 50 milliseconds.

Cheers,

-robodude666

20 Mar 2010

Hi h n, All,

The main reason this operation is relatively slow at the moment is that we've got some software layers in there that are not yet optimised for speed, and in particular we've not let the compiler do inlining in the mbed library - the main thing was experimenting with an architecture that was easy to use, and compiled across different mcu implementations. I was especially interested that if we needed to change the approach/add some logic, this naturally gives some headroom to make sure things weren't slowed down if revised (people usually accept speed-ups, not slow downs!)

As you know, early optimisation is the root of all evil, but it seems the mbed API approach has fared well so I think we'll start introducing these optimisations as we improve the libraries. FYI, I've had test code running and the compiler generates code pretty close to what you'd write in ASM (usually one extra level of indirection as mapping is dynamic rather than static), so it'll be interesting to do some before/after benchmarks.

There is nothing stopping you going right down to the metal if you want real control, but with some selective optimisations we'll actually be able to get pretty close I think.

As to the code size, that is just the reflection of the other code that is there, supporting various std C libraries etc, and other functionality within objects. For example, you can do RPC calls on objects that obviously use string libraries, and objects can report an error via the serial debug if you e.g. try and allocate to an incorrect pin, hence needing some string literals and stdio. There is obviously code for setting up the stacks, PLLs etc etc. Easy to forget all the other stuff that is going on.

Of course, you could really get this down if all you wanted to do was toggle an I/O pin, and I'd certainly consider putting together a real 101 "bare metal" bootstrap environment if that might be interesting. But it is not as if your code size is going to double if you decide to toggle two pins, even if your code is quite complex. There is just a base level that naturally gets pulled in by default to give a nice environment to prototype in.

Simon

20 Mar 2010

Simon,

While 512KB flash is nearly endless amounts of workspace for the majority of projects, might I recommend doing a "Select Libraries" feature similar to AVR32 Studio's? The ability to check what functionality you want in your project (GPIO driver, SPI driver, Ethernet Driver, PLL/Clock, TC, RTC, etc) would be a nice touch and give the user the ability to only include libraries in their project that are required. In addition it can tie in the ability to import other people's works with a little AJAX autocomplete/suggestion search box for selecting projects.

Also, if you're going in to do optimizations, please allow actual bus support for pins that are contiguous. If my bus consists of P0.4-11, write to P0 with a mask rather than looking at every bit and changing each pin one at a time. Considering this is the only 8-bit wide segment of the LPC1768's pinout that is contiguous, it might even be a different object that doesn't need to be implemented within BusIn/BusOut/BusInOut.

Cheers,

-robodude666

20 Mar 2010

Hmm, that's still faster than what I get with an MSP430 at 15 MHz, but I use 3 bytes per pixel and I do some calculation on the pixels in between. At max I get 525kps, when I have to translate a 4-bit bitmap into a full blown 18-bit pixel (requires one palette lookup).

20 Mar 2010

Hi robodude,

I've just picked up on this thread, and whilst I haven't had a chance to fully dig in to it, are you saying you are getting these speeds writing to the LCD using a single 8-bit port parallel write? If it is taking 10 seconds that sounds very wrong! Sounds like a clocking problem, or somehow your code is still using the abstract BusOut interface somewhere to drive the bus (which will be very slow, as it allows totally arbitrary buses out of individual pins, which are just made out of DigitalOuts, which as discussed in this thread are not optimised).

There is no technical reason why it should not be in the same ballpark as (better than) your comparative solution unless I've misunderstood a subtlety of the requirements. Can you post an example of the driver code you've got running that is causing you the problems?

Simon

 

20 Mar 2010 . Edited: 20 Mar 2010

robodude666 wrote:

In addition it can tie in the ability to import other people's works with a little AJAX autocomplete/suggestion search box for selecting projects.

You might like an update we've got coming soon... :D

Note: Just removed swearing from title :)

20 Mar 2010 . Edited: 20 Mar 2010

Simon,

Near the top of the thread is the full QVGATest program that I used; however, below is a chunk of code that writes data to the LCD from ThaiEasyElec's AVR driver. I am aware that the example provided for the ILI9325 by ThaiEasyElec is totally wrong based on the datasheet but it does in fact work. I've since rewritten the entire library and optimized it for my XMega but have not redone anything for the mbed yet - simply because I lost faith in it.

 

void TSLCDOutDat(unsigned short dat) //write data to LCD
{
    lcd_rs.write(1);

    lcd_rd.write(1);
    lcd_wr.write(0);

    lcd_db.output();
        
    lcd_db = dat >> 8;

    lcd_cs.write(0);
    lcd_cs.write(1);

    lcd_db = dat;

    lcd_cs.write(0);
    lcd_cs.write(1);

    lcd_wr.write(1);

    lcd_db.input();
}

 

My Logic Analyzer is only capable of collecting data up to 24MHz, but based on what I saw the mbed was doing what it was supposed to but then taking upwards of 40 microseconds in between pixel writes.

While this delay may be caused by the unoptimized nature of the program, the improperly executed communication to the screen, or any number of user errors, these results were not seen on a number of other microcontrollers. I've tested the same exact code (using macro definitions) on an 8MHz ATMega32, 16MHz ATMega128, 32MHz XMega128A1, and a 33MHz AT32UC3B1256 (AVR32) with very acceptable results for their abilities.

Only the mbed features this delay in between pixel writes, using the same library with only a few edits in the header file that contains the #define macros. Aside from the architectures handling GPIO in completely different ways between the ARM and AVR, the only conclusion I can draw is that there is something in the mbed's APIs.

 

 

Simon Ford wrote:

 

robodude666 wrote:
In addition it can tie in the ability to import other people's works with a little AJAX autocomplete/suggestion search box for selecting projects.
You might like an update we've got coming soon... :D

That's good to hear, although I'd much prefer an offline version of the software to be honest. Even if it's an AIR/Cocoa app that calls mbed's online compiler script (to get around the licensing problems with the compiler, etc). While the online compiler and editor is great in theory, I find it nearly impossible to get work done at a reasonable rate. Way too many hijacks, bugs and differences that interfere with my workflow.

 

-robodude666

20 Mar 2010 . Edited: 20 Mar 2010

Hi robodude,

This code example looks like it is using the BusInOut class for the data output, rather than a port. If you have a port-based version, can you post it? If not, I think that is your problem right there!

If you want a really high bandwidth interface, that function will really need to access the port properly, just like you are doing on the AVR code. These functions that do the actual low level read/write are probably pretty much the only critical function in all of your code, so I suspect just doing the equivalent to the AVR will get you your speed.

What is great is you already did the first step of porting it across to the mbed and got it functional, so you can be confident it is basically working. Now you need to do the port write code to get the speed there. The thread looks like it has most of the info, but shout if you have any questions or need assistance; I'm sure someone can help out with the task.
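As a starting point, the write sequence from TSLCDOutDat can be modelled with whole-port operations. This is a host-side sketch only: `FakePort`, `lcd_out_dat` and the control-pin bit positions are hypothetical, the data bus is assumed to sit on P0.4 to P0.11 as suggested earlier in the thread, and the direction switching (`output()` / `input()`) is left out. On the LPC1768 the three port methods would correspond to writes to FIOSET, FIOCLR and (with FIOMASK set) FIOPIN.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pin assignment: data bus on bits 4..11, control lines on
// four other bits of the same port.
constexpr std::uint32_t kDataShift = 4;
constexpr std::uint32_t kDataMask  = 0xFFu << kDataShift;
constexpr std::uint32_t kRS = 1u << 15;
constexpr std::uint32_t kRD = 1u << 16;
constexpr std::uint32_t kWR = 1u << 17;
constexpr std::uint32_t kCS = 1u << 18;

// Host-side stand-in for one GPIO port. It records the data bus value at
// every falling edge of CS so the sequence can be checked on a PC.
struct FakePort {
    std::uint32_t pins = 0;
    std::vector<std::uint8_t> latched;

    void set(std::uint32_t mask) { pins |= mask; }

    void clear(std::uint32_t mask) {
        if ((mask & kCS) != 0 && (pins & kCS) != 0) {
            latched.push_back(
                static_cast<std::uint8_t>((pins & kDataMask) >> kDataShift));
        }
        pins &= ~mask;
    }

    void write_data(std::uint8_t byte) {
        pins = (pins & ~kDataMask) |
               (static_cast<std::uint32_t>(byte) << kDataShift);
    }
};

// Same signal order as TSLCDOutDat above, one port access per step.
template <typename Port>
void lcd_out_dat(Port& port, std::uint16_t dat) {
    port.set(kRS | kRD);                                       // RS = 1, RD = 1
    port.clear(kWR);                                           // WR = 0
    port.write_data(static_cast<std::uint8_t>(dat >> 8));      // high byte
    port.clear(kCS);                                           // strobe CS
    port.set(kCS);
    port.write_data(static_cast<std::uint8_t>(dat & 0xFFu));   // low byte
    port.clear(kCS);                                           // strobe CS
    port.set(kCS);
    port.set(kWR);                                             // WR = 1
}
```

Each pixel then costs nine port accesses instead of a chain of DigitalOut and BusInOut calls, which is the kind of change suggested in the post above.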

Simon