speed of multiply - use as DSP?

19 Feb 2010 . Edited: 19 Feb 2010

Just got the board - very nice!  I wonder if it can be used as a poor man's DSP?  What is the speed of a 32x32-bit multiply?

In our current project, we use a Propeller (www.parallax.com) to read data from an audio A/D at 80 kHz.  The decimation to lower sample rates is done... poorly... inside the Propeller right now, and I wonder if the mbed can be used as a sort of DSP co-processor?  In the approx. 12 µs between samples, how many FIR filter operations, for example, can be performed?

We are considering going to a "proper" DSP chip but this device looks promising as it can do other things as well...

Thanks for your help, and I look forward to your comments.

Nice job on the board, again!

Sincerely,
Sridhar

 

geopebble board

19 Feb 2010 . Edited: 19 Feb 2010

Check my port of two DSP libs: FFT. There is also going to be a lib from NXP, but it's not public yet.

As for multiply speed, the Cortex-M3 TRM says:

Multiply: 1 or 2 cycles.
MUL, MLA, and MLS. MUL is one cycle and MLA and MLS are two cycles.

Multiply with 64-bit result: 3-7 cycles. Cycle count based on input sizes. That is, ABS(inputs) < 64K terminates early.
UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size of source values. These are interruptible (abandoned/restarted), with worst case latency of one cycle. MLAL versions take four to seven cycles and MULL versions take three to five cycles. For MLAL, the signed version is one cycle longer than the unsigned.

"A" instructions are accumulating, i.e. a = b*c + d.
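To make the accumulate pattern concrete, here is a minimal sketch of one FIR output sample written as a chain of a = b*c + d steps. The function name `fir_sample` and the choice of 16-bit samples and coefficients are my assumptions for illustration, not something from the TRM or from either DSP lib. Whether the compiler actually emits MLA for the inner statement depends on the toolchain and optimization settings.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: one FIR output, y = sum(coeff[k] * x[k]), k = 0..ntaps-1.
 * Inputs are 16-bit, so each product fits in 32 bits. The statement
 * acc += c * x is the "a = b*c + d" shape that an accumulating multiply
 * can implement. The caller must make sure the sum does not overflow
 * int32_t (e.g. by scaling the coefficients). */
int32_t fir_sample(const int16_t *coeff, const int16_t *x, size_t ntaps)
{
    int32_t acc = 0;
    for (size_t k = 0; k < ntaps; k++) {
        acc += (int32_t)coeff[k] * (int32_t)x[k];
    }
    return acc;
}
```

Note that each tap also needs two loads plus loop overhead, so the real per-tap cost will be higher than the 2 cycles quoted for MLA alone. Measuring on the board is the only reliable way to get the number.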

BTW, why stop at "coprocessor"? I'm quite sure a Cortex-M3 can do everything that Propeller can, only better :)

19 Feb 2010

I found some numbers in a recent NXP press release. They are given for the newest 120 MHz model, so expect execution times about 20% longer (a factor of 120/100) on the mbed's processor, which runs at 100 MHz.

"With 256-point 16-bit FFT execution time of less than 190 µs, this is 54 percent faster than the nearest Cortex-M3 alternative and challenges low-cost DSPs in performance. For 1024-point 16-bit FFT the execution time is less than 0.89 ms. These times include the FFT initialization and overhead of the algorithm."
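As a rough sketch of the clock scaling: the quoted figures are execution times, so a slower clock makes them longer. The helper name `scale_time_us` is made up for this example, and it assumes the cycle count stays the same at both clock speeds, which ignores things like flash wait states.

```c
/* Hypothetical helper: scale an execution time measured at one core clock
 * to another core clock, assuming the cycle count does not change. */
double scale_time_us(double t_us, double f_measured_mhz, double f_target_mhz)
{
    return t_us * f_measured_mhz / f_target_mhz;
}
```

With the press release numbers, `scale_time_us(190.0, 120.0, 100.0)` gives 228 µs for the 256-point FFT and `scale_time_us(890.0, 120.0, 100.0)` gives 1068 µs for the 1024-point FFT. Since the originals are stated as "less than", treat these as approximate upper bounds.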

The documentation has some numbers too.

Igor - thanks for the info.  To be clear, when you say the MUL or MLA takes 1 or 2 cycles, that would be one or two clock cycles of the 100MHz system clock?  In other words 10 or 20 ns, respectively?

Yes, I have considered moving from the prop but we have invested a fair bit of blood and sweat into it... there would have to be a large upside to move... both look like they have a place. (don't want to start a religious war! :) )
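For reference, the arithmetic behind the cycle question can be sketched like this. The function names are invented for the example. It assumes the core runs at the 100 MHz mentioned above and that the cycle counts in the TRM are core clock cycles.

```c
#include <stdint.h>

/* Hypothetical helper: duration of n core clock cycles in nanoseconds. */
double cycles_to_ns(uint32_t cycles, double clock_mhz)
{
    return cycles * 1000.0 / clock_mhz;
}

/* Hypothetical helper: whole core clock cycles between consecutive samples. */
uint32_t cycles_per_sample(uint32_t clock_hz, uint32_t sample_rate_hz)
{
    return clock_hz / sample_rate_hz;
}
```

At 100 MHz, `cycles_to_ns(1, 100.0)` is 10 ns and `cycles_to_ns(2, 100.0)` is 20 ns. For 80 kHz sampling, `cycles_per_sample(100000000u, 80000u)` is 1250 cycles per sample (12.5 µs). That is the total budget for loads, multiplies, loop overhead and any interrupt handling, so the number of FIR taps that fit will be noticeably smaller than 1250 divided by the MLA cycle count.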