Timing-critical high speed data output

# 16 Jun 2011

Good afternoon all,

I am trying to write code that will produce 10 Mbit/s, Manchester-encoded data. Manchester-encoded means a '1' looks like '01' and a '0' will look like '10'. 10 Mbit/s means a bit will last 100 ns, so a half-bit will last for 50 ns; i.e. 5 clock cycles. The good news is that the mbed controller seems to be quite capable of this, and I have code that works: almost.
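The timing budget above can be checked with a few lines of arithmetic. This is only a sketch: the 100 MHz core clock is an assumption inferred from the "50 ns = 5 clock cycles" figure, and the function names are hypothetical.

```cpp
#include <cstdint>

// Core clock cycles available per Manchester half-bit.
// Each data bit is sent as two half-bits, so the pin toggles at 2x the bit rate.
constexpr uint32_t cyclesPerHalfBit(uint32_t coreClockHz, uint32_t bitRateHz) {
    return coreClockHz / (2u * bitRateHz);
}

// True when a half-bit is a whole number of core clock cycles.
constexpr bool halfBitIsExact(uint32_t coreClockHz, uint32_t bitRateHz) {
    return coreClockHz % (2u * bitRateHz) == 0u;
}

// Assumption: 100 MHz core clock, inferred from the figures in the post.
static_assert(cyclesPerHalfBit(100000000u, 10000000u) == 5u,
              "50 ns should be 5 cycles at 100 MHz");
```

If the core clock is not a multiple of 20 MHz, `halfBitIsExact` returns false and the half-bits cannot all have the same length in software-timed code.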

My first attempt was using FastOut (found here on mbed.org), and looks like this:

   
template <PinName pin> class FastOut
{
// pin = LPC_GPIO0_BASE + port * 32 + bit
// port = (pin - LPC_GPIO0_BASE) / 32
// bit  = (pin - LPC_GPIO0_BASE) % 32

    // helper function to calculate the GPIO port definition for the pin
    // we rely on the fact that port structs are 0x20 bytes apart
    inline LPC_GPIO_TypeDef* portdef() { return (LPC_GPIO_TypeDef*)(LPC_GPIO_BASE + ((pin - P0_0)/32)*0x20); };

    // helper function to calculate the mask for the pin's bit in the port
    inline uint32_t mask() { return 1UL<<((pin - LPC_GPIO0_BASE) % 32); };

public:
    FastOut()
    {
        // set FIODIR bit to 1 (output)
        portdef()->FIODIR |= mask();
    }
    void write(int value)
    { 
        if ( value )
          portdef()->FIOSET = mask();
        else
          portdef()->FIOCLR = mask();
    } 
    int read()
    {
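        // parenthesised: != binds tighter than &, so the mask must be applied first
        return ((portdef()->FIOPIN) & mask()) != 0;
    }
    FastOut& operator= (int value) { write(value); return *this; };
    // write() returns void, so return *this explicitly
    FastOut& operator= (FastOut& rhs) { write(rhs.read()); return *this; };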
    operator int() { return read(); };
};

FastOut<p5> txd;

void snd(uint32_t cmd)
{
  uint32_t invcmd;
  invcmd=~cmd;
  __disable_irq(); 
  txd=0; // start marker
  txd=1; // start marker
  txd=0; // start marker
  txd=(invcmd&1);
  __nop();
  txd=(cmd&1);
  __nop();
  txd=0; // end marker
  txd=1; // end marker
  txd=0; // end marker
  __enable_irq(); 
}

This is not fully functional code, but fortunately, it is not too complex, and you need a high-speed probe & scope in any case to debug the code. The main point here is the snd routine: It sends a start marker (010) to trigger the scope, next it sends the LSB of cmd: '10' if it is a '0', and '01' if it is a '1'. Finally, it sends an end marker to verify that the total duration of the bit is correct.

The results are as follows:

For a '0':

/media/uploads/Wazonga/print_00.gif

For a '1':

/media/uploads/Wazonga/print_01.gif

Analysing what we see (expressed in clock cycles), we get:

For a 0: 010 0011111100 010

For a 1: 010 0000000111 010

The good news is that the duration of the bit is OK, (in fact the nop's are helping to stretch the bit out to 100 ns and we would be able to do it in 80 ns if we needed to), but my problem is that the 1->0 transition of the '0' is appearing 1 clock cycle after the 0->1 transition of the '1'.

Next, I tried this:

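// 537509888 = 0x2009C000, the base address of GPIO port 0.
// p5 is P0.9, so its bit mask is 512 = 1<<9.
inline LPC_GPIO_TypeDef* portdef() { return (LPC_GPIO_TypeDef*)(537509888); }; // p5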

void snd(uint32_t cmd)
{
  uint32_t invcmd;
  invcmd=~cmd;
  __disable_irq(); 
  portdef()->FIODIR |= 512; // set direction of p5 to out
  portdef()->FIOCLR =512; // generate start marker bit
  portdef()->FIOSET =512; // generate start marker bit
  portdef()->FIOCLR =512; // generate start marker bit
  if (cmd&1){
    portdef()->FIOCLR=512;
    portdef()->FIOSET=512;
    }
  else{
    portdef()->FIOSET=512;
    portdef()->FIOCLR=512;
    }
  portdef()->FIOCLR =512; // generate end marker bit
  portdef()->FIOSET =512; // generate end marker bit
  portdef()->FIOCLR =512; // generate end marker bit
  __enable_irq(); 
}

And the result was:

For a '0': 010 001000 010

For a '1': 010 000001 010

Now, the problem has gotten worse: The 1->0 transition of the '0' appears two clock cycles before the 0->1 transition of the '1'. (The bit time is also no longer correct, but we can pad that out with nops later)
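The skew between the two mid-bit transitions can be read off the traces mechanically. The sketch below is a host-side helper with hypothetical names; it takes the data fields of the traces quoted in this post (one character per clock cycle, markers removed) and reports how many cycles the 1->0 edge of the '0' lags the 0->1 edge of the '1'.

```cpp
#include <cstddef>
#include <string>

// Index of the first sample where the trace goes from `from` to `to`,
// counted in clock cycles from the start of the data field.
// Returns -1 if the trace contains no such edge.
int edgeIndex(const std::string& trace, char from, char to) {
    for (std::size_t i = 1; i < trace.size(); ++i) {
        if (trace[i - 1] == from && trace[i] == to) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Positive result: the 1->0 edge of the '0' comes later than the 0->1 edge
// of the '1'. Negative result: it comes earlier. Zero would be ideal.
int midBitSkew(const std::string& zeroTrace, const std::string& oneTrace) {
    return edgeIndex(zeroTrace, '1', '0') - edgeIndex(oneTrace, '0', '1');
}
```

Calling `midBitSkew("0011111100", "0000000111")` on the FastOut traces gives 1, and `midBitSkew("001000", "000001")` on the if/else traces gives -2, matching the one-cycle-late and two-cycles-early observations.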

I suppose my questions are:

a) Is there any way to examine the assembly output of the compiler to better understand what it is doing?

b) Is there any way to change the compiler optimisation settings? For example, to prevent it from eliminating 'pointless' nop's?

c) Is there any way to simply 'write data to a pin'? I suppose that would get rid of the data dependency of the transition time. But if (deep down inside) we always need to work with FIOSET and FIOCLR and conditional execution depending on some value, then the problem may be quite difficult to work around, short of self-modifying code.

- Does anybody have other suggestions?

I am quite new to this so if there is something obvious I have overlooked, please feel free to comment!

Kind regards,

Arnoud.


Fred Scipione

# 17 Jun 2011

Arnoud,

The GPIO has the FIOnMASK hardware masks to restrict which bits are affected by a write. See section 9.5.5 of the User Manual. You could set the mask just after you disable interrupts. Then just write (-invcmd) followed by (-cmd) to the char, short, or int at the required GPIO port address. For a value of 0 or 1, the minus spreads bit 0 over the whole value written, and the hardware mask picks the needed bit for the output pin. Directly writing to the hardware port should mean less library code, which might eliminate the timing variations.
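A host-side model may make the masked-write idea easier to follow. This is a sketch, not driver code: `FakePort` is a hypothetical stand-in for the real register block, and it assumes the LPC17xx convention that a 0 bit in FIOnMASK lets a pin be written while a 1 bit protects it.

```cpp
#include <cstdint>

// Hypothetical model of one fast GPIO port (NOT the real LPC_GPIO_TypeDef).
// Assumption: 0 in fiomask = pin is writable, 1 = pin is protected.
struct FakePort {
    uint32_t fiomask = 0;
    uint32_t fiopin  = 0;
    // Models a write to FIOPIN: protected bits keep their old state.
    void writePin(uint32_t value) {
        fiopin = (fiopin & fiomask) | (value & ~fiomask);
    }
};

// Spread bit 0 over all 32 bits: 0 -> 0x00000000, 1 -> 0xFFFFFFFF.
inline uint32_t spreadBit0(uint32_t bit) {
    return 0u - (bit & 1u);
}

// Send one Manchester bit ('1' -> 01, '0' -> 10) with no branch on the data.
// Returns the two observed pin states packed as (first << 1) | second.
inline uint32_t sendBit(FakePort& port, uint32_t pinMask, uint32_t cmd) {
    port.fiomask = ~pinMask;                 // only our pin is writable
    port.writePin(spreadBit0(~cmd));         // first half-bit: inverted data
    uint32_t first = (port.fiopin & pinMask) != 0u;
    port.writePin(spreadBit0(cmd));          // second half-bit: data
    uint32_t second = (port.fiopin & pinMask) != 0u;
    return (first << 1) | second;
}
```

Both half-bits are produced by the same instruction sequence regardless of the data, which is the property the masked write is meant to provide; the other pins on the port keep their state.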

The code in your second example produced worse variations because of the 'if-else' construct. The ARM processor actually runs all of the generated instructions in both the 'if' and 'else' clauses every time, but with different results coming from 'conditional execution'. So, when the condition is false the code in the 'if' clause delays the execution of the 'else' clause, etc.

Have you thought about running at a CCLK of 80 MHz, and using/abusing the I2S hardware (Chapter 20) with a PCLK = CCLK/2, (1/1)/2 for the X_divide and Y_divide factors in the I2STXRATE register, and a divisor of 1 (the default) for the I2STXBITRATE? That should march data out at a 20 MHz clock rate. For every byte to be sent in Manchester code, use the byte value as an index into an array of 256 pre-computed short ints (16 bits) and put the selected short value into the I2S queue. With only 1/16th as much code to run, you won't miss the 20% reduction in CPU speed.
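The 256-entry lookup table could be built along these lines. This is a minimal sketch with hypothetical function names; it uses the convention from the first post ('1' -> 01, '0' -> 10) and assumes MSB-first bit order, which the thread does not specify. The I2S configuration itself is not shown.

```cpp
#include <array>
#include <cstdint>

// Encode one byte as 16 half-bits.
// Assumption: most significant data bit is sent first.
inline uint16_t manchesterEncode(uint8_t byte) {
    uint16_t out = 0;
    for (int i = 7; i >= 0; --i) {
        const uint16_t bit = (byte >> i) & 1u;
        // '1' -> 01, '0' -> 10
        out = static_cast<uint16_t>((out << 2) | (bit ? 0x1u : 0x2u));
    }
    return out;
}

// Pre-compute the encoding of every possible byte value.
inline std::array<uint16_t, 256> buildManchesterTable() {
    std::array<uint16_t, 256> table{};
    for (int b = 0; b < 256; ++b) {
        table[static_cast<std::size_t>(b)] =
            manchesterEncode(static_cast<uint8_t>(b));
    }
    return table;
}
```

With the table built once at start-up, sending a byte reduces to one array lookup and one write of the 16-bit result to the transmit queue.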

Fred

Arnoud van der Wel

# 17 Jun 2011

Hi Fred,

Thanks for your insightful reply.

You are correct; the if-then construction produces the timing variation.

In the first example, an if-then construction using FIOSET and FIOCLR also causes timing variation. Since 'everyone' is using FIOSET and FIOCLR with a conditional loop to write values to output ports, I (mistakenly) assumed that this was the only way to do it. In fact, you can write a value to an output pin, and this eliminates the data dependency of the output timing.

I modified Igor's FastOut routine to look like this:

template <PinName pin> class FastOut
{
// pin = LPC_GPIO0_BASE + port * 32 + bit
// port = (pin - LPC_GPIO0_BASE) / 32
// bit  = (pin - LPC_GPIO0_BASE) % 32

    // helper function to calculate the GPIO port definition for the pin
    // we rely on the fact that port structs are 0x20 bytes apart
    inline LPC_GPIO_TypeDef* portdef() { return (LPC_GPIO_TypeDef*)(LPC_GPIO_BASE + ((pin - P0_0)/32)*0x20); };

    // helper function to calculate the mask for the pin's bit in the port
    inline uint32_t mask() { return 1UL<<((pin - LPC_GPIO0_BASE) % 32); };

public:
  FastOut()
  {
  // set FIODIR bit to 1 (output)
      portdef()->FIODIR |= mask();
  }
  void write(int value)
  { 
    portdef()->FIOPIN = value*0xFFFFFFFF;
  }
  int read()
  {
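    // parenthesised: != binds tighter than &, so the mask must be applied first
    return ((portdef()->FIOPIN) & mask()) != 0;
  }
  FastOut& operator= (int value) { write(value); return *this; };
  // write() returns void, so return *this explicitly
  FastOut& operator= (FastOut& rhs) { write(rhs.read()); return *this; };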
  operator int() { return read(); };
};

And now the timing variation depending on the data is gone, thereby effectively solving my problem.

I am not sure that writing 'value*0xFFFFFFFF' to the port is the most 'beautiful' way to tackle this, but it works, and the compiler is intelligent enough to turn that command into something fast, so it probably is not doing a 32 bit multiply... :)

I like your suggestion of using the I2S bus to solve the problem; I had not thought of that; but since the quick & dirty way works here, the slow & elegant way loses out I am afraid....

Thanks for your response, and Igor, if you are reading this: thank you for your FastOut code & please feel free to point out how I have broken it. :)

Kind regards,

Arnoud.


Fred Scipione

# 18 Jun 2011

Arnoud,

1. Yes, the compiler should be smart enough to see 0xFFFFFFFF as (-1), and then collapse the code to (-value) instead of (value * -1). To be safe, you might use (-(value & 1)) instead.

2. You may want to set FIO0MASK = ~mask() before writing to the pin (a 0 bit in FIOnMASK enables access to that pin, a 1 bit protects it). Otherwise, you will run into the possibility that some other library code changed FIO0MASK before your write, and/or that your write will change other GPIO bits when you didn't want to. Set FIO0MASK = ~mask() before reading, too, if there is any possibility that other code has set the needed bit in FIO0MASK. In general, you need to be aware of, and code against, the possibility that interrupt routines, etc., can change the hardware mask register after you set it, but before you read or write the GPIO port.
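Point 1 can be checked on a host machine. The sketch below, with hypothetical function names, compares the multiply form from the modified FastOut with the (-(value & 1)) form; the two agree for 0 and 1 and differ for any other input.

```cpp
#include <cstdint>

// The form used in the modified FastOut::write().
// Only produces all-zeros or all-ones when value is exactly 0 or 1.
inline uint32_t spreadByMultiply(int value) {
    return static_cast<uint32_t>(value) * 0xFFFFFFFFu;
}

// The safer form: looks at bit 0 only, so any input gives
// either 0x00000000 or 0xFFFFFFFF.
inline uint32_t spreadByNegate(int value) {
    return 0u - static_cast<uint32_t>(value & 1);
}
```

For example, an input of 2 gives 0xFFFFFFFE with the multiply form but 0 with the negate form, so the negate form is the more robust choice if the caller might pass something other than 0 or 1.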

Fred

Yoko Hama

# 18 Jun 2011

Another way to avoid the if condition is to use a shift:

a = 2 >> cmd;

cmd being either 1 or 0.

So when cmd = 1, a = 1, which is 01; when cmd = 0, a = 2, which is 10.
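The shift trick can be written out as a few small helpers. This is a sketch with hypothetical names; it packs the two half-bits as (first << 1) | second and masks cmd to bit 0 so that other inputs cannot produce an oversized shift.

```cpp
#include <cstdint>

// Two Manchester half-bits for one data bit, without a branch:
// cmd = 1 -> 01 (binary), cmd = 0 -> 10 (binary).
inline uint32_t halfBits(uint32_t cmd) {
    return 2u >> (cmd & 1u);
}

// Half-bit that is sent first.
inline uint32_t firstHalf(uint32_t cmd) {
    return (halfBits(cmd) >> 1) & 1u;
}

// Half-bit that is sent second.
inline uint32_t secondHalf(uint32_t cmd) {
    return halfBits(cmd) & 1u;
}
```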

Jeroen Hilgers

# 19 Jun 2011

Trying to get your timing right in code means you are relying on implementation details of the compiler. If the compiler gets updated to a new revision, your project may break.

Also note that the flash in the LPC1768 is much slower than the processor. The result is that the processor will have to insert wait cycles when executing the program. See the section about 'Flash Accelerator' in the LPC1768 manual from NXP. The result is that it will be very hard to predict the exact execution time of the program. More important, if the compiler changes the location of your code in memory a little bit, the timing may change.

The best way to solve this is to use a hardware peripheral such as the I2S mentioned before. The SPI interface is probably usable as well.

Arnoud van der Wel

# 20 Jun 2011

@ Nguyen:

I like your idea; I had not thought of that, and it ties in very nicely with the desire to generate a Manchester-encoded bitstream. But the problem is not really eliminating the 'if/then' construction in the main code (there are other ways of doing that); the problem was that there was an if/then construction in the code for the bit output to the port.

Replacing

  if ( value )
  portdef()->FIOSET = mask();
  else
  portdef()->FIOCLR = mask();

by

 portdef()->FIOPIN = value*0xFFFFFFFF;

solved my 'problem' completely.

@ Jeroen

The problems you point out are more serious, and, of course, completely correct. It smells of bad programming practice to rely on compiler implementation details to get timing correct. The 'classical' answer would be: 'But I will simply not update the compiler!'. Unfortunately, the mbed paradigm makes that essentially impossible. :(

I agree that the 'correct' or 'best' way to solve this is using a hardware peripheral. However, the boundary conditions of this project are such that (a) if it breaks, it doesn't really matter, I will simply re-debug the code with an oscilloscope. (b) I want quick results, and I don't want to spend time understanding how to configure peripherals. In fact, this is precisely why I chose mbed for this project, and I have to say it has not disappointed me: I went from zero to a completely usable prototype in about two days.

So I will accept that the code is not very robust, but for now, it works, and that is sufficient.

Thank you for all your inputs!

Arnoud.

Yoko Hama

# 20 Jun 2011

Two things. First, multiplication is costly. Second, I am not sure that value * 0xFFFFFFFF gives the correct result, assuming value corresponds to the bit to set or reset. If value = 0, then value * 0xFFFFFFFF = 0 and you reset every pin on the port to 0, not just the one you are using. That affects any other I/O pins in use. If value = 1, you set every pin on the port to 1.

A better way is:

Let's say the port bit is bit 2 (...00100b), so value is assumed to be either 0b or 100b. Your mask, based on what I see, is 100b. To set the pin to the value:

port = (port & ~mask()) | value;

This is much faster than a multiply, and I suggest you precompute the inverted mask, ...11111011b (~mask()), so all you need to do is

port = (port & realmask) | value;
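The masking expression can be tried on plain integers. This is a host-side sketch with a hypothetical function name; `port` stands for the current port value, and `value` is assumed to be already shifted into the pin's position.

```cpp
#include <cstdint>

// Replace only the pin's bit in `port`, leaving every other bit untouched.
// `value` is expected to be either 0 or pinMask; it is masked again here
// so that stray bits cannot leak into other pins.
inline uint32_t writePinBit(uint32_t port, uint32_t pinMask, uint32_t value) {
    return (port & ~pinMask) | (value & pinMask);
}
```

On real hardware this is a read-modify-write sequence, so a change made to the port by an interrupt between the read and the write would be overwritten.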
