9 years, 1 month ago.

Strange mbed hanging problem

Hi experts,

I am facing a strange problem. In my project, I am using the mbed LPC1768 platform, communicating serially with a device and transmitting data over the internet through the MQTT protocol. (There are many additional functions, such as logging to SD card, control of digital peripherals etc.) The system also needs to respond to commands sent over MQTT.

The dilemma I'm facing is illustrated in the following steps:

1) Compile and build code, and flash to mbed.
2) Reset mbed. Test all functions over many hours. All ok.
3) Shut down the system at the end of the day and leave office.
4) Come back the next day and power up the mbed again (same code as previous day).
5) Everything works fine except one function: whenever it receives any message over MQTT, the mbed hangs (and is eventually reset by the Watchdog).
6) This keeps occurring in future attempts as well. Recompiling and flashing the program has no effect.
7) I go to my code and change the size of a char array (from 3000 to 3100, for example), recompile and flash it. And (usually), everything starts working fine again.
8) I test everything over many hours, then shut down the mbed and leave for the day.
9) Next day, the same thing happens again. The code which worked perfectly the previous day now has the same hanging problem.

Some additional points to note:

1) I've tried many small changes to the code, but only changing the size of one particular char array (of desired minimum size 3000) returns the program to normal.
2) When the program hangs the second time with array size 3100, I change it back to 3000 and presto, it's normal again!
3) I've tried other numbers as well (3200, 3500 etc.), so it doesn't seem to be a RAM limit issue.
4) Sometimes, I don't even have to compile new code. The .bin file of 2 days ago works fine if I reflash it, even though it caused a hang yesterday.
5) I've tried to follow all advice on the forums with respect to using rtos, i.e. use RawSerial, no mallocs and printfs in ISRs etc. (The rtos is required only for Ethernet. I'm not using multiple threads in my program.)
6) The hanging happens at a random point after receiving the MQTT message, which suggests that the trigger is not the main loop, but the ISR.

A basic structure of my program: The main loop basically handles MQTT and SD card logging. The entire serial communication is implemented as a multi callback system within the ISR.

What I fail to understand is how a perfectly functioning program develops this problem after it is powered down, and how changing the array size returns it to normal again. I would appreciate advice from anyone who might have an idea what's going on.

Thanks, Farzan

Update: This issue stopped occurring after I made some changes to the received message handler. Posting the before and after versions here. My MQTT functionality is implemented in a class MQTT_connection.

Before:

//header file
class MQTT_connection
{
public:    
    static char received_message[20];
    static int message_ready;
//other variables...
}; //class definition needs a terminating semicolon

//cpp file
char MQTT_connection::received_message[20] = "M";
int MQTT_connection::message_ready = 0;

void MQTT_connection::messageArrived(MQTT::MessageData& md) //Received message handler
{
    MQTT::Message &message = md.message;
    strcpy(received_message, (char*)message.payload);
    message_ready = 1;
}

//main
while(1) {
    if(MQTT_connection::message_ready) {
        //do something using MQTT_connection::received_message
        MQTT_connection::message_ready = 0;
    }
}

After:

//header file
class MQTT_connection
{
public:    
    static char *received_message; //changed to pointer
    static int message_ready;
//other variables...
}; //class definition needs a terminating semicolon

//cpp file
char *MQTT_connection::received_message = NULL;
int MQTT_connection::message_ready = 0;

void MQTT_connection::messageArrived(MQTT::MessageData& md) //Received message handler
{
    MQTT::Message &message = md.message;
    received_message = new char[message.payloadlen + 1]; //used dynamic memory allocation
    strcpy(received_message, (char*)message.payload);
    message_ready = 1;
}

//main
while(1) {
    if(MQTT_connection::message_ready) {
        //do something using MQTT_connection::received_message
        delete[] MQTT_connection::received_message;
        MQTT_connection::message_ready = 0;
    }
}

The typical received message length is only 6 bytes. It would be nice if someone could explain why the above issue occurred in the first version.

2 Answers

8 years ago.

I faced a similar problem in my LPC1549 project: a few times, the code did not work at all and the system hung on power-up.

9 years, 1 month ago.

Hmm, sounds like some uninitialised variable or pointer. Depending on random content in the memory or in the arriving message, the system may crash. A change in array size may move some code around (arrays should be auto-initialised to 0) and your code may run again. Can you do some sanity checks or debugging/printfs on the received message in the IRQ?

But a reset should clear the RAM, right? The received message is typically less than 20 bytes, so I don't understand how it should interfere. Incidentally, by the time the mbed hangs, it has already taken the action instructed by the message, so the message routine isn't the cause.

I put in a printf statement immediately after the received message routine. Before the hang, sometimes the whole message appeared, sometimes part of it, and sometimes nothing. This is what made me think that the problem originates in the ISR. Any other ideas for debugging strategies?

posted by Farzan Hasani 25 Sep 2015

Typically only variables in RAM will be cleared on reset. Memory allocated by malloc, or memory blocks pointed at by some pointer, is not.

It's not so much the length of the message but the content. Maybe some fields are not filled, or the length is wrong (missing \0 at the end), and that causes the problem.
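If the missing \0 is the culprit, strcpy will keep copying past the end of the payload until it happens to find a zero byte, and where that zero sits depends on whatever is in memory at the time. A copy bounded by the payload length and the destination size avoids that. The following is a minimal sketch under that assumption: copy_payload is a hypothetical helper, not part of the MQTT library, and it truncates messages that do not fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not a library function).
// Copies at most dst_size - 1 payload bytes into dst and always terminates dst,
// so it never relies on the payload containing a '\0'.
// Returns the number of payload bytes actually copied.
std::size_t copy_payload(char *dst, std::size_t dst_size,
                         const void *payload, std::size_t payloadlen)
{
    assert(dst != nullptr && dst_size > 0);
    // Leave one byte for the terminator; truncate anything longer.
    std::size_t n = (payloadlen < dst_size - 1) ? payloadlen : dst_size - 1;
    if (n > 0) {
        std::memcpy(dst, payload, n);
    }
    dst[n] = '\0';
    return n;
}
```

With the static array version of the handler, the strcpy line would become something like copy_payload(received_message, sizeof(received_message), message.payload, message.payloadlen). The same concern applies to a buffer allocated with new char[message.payloadlen + 1]: strcpy into it can still run past the end if the payload is not terminated.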

Printf is tricky when you get a crash right afterwards. The serial port is slow and has some buffers that may or may not make it to your terminal before the crash. To be safe you need to wait for some time before continuing or wait for some input condition (key or digital pin).

The only option may be to use an offline tool and debugger to trace this problem.

posted by Wim Huiskamp 25 Sep 2015

Thanks for your input, Wim. I did some further debugging and made some code changes. It seems to be working fine now. I'm not entirely sure why the problem was occurring. I'll post the before and after versions of the relevant section of my code in a separate comment, and maybe you or someone else can give some insight on what was actually wrong.

posted by Farzan Hasani 29 Sep 2015