EthernetInterface robustness testing

30 Mar 2013

Hi

Following my question here, this is a thread for contributing to solving the problem of the EthernetInterface locking up after a few hours of running.

Without solving this problem, there is no point trying to use the new stack for a continuously running application, even though you might be able to make a whizzy demo with it!

Why investigate this? I have a few thousand hours of uptime with my fork (Segundo) of the old NetServices running alongside scmRTOS. I thought it would be nice to use the "official" EthernetInterface and RTOS instead, but got a lock up after a couple of hours with the most basic of code.

This investigation will focus on using offline compilation and debugging to find out what is happening.

I've updated my demo code which locks up to the latest libraries:

Import program: TCPSocket_HelloWorldTest

Test to demonstrate that TCP sockets lock up

Thanks Daniel

30 Mar 2013

While trying to get a DM9161 PHY working with mbed rtos+lwip, I found a new lwip topic (bug) on the lpcware community - see http://mbed.org/forum/bugs-suggestions/topic/4351/

For now I will stop testing and try scmRTOS + NetServices solution. There is no direct link to the NetServices on your repository page (found a link on http://mbed.org/forum/bugs-suggestions/topic/3403/). In order to allow me to get up and running faster, do you have demo code to share for scmRTOS + NetServices?

30 Mar 2013

Daniel, I downloaded your code and I will take a look at building it offline later today. Have you only modified the main.cpp from the original TCPSocket_HelloWorldTest sample?

31 Mar 2013

I was able to build the code with GCC and have your sample now running on my mbed device. I have linked in minimal debugger support so that I can get a call stack if it should crash. I would upload a zip of my version of your code ready to be built with GCC but I haven't been able to upload anything from OS X/Safari to the mbed website for months. If you PM me with your e-mail address, I can send you a copy of the updated sources and a pre-built .bin/.elf

Thanks,

Adam

31 Mar 2013

Just an update: I have had the sample running for 7.5 hours on my mbed and haven't hit a lockup yet. I will let it run overnight on my board.

01 Apr 2013

I just stopped the run on my device after 24 hours of running.

Sending at 86515 seconds
Received: 255 0 0
Received: 84 255 0
Received: 0 339 0


The issue could be related to particular network traffic that isn't occurring on my LAN.

01 Apr 2013

Hi Adam

Using your .bin, I got the lock up after 7 hours, which is typical for the tests I have done previously.

Sending at 26606 seconds
Received: 255 0 0
Received: 84 255 0
Received: 0 339 0
Sending at 26607 seconds

As there is no crash message, it is as you suggested probably hung in a loop and not crashed.

Is it possible to break in to the code to see what that loop is? Any other suggestions?

Thanks
Daniel

01 Apr 2013

Unfortunately due to the way that I did that build I sent you, there is no easy way to break in and see what is going on.

Later today I will send you another build in which you should be able to attach GDB after the hang occurs and it should break in automatically. Sorry that I didn't send you that version originally but I was hoping that you were hitting something like a Hard Fault and the current version would have worked a treat.

-Adam

01 Apr 2013

Frank Vannieuwkerke wrote:

For now I will stop testing and try scmRTOS + NetServices solution. There is no direct link to the NetServices on your repository page (found a link on http://mbed.org/forum/bugs-suggestions/topic/3403/). In order to allow me to get up and running faster, do you have demo code to share for scmRTOS + NetServices?

Hi Frank

I got my code working with scmRTOS and NetServices when other options weren't available. However, it still sits on top of lwip and uses the Ethernet class for the device driver. I'm trying to move away from it as it is unsupported (even by me now) and with a quick test I could not get my code to run when upgrading to the latest mbed library.

Regards
Daniel

01 Apr 2013

Daniel Peter wrote:

Frank Vannieuwkerke wrote:

For now I will stop testing and try scmRTOS + NetServices solution. There is no direct link to the NetServices on your repository page (found a link on http://mbed.org/forum/bugs-suggestions/topic/3403/). In order to allow me to get up and running faster, do you have demo code to share for scmRTOS + NetServices?

Hi Frank

I got my code working with scmRTOS and NetServices when other options weren't available. However, it still sits on top of lwip and uses the Ethernet class for the device driver. I'm trying to move away from it as it is unsupported (even by me now) and with a quick test I could not get my code to run when upgrading to the latest mbed library.

Regards
Daniel

Darn, back to square one (however, it's better to use the most recent code for new development). I will continue with mbed rtos+lwip.

Have you read my lwip bug topic at http://mbed.org/forum/bugs-suggestions/topic/4351/? Do you know whether it is possible to lower the EMAC/PHY transfer speed other than by changing the RMII management clock rate (perhaps a noob question - I'm still learning a lot about lwip)?

01 Apr 2013

Daniel,

I have e-mailed you another build for you to try. I tried to only include the files which I modified since the last build in this archive to make it smaller. So that there wouldn't be contention on UART0, I did comment out the printf() calls from main(). Hopefully those aren't related to the hang. LED1 still blinks when it is running though. If the LED should stop blinking, you only need to connect GDB at 115200 baud and it should break into the running program.

If you want to keep the printf()s let me know. I have two solutions for that. One is to redirect them to GDB over the debugger connection. I would be worried that this approach would throw off the timing and make the bug disappear. The other solution is to connect GDB to one of the other UARTs on the mbed device, but you would need something like an FTDI USB to TTL serial cable for that to work, and it's a bit harder for me to test here before I send it to you since the mbed I have wired up with ethernet is in a prototyping box and it's not easy to get to those pins.

-Adam

01 Apr 2013

Hi Adam

Thanks very much, I've set it running which will be overnight for me. The printf()s are not essential; they were mainly to see what the elapsed time was before a hang (as I was leaving it unattended).

Regards
Daniel

02 Apr 2013

Hi Adam

It had stopped this morning, so I attempted to attach gdb. The result (as far as I got) is as follows:

D:\Users\Daniel\Desktop\mbed\TCPSocket_HelloWorld>D:\Users\Daniel\Desktop\mbed\adamgreen-gcc4mbed-8234d7c\adamgreen-gcc4mbed-8234d7c\gcc-arm-none-eabi\bin\arm-none-eabi-gdb TCPSocket_HelloWorld.elf --baud 115200 -ex "set target-charset ASCII" -ex "set remotelogfile mri.log" -ex "target remote com3"
GNU gdb (GNU Tools for ARM Embedded Processors) 7.4.1.20121207-cvs
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=i586-mingw32 --target=arm-none-eabi".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from D:\Users\Daniel\Desktop\mbed\TCPSocket_HelloWorld/TCPSocket_HelloWorld.elf...done.
Remote debugging using com3
Ignoring packet error, continuing...
warning: unrecognized item "timeout" in "qSupported" response
Ignoring packet error, continuing...
Ignoring packet error, continuing...
Ignoring packet error, continuing...
Ignoring packet error, continuing...
Ignoring packet error, continuing...
Ignoring packet error, continuing...
Malformed response to offset query, timeout
(gdb) set pagination off
(gdb) set logging on
Copying output to gdb.txt.
(gdb) bt
No stack.
(gdb) list
4       #define RETRIES_ALLOWED 5
5
6       //Serial pc(USBTX, USBRX);
7       EthernetInterface eth;
8       DigitalOut led1(LED1);
9
10      #define printf (void)
11
12      int main()
13      {
(gdb) disass
No frame selected.
(gdb)


The mri.log file says:

w +$qSupported:qRelocInsn+#9a
r <Timeout: 2 seconds>
w $qSupported:qRelocInsn+#9a
r <Timeout: 2 seconds>
w $qSupported:qRelocInsn+#9a
r <Timeout: 2 seconds>
w $qSupported:qRelocInsn+#9a
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+$Hg0#df
r <Timeout: 2 seconds>
w $Hg0#df
r <Timeout: 2 seconds>
w $Hg0#df
r <Timeout: 2 seconds>
w $Hg0#df
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+$?#3f
r <Timeout: 2 seconds>
w $?#3f
r <Timeout: 2 seconds>
w $?#3f
r <Timeout: 2 seconds>
w $?#3f
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+$Hc-1#09
r <Timeout: 2 seconds>
w $Hc-1#09
r <Timeout: 2 seconds>
w $Hc-1#09
r <Timeout: 2 seconds>
w $Hc-1#09
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+$qC#b4
r <Timeout: 2 seconds>
w $qC#b4
r <Timeout: 2 seconds>
w $qC#b4
r <Timeout: 2 seconds>
w $qC#b4
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+$qAttached#8f
r <Timeout: 2 seconds>
w $qAttached#8f
r <Timeout: 2 seconds>
w $qAttached#8f
r <Timeout: 2 seconds>
w $qAttached#8f
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+$qOffsets#4b
r <Timeout: 2 seconds>
w $qOffsets#4b
r <Timeout: 2 seconds>
w $qOffsets#4b
r <Timeout: 2 seconds>
w $qOffsets#4b
r <Timeout: 2 seconds><Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -
r <Timeout: 2 seconds>
w -+
End of log

Regards Daniel

02 Apr 2013

That is very unfortunate! The device is completely unresponsive! I will look through the code again later today to see if there is anything in that sample which could obviously lead to such a state.

JTAG may be helpful here! Samuel Mokrani from the mbed team has done some great work in this area recently. You may want to look at http://mbed.org/handbook/CMSIS-DAP-MDK since it discusses the latest mbed firmware which supports JTAG debugging. This page, https://github.com/mbedmicro/mbed/tree/master/workspace_tools/debugger, discusses how to use some Python code to allow GDB to utilize the new JTAG enabled firmware. The docs indicate that it doesn't work on OS X so that is one of the reasons I haven't tried it out yet myself. Of course the real reason is that I have been busy with other projects and don't have time to play with it properly.

-Adam

02 Apr 2013

Adam Green wrote:

I will look through the code again later today to see if there is anything in that sample which could obviously lead to such a state.

Hi Adam

The problem with debugging network problems is the complexity of the stack; a device driver below lwip below a socket implementation below a (simple) application, all running in an RTOS. I attempted to use non-blocking sockets to stop a lock up of a thread, but that non-blocking code relies on the lwip timers updating correctly which relies on the RTOS. So lots of things that could go wrong. If the lwip timers fail (perhaps because of an RTOS issue), then the non-blocking code in the socket implementation becomes an infinite loop instead (from which the RTOS won't escape). That's without pointing the finger at the device driver level...
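To illustrate the failure mode described above, here is a minimal host-side sketch of a wait loop that cannot spin forever even when its time source stops advancing. It is not taken from the mbed Socket library or lwIP; `wait_bounded`, `WaitResult`, and the two callbacks are hypothetical names, and the poll budget is an assumed extra safeguard layered on top of the usual timeout.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Outcome of a bounded wait. All names in this sketch are hypothetical.
enum class WaitResult { Ready, TimedOut, ClockStalled };

// Polls `ready` until it returns true, the timeout elapses, or the poll
// budget runs out. The budget is the safety net: it ends the loop even if
// `now_ms` stops advancing, which a timeout check alone cannot do.
inline WaitResult wait_bounded(const std::function<bool()>& ready,
                               const std::function<std::uint32_t()>& now_ms,
                               std::uint32_t timeout_ms,
                               std::uint32_t max_polls)
{
    assert(max_polls > 0);
    const std::uint32_t start = now_ms();
    for (std::uint32_t i = 0; i < max_polls; ++i) {
        if (ready()) {
            return WaitResult::Ready;
        }
        // Unsigned subtraction stays correct across counter wrap-around.
        if (now_ms() - start >= timeout_ms) {
            return WaitResult::TimedOut;
        }
    }
    // The clock never reached the deadline within the poll budget.
    return WaitResult::ClockStalled;
}
```

On a real target each poll would normally yield or sleep so that other threads can run, and the budget would have to be tuned to the polling interval. A `ClockStalled` result would then point at the timer or RTOS layer rather than at the socket code.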

I will have a play with the JTAG debugging, which could be of benefit for the real coding I want to do once the stack is working!

Thanks
Daniel

12 Apr 2013

Looks like a problem with allocating too much memory to incoming connections and running into overlap issues...

27 May 2013

Hi Peter,

I too am suffering from the lock-up problem. Using your test program it locks up after a few hours.

I tried using the LEDs to show where it was locking up. The results were inconsistent.

I tried adding a second thread to flash an LED. In all lock-up cases the second thread halted.

I then added code to trigger a watchdog timer reset if a lock-up occurred. This was successful. I have now had the program running continuously for over 24 hours (albeit with quite a few restarts). The restarts are a pain, but in mitigation, it is no worse than what would happen if there was a power cut.

Here is the modified test program:

Import program: TCPSocket_HelloWorldTest

Enhanced test program using watchdog timer to recover from lock-up
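For readers who want the general shape of the watchdog approach without importing the program, a rough sketch of the decision logic follows. It is not the code from the program above: `ProgressGate` and its methods are hypothetical names, and the actual feeding of the hardware watchdog is left to the caller because it is target-specific.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: decides whether the hardware watchdog may be fed.
class ProgressGate {
public:
    explicit ProgressGate(std::uint32_t max_idle_checks)
        : max_idle_checks_(max_idle_checks) {}

    // Called by the monitored code, e.g. after each send/receive cycle.
    void note_progress() { ++progress_; }

    // Called periodically by the supervising code. Returns true while the
    // monitored code has advanced recently. Once it returns false the
    // caller stops feeding, and the hardware watchdog resets the device.
    bool may_feed()
    {
        const std::uint32_t current = progress_.load();
        if (current != last_seen_) {
            last_seen_ = current;
            idle_checks_ = 0;
            return true;
        }
        // Saturate so a long stall cannot wrap the counter back to zero.
        if (idle_checks_ <= max_idle_checks_) {
            ++idle_checks_;
        }
        return idle_checks_ <= max_idle_checks_;
    }

private:
    std::atomic<std::uint32_t> progress_{0};  // written by the monitored thread
    std::uint32_t last_seen_ = 0;             // supervisor-side state
    std::uint32_t idle_checks_ = 0;
    const std::uint32_t max_idle_checks_;
};
```

The supervising loop would call `may_feed()` at a fixed interval and feed the watchdog only when it returns true. If the whole RTOS stalls, as in the second-thread experiment, the supervisor stops running too and the watchdog expires without any help from this logic.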

Paul

28 May 2013

Hi All,

I used the watchdog library and have my setup running for days now as well with some occasional restarts. But like Paul mentioned, any application should be able to recover after a power cut as well.

Guido

28 May 2013

Hi all!

I have been having problems with the EthernetInterface for a long time now. I have some programs which send environmental data to Xively (previously named Cosm, and Pachube before that). I've tested with different protocols (HTTP, WebSocket) and different libraries (it hangs even with the new Xively library), but the transfer always crashes after some hours. I added a watchdog to reset the mbed, but even this doesn't always work. Sometimes it hangs or doesn't get a new connection.

So I tested with the TCPSocket_HelloWorldTest from Paul (but without the watchdog). It ran once for about 24 hours and once for 4 hours, then stopped with LED4 on and all other LEDs off. The last message was:

Sending (10335) at 14469 seconds
Received: 255 0 0
Received: 83 255 0

As my program seems to have problems reconnecting after a crash, the watchdog is not a good solution here, and it would be very helpful if there were a proper fix. Or could it be something with my setup?

Thanks - Charly

01 Jun 2013

Hi guys

Thanks for all the feedback. At least it confirms it is not just me having problems. I don't want to go the watchdog route - in fact I went to the trouble of making a DC UPS to avoid power outage resets (as my mbed is running other real time tasks).

What I'd rather see happen is some real effort from the mbed team in debugging this. I've drawn a blank despite Adam's help.

Perhaps Emilio can advise where things are up to with the robustness testing that I've previously queried? I would have thought this was a priority as the mbed is being peddled as an internet of all things device.

Thanks Daniel

03 Jun 2013

Hi Daniel,

Daniel Peter wrote:

Perhaps Emilio can advise where things are up to with the robustness testing that I've previously queried?

We do not have anyone working on it at the moment.

Until now we have just integrated the following components:

and we have developed a high-level C++ Socket API.

We are not actively developing the components integrated from external sources (lwIP and the Ethernet driver).

We mainly cared about defining a high-level Socket API that allows the mbed community to write applications independently of the TCP/IP stack underneath it.

Now third parties can provide different TCP/IP stack implementations (perhaps more robust than lwIP 1.4 with the NXP Ethernet driver) and the mbed community can start using them without changing a single line of code in their applications.

A first good example of such a TCP/IP stack is PicoTCP developed by TASS.

We will still keep lwIP as the official TCP/IP stack, instead of PicoTCP, because our policy is to favour permissive open source licenses (Apache, BSD, MIT, etc) over the GPL open source license.

HTH, Emilio

29 Jun 2013

I've had a play with PicoTcp and posted a question on the results here:
https://mbed.org/questions/1274/PicoTCP-HelloWorld/

I'd appreciate it if others interested in a working TCP stack could see if they can reproduce my result.

Thanks
Daniel

05 Jul 2013

Hi, does anybody know if this problem is already fixed? Because we're having a similar problem, but in our case the EthernetInterface locks up after a few seconds. We've tracked the bug to this function:

EthernetInterface/lwip-eth/arch/lpc17_emac.c :: static struct pbuf *lpc_low_level_input(struct netif *netif)

It seems like it locks up in some part of this function, we're still trying to figure out in which part.
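One low-overhead way to narrow down where a function stalls is to leave breadcrumbs, in the spirit of the LED experiment earlier in this thread. The sketch below is a generic illustration, not code from lpc17_emac.c; `checkpoint`, `g_last_checkpoint`, and `led_pattern` are hypothetical names, and mapping the result onto the four on-board LEDs is left to the caller.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Last checkpoint reached. Volatile so every write really happens and the
// value can be read back from a debugger after a hang.
volatile std::uint8_t g_last_checkpoint = 0;

// Call with a distinct id at each point of interest in the suspect function.
inline void checkpoint(std::uint8_t id)
{
    g_last_checkpoint = id;
}

// Splits the low four bits of `id` into four on/off states
// (index 0 is the least significant bit, e.g. LED1).
inline std::array<bool, 4> led_pattern(std::uint8_t id)
{
    std::array<bool, 4> leds{};
    for (std::size_t i = 0; i < leds.size(); ++i) {
        leds[i] = ((id >> i) & 1u) != 0;
    }
    return leds;
}
```

With ids 1 to 15 placed through the function, the pattern left on the LEDs after a hang gives the last checkpoint that was passed, which narrows the search to the code between that checkpoint and the next one.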

It would be very helpful if someone could tell if our problem is related to this issue.

Thanks.

19 Jul 2013

Hi everybody

I've started testing the PicoTCP stack and hit an issue they can't reproduce (perhaps down to my network).

Please can you visit https://mbed.org/users/daniele/code/PicoTCP/issues/1 and grab the latest test code http://mbed.org/users/mbed714/code/TCPSocket_HelloWorldTest_PicoTCP/ to see if you can reproduce it?

You might see something like this if it fails:

Sending at 258 seconds
Received: 255 0 0
Received: 82 255 0
Received: -1 337 0
Sending at 261 seconds
Received: -1 0 0
Received: -1 0 1
Received: -1 0 2
Received: -1 0 3
Received: -1 0 4
Received: -1 0 5

And if you leave it long enough, it might stop looping altogether (lock up). Please report on the issues page.

Thanks
Daniel

08 Aug 2013

Daniel, it sounds like you are making progress with the PicoTCP stack. I was wondering if you are still interested in trying to get to the bottom of the lwIP based stack? If so I have time to look at this over the next couple of months and would love to investigate it more.

Adam Green wrote:

That is very unfortunate! The device is completely unresponsive! I will look through the code again later today to see if there is anything in that sample which could obviously lead to such a state.

I looked through the disassembly of the sample that you and I were last experimenting with and while I found some other nasty issues, I didn't find anything that should cause the debugger to fail attaching. If you are still interested in looking into this issue on your machine, it would be good to go back to that last build I sent you and make sure that you can connect GDB when the code is running successfully to just rule out any simple configuration issues.

On a related note, I have had the last experiment we tried running on my mbed here. It has been running for about 22 hours so far. I am going to continue letting it run tonight as well. I thought maybe it just wasn't getting enough traffic on my internal network, which runs through a switched router, so I actually moved it into the DMZ so that it would be getting hit with more packets from the Internet directly, but it is still running like a champ. Edit: This code ran here in the DMZ for 36 hours without hanging or crashing. Why does nothing fail for the person who wants it to? Must be Murphy and his law mocking me :)

Let me know what you think.

-Adam

08 Aug 2013

Hi Adam

I'm in the opposite position. I have little spare time and at this point would simply like a working stack to develop on top of. It looks to me like PicoTCP will get there a lot sooner.

Regards Daniel

08 Aug 2013

I wish I could disagree with you but I can't :) Good luck with your project!

Maybe someone else on this thread who has had problems with the mbed networking stack can share with me their scenarios/code so that I can attempt to root cause and correct the problem(s)?

When people hit these problems, I take it that just hitting the reset button is able to restart the program? They don't need to resort to power cycling the device?

-Adam

06 Sep 2013

Hi Adam,

I had problems when I tried using the RTOS+EthernetInterface last year, and they seem similar. I've described them in this thread (with packet sniffing logs):

http://mbed.org/forum/bugs-suggestions/topic/3954/

I wound up switching back to NetServicesMin, which was much slower but ran reliably.

I plan to try looking into EthernetInterface again soon. If there are any tests you'd like me to do, or if you have any advice on where I might start my investigations, I'm all ears.

07 Sep 2013

Sorry, I didn't respond earlier. I just looked at the problem you described last year. I never hit that particular problem in my testing but I did hit other robustness issues which could lead to that type of problem: lack of thread safety, buffer overruns, etc. I have been making fixes to the networking library over the last month and I believe that they have now all made their way into the github repository but not to the online compiler here yet. I will try to create a fork over the weekend which has these most recent updates and share it with you to see if they might help you with your issues.

07 Sep 2013

S K UCI,

I have pulled in the latest network stack sources from the github repository and placed them in the following library:

Import library: EthernetInterface

Deprecated fork of old network stack source from github. Please use official library instead: https://mbed.org/users/mbed_official/code/EthernetInterface/

If this version of the network stack still gives you problems then I will debug your issues and use the results to improve the robustness of the network stack.

Thanks for the help,

Adam