EthernetInterface robustness testing

09 Sep 2013

Hi Adam,

Excellent, thank you! I will try this out and let you know what happens. Hopefully I'll have a chance to look at it next week.

Adam Green wrote:

S K UCI,

I have pulled in the latest network stack sources from the github repository and placed them in the following library:

Import library: EthernetInterface

Deprecated fork of old network stack source from github. Please use official library instead: https://mbed.org/users/mbed_official/code/EthernetInterface/

If this version of the network stack still gives you problems then I will debug your issues and use the results to improve the robustness of the network stack.

Thanks for the help,

Adam

09 Sep 2013

Sounds like a plan. I look forward to your feedback.

-Adam

27 Sep 2013

Hi Adam,

I plugged the new EthernetInterface into my example program from last year (http://mbed.org/users/uci1/code/testEthIntfTCPSocket/). I then updated the mbed and mbed-rtos. I then changed all calls to "wait" to "Thread::wait". This new mbed program is published at:

http://mbed.org/users/uci1/code/testNewEthIntfTCPSocket/

I then connected the mbed to my laptop, where I've set up a fake wired connection with the appropriate IPs. I then ran the python script (below) on my computer, as well as wireshark to sniff the packets. The listening script is in:

http://mbed.org/media/uploads/uci1/testtcptwisted.py.zip

Doing that, I saw the same problem as last year: quite quickly, a packet from the mbed is lost, and the mbed resends the wrong packet, causing the unsuccessful tcp handshaking to continue forever.

Note that flipping on debugging, at least at 9600 baud, slows the mbed down enough that the connection is stable. Thus there might be some non-thread-safe thing happening somewhere (possibly in what I'm doing), but I don't know where.

Here is a screenshot of the tcp packet flow from wireshark. Near the bottom, you can see that the laptop is acknowledging only 2446 bytes received, while the mbed continually re-sends the packet-after-the-lost-packet, which would bring the total to 3223. There are 345 bytes that were never received by my laptop but that aren't being re-sent by the mbed.

If there's anything you'd like me to look into, please let me know. I'm at a bit of a loss as to what to look at.

/media/uploads/uci1/tcpflow.gif

27 Sep 2013

Thanks for trying it out. I will try to reproduce and debug next week.

30 Sep 2013

S K UCI,

I have reproduced your problem with both the online compiler and GCC. For me, it always gets hung up on the first data packet. It sends two data packets (there are also SYN packets to open the socket) and the receiver never gets a good version of the first packet. One difference between my network capture and yours is that I do see the mbed attempt to resend that first packet, but it always has a bad checksum. Maybe your packet sniffing software doesn't show packets which have a bad checksum?

00:00:00.000107 IP (tos 0x0, ttl 64, id 60539, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.0.3.6666 > 192.168.0.148.49153: Flags [.], cksum 0xc6fb (correct), seq 1, ack 10, win 65535, length 0
00:00:02.750120 IP (tos 0x0, ttl 255, id 6, offset 0, flags [none], proto TCP (6), length 777)
    192.168.0.148.49153 > 192.168.0.3.6666: Flags [P.], cksum 0x1738 (incorrect -> 0x1936), seq 10:747, ack 1, win 2920, length 737
00:00:00.999865 IP (tos 0x0, ttl 255, id 7, offset 0, flags [none], proto TCP (6), length 512)
    192.168.0.148.49153 > 192.168.0.3.6666: Flags [P.], cksum 0x48fb (correct), seq 747:1219, ack 1, win 2920, length 472


It is interesting that one byte of the checksum is 2 too high and the other byte is 2 too low. I will investigate it more later today. At least I now have a repro that occurs while a debugger is attached.

Thanks for the thorough repro steps. They are a great help!

-Adam

05 Oct 2013

Thanks for looking into it, Adam!

I haven't seen any bad checksums. Wireshark shows all the packets (afaik) and all of them have the correct checksum. I wonder if we're seeing different effects. In any case, I look forward to seeing what you find out.

06 Oct 2013

S K UCI wrote:

Thanks for looking into it, Adam!

No problem. I just wish I had made more progress by now! I have spent a few hours looking at it but haven't nailed down the cause yet. I have had a few hypotheses that have proved unlikely, and I am still investigating others. I don't see anything wrong with your code that would lead to these issues directly. I suspect that it is just your call pattern/timing that causes a bug in the networking stack to raise its ugly head.

S K UCI wrote:

I wonder if we're seeing different effects.

I also believe that is the case. During my investigation, I have probably rerun your sample about 20 times under the debugger. Most times I have seen the bad checksum issue in my traces but at least twice I have seen the exact same segment loss that your traces show.

It may be a few weeks before I get back to looking at this bug as I have a few other things I need to attend to over the next couple of weeks.

-Adam

07 Oct 2013

I determined the cause of the checksum error that I was constantly hitting in my environment (it was a checksumming bug down in lwIP). Now that that bug is fixed, I am seeing the dropped segment issue that you originally reported. It will probably be a week or two before I get to that issue, though.

-Adam

20 Oct 2013

I started looking into this issue again last night now that I am back from Disney World :) I think I have narrowed down where the TCP segment is getting dropped.

I plan to look at it more this week.

-Adam

26 Oct 2013

S K UCI wrote:

Thanks for looking into it, Adam!

I am sorry that it took me so long to get to the bottom of the issues that you hit but I now believe that I have tracked down the two problems that your great stress test was encountering. There was a TCP checksumming bug that showed up when your TCP segments were composed of multiple buffers, some of which were odd in length. The other bug was encountered when the lwIP networking stack would attempt to send TCP segments which were composed of more than 2 buffers. The ethernet driver would fail to send such segments.
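To illustrate the first kind of bug, here is a minimal sketch of an Internet checksum computed over a chain of buffers. It is not the lwIP code or the actual fix: `struct seg_buf`, `sum_one` and `chain_checksum` are hypothetical names made up for this example. The point it shows is that after an odd-length buffer, the next buffer starts on an odd byte offset, so its partial sum has to be byte-swapped before it is added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a chained packet buffer (NOT lwIP's struct pbuf). */
struct seg_buf {
    const uint8_t *data;
    size_t len;
    const struct seg_buf *next;
};

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's-complement sum of one buffer, treating its first byte as a high byte. */
static uint16_t sum_one(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;   /* pad the odd trailing byte with zero */
    return fold(sum);
}

/* Checksum of the whole chain, as if the buffers were one contiguous block. */
uint16_t chain_checksum(const struct seg_buf *b)
{
    uint32_t acc = 0;
    int odd_offset = 0;   /* non-zero when the current buffer starts on an odd byte */

    for (; b != NULL; b = b->next) {
        uint16_t part;
        assert(b->len == 0 || b->data != NULL);
        part = sum_one(b->data, b->len);
        if (odd_offset)   /* bytes are shifted by one, so swap the two halves */
            part = (uint16_t)((part << 8) | (part >> 8));
        acc = fold(acc + part);
        if (b->len & 1)
            odd_offset = !odd_offset;
    }
    return (uint16_t)(~acc & 0xFFFFu);
}
```

Splitting the same bytes into chained buffers of 3, 1 and 3 bytes should give the same result as checksumming them as one 7-byte buffer. Leaving out the swap makes the two results differ.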

I have updated my fork of the networking library. You should be able to pull the updated version into your awesome test and see if you also see successful results.

Import library: EthernetInterface

Deprecated fork of old network stack source from github. Please use official library instead: https://mbed.org/users/mbed_official/code/EthernetInterface/

I will test these changes a bit more next week and issue a pull request to the main github repo if everything looks good.

I hope that helps,

Adam

05 Nov 2013

Thanks, Adam, I will check it out as soon as I get a chance!

05 Nov 2013

Thanks, I appreciate it!

My github pull request with these fixes has been accepted. The fixes should show up in the official mbed library soon.

-Adam

15 Nov 2013

Hello!

After a long time I retested my Ethernet interface on mbed. My programs used to crash after some hours or a few days; see my posting above: http://mbed.org/forum/mbed/post/22014/

Now I have updated all libraries (mbed, rtos, Ethernet...) and my program has been running for 5 days without any problem. I used the xively-Jumpstart-Demo, removed all sensor and display libraries, and only increment a counter and send it to Xively every 10 seconds.

Works perfectly!

You are my heroes!

I'll give an update after an even longer test-period or after tests with my real-world-program.

Thanks a lot!

Charly

15 Nov 2013

Thanks to Adam for the great work.

16 Nov 2013

Karl Zweimüller wrote:

Now I have updated all libraries (mbed, rtos, Ethernet...) and my program has been running for 5 days without any problem. I used the xively-Jumpstart-Demo, removed all sensor and display libraries, and only increment a counter and send it to Xively every 10 seconds.

This is awesome news! I am happy to hear that things appear to be getting better :)

Karl Zweimüller wrote:

I'll give an update after an even longer test-period or after tests with my real-world-program.

Best of luck! I look forward to hearing your results. I appreciate your testing efforts!

16 Nov 2013

HM Yoong wrote:

Thanks to Adam for the great work.

And to Pablo Gindel and SK UCI for the work they did to test the networking layer, find issues, and provide me with the steps necessary to reproduce and debug. I really appreciate people taking the time to give me what I need to track down these issues!

25 Nov 2013

Hello,

We found that we cannot open more than 3 UDP sockets: the 4th open() fails. It does not seem to be a memory problem, since there is still memory available. I have already asked questions about this issue and started a discussion, but nobody ever replied.

It would be helpful if someone could at least confirm this behaviour. Since my project schedule is becoming tight, I'm looking for a way to solve this issue, and I would like to ask if anyone can help me here:
- Should I attempt to investigate the problem and perhaps modify the lwIP stack?
- Or should I switch to another development environment (code red...) for my mbed module (LPC1768)?

Thank you, Matthias

25 Nov 2013

Matthias,

You could try opening up EthernetInterface/lwip/lwipopts.h in your networking application and adding this line to the header file (I added it just after the MEMP_NUM_TCP_PCB definition line):

#define MEMP_NUM_UDP_PCB            5

lwIP defaults to 4 UDP sockets (known as PCBs inside of lwIP). You might be hitting a limit of 3 user sockets since something else like DNS is using up the 4th entry. Bumping it up to 5 should allow you to open your 4th UDP connection.

I hope that helps,

Adam

26 Nov 2013

Hi Adam,

thank you, this helps, I can now open 4 UDP sockets.

I was curious whether I could use even more sockets, but setting MEMP_NUM_UDP_PCB to 6 still gives me 4 sockets, not more. So I tried setting LWIP_DNS=0 and LWIP_DHCP=0, and now I get compiler errors. I was able to work around the DHCP compiler errors, but the DNS code seems to be more entangled.

I'm not sure whether setting LWIP_DHCP=0 saves any memory or sockets, but since it is an option, I think it should be possible to disable DHCP without compiler errors. I think this can be done with some #if statements wrapping the calls to dhcp_start(), dhcp_release() and dhcp_stop() in EthernetInterface.cpp. This way of developing code is new to me: can I simply change the code and then publish the file, or is someone else in charge of these things? I mean, is there any kind of code review before a change becomes public?
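A rough sketch of the kind of guard meant here, written so that it compiles without lwIP. Everything in it is an assumption for illustration: `stub_dhcp_start` stands in for the real DHCP call, and `iface_bring_up` is a hypothetical routine, not the code in EthernetInterface.cpp.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: a real build takes LWIP_DHCP from lwipopts.h; it is defined
   here only so that the sketch compiles by itself. */
#ifndef LWIP_DHCP
#define LWIP_DHCP 0
#endif

int dhcp_started = 0;   /* how many times the stand-in DHCP client was started */

#if LWIP_DHCP
/* Hypothetical stand-in for the stack's DHCP start call. */
static void stub_dhcp_start(void) { dhcp_started++; }
#endif

/* Hypothetical bring-up routine; returns true when the interface is usable. */
bool iface_bring_up(bool want_dhcp)
{
    if (want_dhcp) {
#if LWIP_DHCP
        stub_dhcp_start();
#else
        return false;   /* DHCP was requested but is compiled out */
#endif
    }
    /* Invariant: the DHCP path never runs when it is compiled out. */
    assert(LWIP_DHCP || dhcp_started == 0);
    return true;
}
```

With LWIP_DHCP set to 0 the DHCP call disappears from the build, and a caller that still asks for DHCP gets a failure at run time instead of a compiler error.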

I'm not sure about the side effects of setting LWIP_DNS=0, because it affects Endpoint::set_address. Do you think it makes sense to have the option to disable DNS?

regards, Matthias

26 Nov 2013

Matthias,

I don't know why you can only open 4 sockets when you set the number of UDP PCB objects to 6. Maybe it is running out of some other resource. If I had more time, I would put together a sample that does this and step through the failing open but I am swamped right now with trying to finish up some projects before the holidays. If you can publish such an example and send me a link then I could try taking a quick look see. How many UDP sockets does your application require?

On disabling DHCP and DNS, if it hurts then I wouldn't do it :) I currently have no reason to suspect that disabling those would even help you with your current issue anyway.

You can create a fork of a library with your changes and publish them. If you allow this published fork to be listed publicly then others can see it when they search for libraries, but the original from the mbed team will tend to get priority. For your changes to make it into the official version, you would have to issue a pull request. Such a pull request will be code reviewed by the mbed team before they merge it in.

I hope that helps,

Adam

04 Feb 2014

Adam Green wrote:

My github pull request with these fixes has been accepted. The fixes should show up in the official mbed library soon.

I see that the official EthernetInterface library now has all of the fixes that I made last year to fix the issues that people had reported with regards to networking stability. Hopefully people now find the official network stack to be more robust.

Import library: EthernetInterface

mbed IP library over Ethernet

-Adam