lwIP unexpected behaviour regarding allocation of NETBUF

23 Oct 2019

Hi,

While testing an application with heavy network traffic (very frequent incoming and outgoing UDP frames) on a K64F with mbed-5.13.4, I ran into the limit on the number of NETBUFs (8 by default). It showed up as NSAPI_ERROR_PARAMETER returned by UDPSocket::sendto(). A brief analysis showed that netbuf_new(), which is called inside, returns NULL because no free NETBUFs are left. Increasing the default value seemed like the obvious fix, but it did not help.

Even with a higher number of NETBUFs (16, 32, ...) set in the lwIP config file (MEMP_NUM_NETBUF), netbuf_new() still returned NULL while some netbufs were free (only 4 of 16 used). Diagnostic printfs placed in memp_alloc() and memp_free() showed that at some point during runtime the *memp->next value, which the (de)allocator uses to locate the next free netbuf, suddenly points somewhere into lwip_ram_heap. The element taken from lwip_ram_heap during allocation has its next pointer set to NULL, so the following netbuf_new() call returns NULL.
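The failure mode described above can be reproduced on a host with a small stand-alone model of a pool free list. This is only a sketch: the names pool_init(), pool_alloc(), pool_free() and the stray element are hypothetical stand-ins, not lwIP code, and the stray write is injected on purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pool element: while an element is free,
 * its first word links it to the next free element. */
struct elem {
    struct elem *next;
    unsigned char payload[12];
};

#define POOL_SIZE 16

static struct elem pool[POOL_SIZE];
static struct elem stray;        /* memory outside the pool, e.g. in a heap */
static struct elem *free_head;

/* Chain all elements; element 0 ends the list, the last one is the head. */
void pool_init(void)
{
    free_head = NULL;
    for (int i = 0; i < POOL_SIZE; i++) {
        pool[i].next = free_head;
        free_head = &pool[i];
    }
    stray.next = NULL;
}

/* Pop the head of the free list; NULL means "nothing left". */
struct elem *pool_alloc(void)
{
    struct elem *e = free_head;
    if (e != NULL) {
        free_head = e->next;
    }
    return e;
}

/* Push an element back onto the free list. */
void pool_free(struct elem *e)
{
    assert(e != NULL);
    e->next = free_head;
    free_head = e;
}

/* Overwrite the head's next pointer with an address outside the pool,
 * then count how many allocations succeed before NULL is returned. */
int allocs_after_corruption(void)
{
    int n = 0;
    pool_init();
    free_head->next = &stray;    /* the injected stray write */
    while (pool_alloc() != NULL) {
        n++;
    }
    return n;
}
```

In this model allocs_after_corruption() returns 2: the head element, then the stray element whose next pointer is NULL. The other 15 pool elements are still free but no longer reachable, which matches the "4 of 16 used" symptom in spirit.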

Is it correct for the *memp->next pointer to change to memory outside the memp_netbufs list of available elements? It looks unexpected to me.

After another brief debugging session focused on memp_alloc() and memp_free(), it looks like neither function makes this change: the wrong value is already there when they are called, so some other code must have written it. Unfortunately, conditional hardware watchpoints do not work and software ones are too slow to catch the culprit in reasonable time, so I thought about another approach: using the MPU to make memp->next read-only for the netbufs list on exit from these functions. I have not had time to try it yet. If you have another idea, or think that using the MPU here is overkill or not feasible, please share it with me. Many thanks!
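A cheaper software check that might narrow down when the corruption happens is to validate every free-list pointer against the pool bounds wherever it is read or written. The helper below is a sketch under that assumption; pool_ptr_is_valid() is a hypothetical name, not an lwIP function, and it assumes the pool is one contiguous array of equally sized elements.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if p is NULL (end of list) or points exactly at the start
 * of one of the `count` elements of `elem_size` bytes beginning at `base`. */
bool pool_ptr_is_valid(const void *base, size_t elem_size, size_t count,
                       const void *p)
{
    uintptr_t b = (uintptr_t)base;
    uintptr_t a = (uintptr_t)p;
    uintptr_t off;

    if (p == NULL) {
        return true;
    }
    if (elem_size == 0 || a < b) {
        return false;
    }
    off = a - b;
    /* Inside the pool and aligned to an element boundary. */
    return off < elem_size * count && off % elem_size == 0;
}
```

With the values from the log below (base 0x2001caa0, 16-byte elements, 16 elements), 0x2001cb00 passes the check and 0x20015d54 fails it. lwIP also has its own MEMP_SANITY_CHECK and MEMP_OVERFLOW_CHECK options, which may be worth enabling alongside such a check.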

Here is the brief log I collected during testing :

memp_init_pool() for MEMP_NETBUF with 16 netbufs set :

MEMP_NETBUF[0] 0x2001caa0 (next 0x0)
MEMP_NETBUF[1] 0x2001cab0 (next 0x2001caa0)
MEMP_NETBUF[2] 0x2001cac0 (next 0x2001cab0)
MEMP_NETBUF[3] 0x2001cad0 (next 0x2001cac0)
MEMP_NETBUF[4] 0x2001cae0 (next 0x2001cad0)
MEMP_NETBUF[5] 0x2001caf0 (next 0x2001cae0)
MEMP_NETBUF[6] 0x2001cb00 (next 0x2001caf0)
MEMP_NETBUF[7] 0x2001cb10 (next 0x2001cb00)
MEMP_NETBUF[8] 0x2001cb20 (next 0x2001cb10)
MEMP_NETBUF[9] 0x2001cb30 (next 0x2001cb20)
MEMP_NETBUF[10] 0x2001cb40 (next 0x2001cb30)
MEMP_NETBUF[11] 0x2001cb50 (next 0x2001cb40)
MEMP_NETBUF[12] 0x2001cb60 (next 0x2001cb50)
MEMP_NETBUF[13] 0x2001cb70 (next 0x2001cb60)
MEMP_NETBUF[14] 0x2001cb80 (next 0x2001cb70)
MEMP_NETBUF[15] 0x2001cb90 (next 0x2001cb80)

While receiving and sending UDP frames (750 bytes each, at a combined incoming/outgoing rate of up to 10 frames/sec), the next pointer suddenly points into lwip_ram_heap (note lines 1 and 4):

NETBUF allocated 0x2001cb00 (next 0x20015d54)
NETBUF deallocate 0x2001cb00 (next 0x20015d54)
NETBUF allocated 0x2001cb00 (next 0x20015d54)
NETBUF allocated 0x20015d54 (next 0x0)
MEMP_NETBUF 4 of 16 used
ERROR: Socket send error: -3003

28 Oct 2019

Hi Leszek,

It looks like a thread-safety problem.

netbuf_new and netbuf_delete are not thread safe because they use a shared memory pool, so the allocating code has likely read a wrong value.

Please guard the non-thread-safe functions with a mutex, and take care to avoid deadlock.
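As a rough illustration of that advice, the sketch below guards a shared free list with a mutex. It is written for a host with POSIX threads, and all names (guarded_pool_init(), guarded_alloc(), guarded_free()) are hypothetical; on Mbed OS the same pattern would use the RTOS mutex type instead of pthread_mutex_t.

```c
#include <pthread.h>
#include <stddef.h>

#define NBUF 8

struct buf {
    struct buf *next;   /* link used only while the buffer is free */
};

static struct buf bufs[NBUF];
static struct buf *free_list;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Rebuild the free list under the lock. */
void guarded_pool_init(void)
{
    pthread_mutex_lock(&pool_lock);
    free_list = NULL;
    for (int i = 0; i < NBUF; i++) {
        bufs[i].next = free_list;
        free_list = &bufs[i];
    }
    pthread_mutex_unlock(&pool_lock);
}

/* Pop one buffer; the read of free_list and its update happen atomically
 * with respect to other threads using the same lock. */
struct buf *guarded_alloc(void)
{
    struct buf *b;

    pthread_mutex_lock(&pool_lock);
    b = free_list;
    if (b != NULL) {
        free_list = b->next;
    }
    pthread_mutex_unlock(&pool_lock);
    return b;
}

/* Push one buffer back; NULL is ignored. */
void guarded_free(struct buf *b)
{
    if (b == NULL) {
        return;
    }
    pthread_mutex_lock(&pool_lock);
    b->next = free_list;
    free_list = b;
    pthread_mutex_unlock(&pool_lock);
}
```

The lock is held only for the few pointer updates and no other lock is taken while it is held, which keeps the risk of deadlock low.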

Regards, Desmond

06 Nov 2019

Hi Desmond,

Thank you for your response. Isn't this a bug in the lwIP stack that should be fixed in the way you described? The only functions I used during my tests were UDPSocket::recvfrom() and UDPSocket::sendto(), called from a single application thread. I did not use netbuf_new() or netbuf_delete() directly.

EDIT2: The NULL dereference from the edit below is no longer relevant. The hard fault is not caused by a NULL pointer dereference but by dereferencing a wrong memory address that is treated as a pbuf struct. Here are the last few debug logs right before the hard fault:

buf = 0x20013a6c buf->p = 0x20015818
buf = 0x20013a9c buf->p = 0x20015e34
buf = 0x20013a24 buf->p = 0x20016450

... here the last UDP frame is being received, then

buf = 0x20013ab4 buf->p = 0x4f3245da

EDIT: Today I enabled dynamic memory allocation in lwIP with MEMP_MEM_MALLOC. The same application as before now terminates with a hard fault. It happens randomly, either in thumb2_memcpy() or in netconn_recv_data() on the line

len = netbuf_len((struct netbuf *)buf);

Strangely, checking buf and buf->p against NULL just before the affected line does not prevent the hard fault on that same line.

if (buf != NULL && ((struct netbuf *)buf)->p != NULL)
   len = netbuf_len((struct netbuf *)buf);

It looks like the memory pointed to by buf changes in the meantime (and the p pointer becomes NULL), i.e. between the 'if' and the next line.

I wonder if all these problems are platform-specific (K64F). Unfortunately, I do not have another board as powerful as this one to check on.

Regards, Leszek

06 Nov 2019

Hi again,

I am writing a separate post to keep the conversation readable.

After further analysis, I seem to have found a solution for both cases: with MEMP_MEM_MALLOC disabled (first case) and enabled (second case). Increasing MB_SIZE (mailbox size) from 8 to 16, together with NETBUF_NUM and PBUF_NUM (8 to 16), solves the issue; so far I had been increasing only the last two. Unfortunately, the MB_SIZE macro, which sets the maximum queue length, cannot be configured from outside.
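To make the role of the mailbox size concrete, here is a minimal model of a fixed-capacity mailbox that rejects posts when it is full. It is a sketch only: MBOX_CAPACITY plays the role of MB_SIZE, the function names are hypothetical, and the code does not reproduce the Mbed OS or lwIP mailbox implementation or establish the cause of the corruption.

```c
#include <stdbool.h>
#include <stddef.h>

#define MBOX_CAPACITY 8   /* hypothetical; plays the role of MB_SIZE */

struct mbox {
    void *slots[MBOX_CAPACITY];
    size_t head;    /* index of the oldest queued message */
    size_t count;   /* number of queued messages */
};

void mbox_init(struct mbox *m)
{
    m->head = 0;
    m->count = 0;
}

/* Queue a message; returns false (and stores nothing) when the mailbox
 * is full, in which case the caller still owns msg and must release it. */
bool mbox_trypost(struct mbox *m, void *msg)
{
    if (m->count == MBOX_CAPACITY) {
        return false;
    }
    m->slots[(m->head + m->count) % MBOX_CAPACITY] = msg;
    m->count++;
    return true;
}

/* Take the oldest message; returns false when the mailbox is empty. */
bool mbox_tryfetch(struct mbox *m, void **msg)
{
    if (m->count == 0) {
        return false;
    }
    *msg = m->slots[m->head];
    m->head = (m->head + 1) % MBOX_CAPACITY;
    m->count--;
    return true;
}
```

With a capacity of 8, a burst of 9 posts without a fetch in between gets its last post rejected. Raising the capacity makes that path rarer, which may be why the symptom disappears, but any ownership mistake on the rejected-post path would still be present.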

I am not sure whether increasing MB_SIZE is desirable at all (I do not know the lwIP architecture in detail), so it may only mask the real bug in lwIP. What do you think?