Prisma Cloud Defender Deployment in OpenShift causing HAProxy Pods to Crash owing to No Buffer Space Available at Netlink Socket


Created On 06/05/22 06:42 AM - Last Modified 10/26/22 04:14 AM

Cloud Infrastructure Protection

Prisma Cloud Compute Edition Self-Hosted

Symptom

Deploying the Prisma Cloud Defender in an OpenShift environment causes the HAProxy pods to crash because no buffer space is available at the netlink socket.
Kubelet logs indicate that the HAProxy pods restarted after failing to answer liveness probes, with the following error message:
```
Container failed liveness probe
```
The Defender logs, in turn, contain the following error message:

DEBU Failed to read from .. :HostTCPEgress runtimeData: : netlink receive: recvmsg: no buffer space available

This behaviour can be reproduced in OpenShift as follows:

1. Recompile the Defender with a minimal netlink socket buffer.
2. Enable Cloud Native Network Firewall (CNNF).
3. Pause the Defender process for some time so that the buffer overflows.

Environment

Prisma Cloud
OpenShift

Cause

The Defender logs show that the network disruption (the userspace netlink buffer fills up and packets are dropped) originates exclusively from CNNF Host monitoring. CNNF uses a fail-open model, meaning that packets for which no verdict can be provided are re-injected into the stack and accepted.
However, Kernel versions < 4.6 honor the fail-open model only when the kernel queue is full, not when the userspace buffer is full.
For Kernel versions >= 4.6, this behaviour has been patched, and packets are re-injected and accepted in such cases: Honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails.
These findings were confirmed via the `/proc/net/netfilter/nfnetlink_queue` file: NfnetLinkQueue - file.
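
To make the `/proc/net/netfilter/nfnetlink_queue` check concrete, here is a minimal Python sketch that parses the file and exposes the per-queue drop counters. The column order in `FIELDS` is an assumption based on the kernel's usual output for this file and should be verified against the Kernel in use. The function name `parse_nfnetlink_queue` is hypothetical.

```python
from typing import Dict, List

# Assumed column order of /proc/net/netfilter/nfnetlink_queue.
# Verify against the Kernel source for the version you are running.
FIELDS = [
    "queue_number", "peer_portid", "queue_total", "copy_mode",
    "copy_range", "queue_dropped", "user_dropped", "id_sequence",
]


def parse_nfnetlink_queue(text: str) -> List[Dict[str, int]]:
    """Parse the contents of the nfnetlink_queue proc file into one dict per queue."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < len(FIELDS):
            continue  # skip blank or unexpectedly short lines
        values = [int(p) for p in parts[:len(FIELDS)]]
        rows.append(dict(zip(FIELDS, values)))
    return rows
```

Under the assumed layout, passing the file's contents to `parse_nfnetlink_queue` and watching `user_dropped` grow would point to packets that could not be delivered to userspace, while `queue_dropped` would point to a full kernel queue.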

Note:

RHEL 7.9 with Kernel version 3.10 drops such packets.
Ubuntu with Kernel version 5.8 accepts such packets.

Resolution

The feasible methods to resolve this are:

1. Disable Host monitoring under Radars > Settings by turning off 'Host network monitoring'. For more information, refer to: Cloud Native Network Firewall (CNNF)

2. Upgrade the Host to a supported Kernel version of 4.18 or higher (e.g. a RHEL 8 Host is based on Kernel 4.18, which contains the patch).

3. Consider increasing the size of the userspace netlink buffer via `/proc/sys/net/core/rmem_default` and `/proc/sys/net/core/rmem_max`. If this value equals the nfqueue queue size defined by the Defender, packets should never be dropped, since the Kernel re-injects and accepts them on the grounds that the queue is full: https://elixir.bootlin.com/linux/v3.10/source/net/netfilter/nfnetlink_queue_core.c#L516.
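
As an illustration of the third option, the following Python sketch reads the current `rmem_default` and `rmem_max` values and reports which of them fall below a chosen target. The target value is hypothetical; this article does not state the Defender's nfqueue size, so it must be determined separately. The function names are illustrative only.

```python
from pathlib import Path
from typing import Dict


def read_rmem_limits(base: str = "/proc/sys/net/core") -> Dict[str, int]:
    """Read the current socket receive-buffer limits, in bytes."""
    limits = {}
    for name in ("rmem_default", "rmem_max"):
        limits[name] = int(Path(base, name).read_text().strip())
    return limits


def settings_to_raise(current: Dict[str, int], target_bytes: int) -> Dict[str, int]:
    """Return the sysctl keys whose current value is below target_bytes.

    target_bytes is a hypothetical value chosen by the operator.
    """
    return {
        f"net.core.{name}": target_bytes
        for name, value in current.items()
        if value < target_bytes
    }
```

Calling `settings_to_raise(read_rmem_limits(), target_bytes)` on a Host returns the `net.core.*` keys that would need to be raised, for example with `sysctl -w` as root. The sketch only reports values and changes nothing on the Host.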

Additional Information

If you wish to continue using CNNF (the network monitoring tool), upgrade the Host to a supported Kernel version of 4.18 or higher so that the Kernel can re-inject packets in an orderly manner and such issues are avoided.
Run the following command on the Hosts to get the exact Kernel version:

uname -r
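
The `uname -r` output can also be checked programmatically. The sketch below compares the reported release against the 4.6 threshold named in the Cause section. It assumes the release string begins with `major.minor`, and the function names are hypothetical. A version-number comparison is only an approximation, because distribution Kernels may carry backported patches or lack them.

```python
import platform
import re
from typing import Tuple


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_fail_open_fix(release: str) -> bool:
    """True if the release is at or above 4.6, the threshold named in this article."""
    return kernel_major_minor(release) >= (4, 6)


def current_kernel_has_fix() -> bool:
    # On Linux, platform.release() returns the same string as `uname -r`.
    return has_fail_open_fix(platform.release())
```

For example, a release string such as `3.10.0-1160.el7.x86_64` is reported as lacking the fix, while `4.18.0-305.el8.x86_64` is reported as having it.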