Prisma Cloud Defender Deployment in OpenShift causing HAProxy Pods to Crash owing to No Buffer Space Available at Netlink Socket
Created On 06/05/22 06:42 AM - Last Modified 10/26/22 04:14 AM
Symptom
- Deploying the Prisma Cloud Defender in an OpenShift environment causes the HAProxy pods to crash because no buffer space is available at the netlink socket.
- Kubelet logs suggest that the HAProxy pods restarted because they failed to answer liveness probes, with the following error message:
Container failed liveness probe
- On the other hand, Defender Logs contain the following error messages:
DEBU Failed to read from .. :HostTCPEgress runtimeData: : netlink receive: recvmsg: no buffer space available
This behaviour may be reproduced in OpenShift by:
- Recompiling the Defender with a minimal netlink socket buffer.
- Enabling Cloud Native Network Firewall (CNNF).
- Pausing the Defender process for some time to cause the buffer to overflow.
Environment
- Prisma Cloud
- OpenShift
Cause
- Defender logs show that the network disruption (the userspace netlink buffer is full, causing packets to be dropped) originates exclusively from CNNF Host monitoring. The CNNF feature uses a fail-open model, meaning that packets that cannot be handled to provide a verdict should be re-injected into the stack and accepted.
- However, Kernel versions < 4.6 honor the fail-open model only when the kernel queue is full, not when the user-space buffer is full.
- For Kernel versions >= 4.6, this behaviour has been patched, and packets are re-injected and accepted in such cases: Honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails.
- These findings have been confirmed via `/proc/net/netfilter/nfnetlink_queue` (NfnetLinkQueue - file):
- RHEL 7.9 with Kernel version 3.10 drops such packets.
- Ubuntu with Kernel version 5.8 accepts such packets.
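The counters in `/proc/net/netfilter/nfnetlink_queue` can be read programmatically to see whether packets are being dropped on a given Host. The following is a minimal Python sketch, not part of the Defender tooling. The column names in `FIELDS` are an assumption based on the kernel's output format for this file, and the function names are hypothetical.

```python
from typing import Dict, List

# Assumed column order of /proc/net/netfilter/nfnetlink_queue.
# A trailing constant column in the file is ignored.
FIELDS = [
    "queue_number",
    "peer_portid",
    "queue_total",
    "copy_mode",
    "copy_range",
    "queue_dropped",   # dropped because the kernel queue was full
    "user_dropped",    # dropped because the netlink send to user space failed
    "id_sequence",
]

PROC_PATH = "/proc/net/netfilter/nfnetlink_queue"


def parse_nfqueue_stats(text: str) -> List[Dict[str, int]]:
    """Parse the file contents into one dict per queue."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, (int(p) for p in parts))))
    return rows


def read_nfqueue_stats(path: str = PROC_PATH) -> List[Dict[str, int]]:
    """Read and parse the proc file; return an empty list if it is unavailable."""
    try:
        with open(path) as fh:
            return parse_nfqueue_stats(fh.read())
    except OSError:
        return []
```

Under these assumptions, a `user_dropped` value that keeps rising while the Defender runs would be consistent with the user-space buffer overflow described above.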
Resolution
The feasible methods available to resolve this are:
1. Disable Host monitoring under Radars > Settings by disabling 'Host network monitoring'. For more information, refer to: Cloud Native Network Firewall (CNNF)
2. Upgrade the Host to a supported Kernel version of 4.18 or higher (e.g. a RHEL 8 Host is based on Kernel 4.18, which contains the patch).
3. Consider increasing the size of the userspace netlink buffer via `/proc/sys/net/core/rmem_default` and `/proc/sys/net/core/rmem_max`. If this value equals the nfqueue queue size defined by the Defender, packets should never be dropped, since the Kernel will re-inject and accept them on account of the queue being full: https://elixir.bootlin.com/linux/v3.10/source/net/netfilter/nfnetlink_queue_core.c#L516.
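Before changing the buffer sizes mentioned in option 3, it may help to record the current values. The following Python sketch only reads the two files named above. The helper names are hypothetical, and the target size (the nfqueue queue size defined by the Defender) is not stated in this article, so it is deliberately left out.

```python
import os

# Directory holding the two settings named in option 3.
SYSCTL_DIR = "/proc/sys/net/core"


def parse_sysctl_int(text: str) -> int:
    """Convert the raw contents of a single-value sysctl file to an int (bytes)."""
    return int(text.strip())


def read_rmem(base: str = SYSCTL_DIR) -> dict:
    """Return the current rmem_default and rmem_max values in bytes."""
    values = {}
    for name in ("rmem_default", "rmem_max"):
        with open(os.path.join(base, name)) as fh:
            values[name] = parse_sysctl_int(fh.read())
    return values
```

On a Linux Host, `read_rmem()` returns a dictionary with the two current values. Changing them requires root privileges and is not shown here.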
Additional Information
- If you wish to continue using CNNF (network monitoring), upgrade the Host to a supported Kernel version of 4.18 or higher so that the Kernel can re-inject packets in an organised manner and avoid such issues.
- Running the following command on the Hosts returns the exact Kernel version:
uname -r
- If using Amazon Linux 2 Hosts, refer to their release notes here: https://docs.aws.amazon.com/AL2/latest/relnotes/relnotes-al2.html
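The Kernel version check described above can be scripted when many Hosts need to be assessed. The following Python sketch compares the output of `uname -r` against the two thresholds mentioned in this article (4.6 for the fail-open patch, 4.18 for the recommended version). The function names are hypothetical. The comparison uses only the major and minor numbers, so it cannot detect whether a distribution has backported the patch into an older Kernel.

```python
import platform
import re
from typing import Tuple


def parse_kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '3.10.0-1160.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def honors_fail_open(release: str) -> bool:
    """True if the version is >= 4.6, where the fail-open patch is present."""
    return parse_kernel_version(release) >= (4, 6)


def meets_recommended(release: str) -> bool:
    """True if the version is >= 4.18, the version recommended in this article."""
    return parse_kernel_version(release) >= (4, 18)


def current_release() -> str:
    """Return the running Kernel release, the same string 'uname -r' prints on Linux."""
    return platform.release()
```

For example, `meets_recommended(current_release())` gives a quick yes/no answer on the Host where it runs.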