Prisma Cloud Compute : Unable to delete Pods in the Environment as they are stuck in Terminating State after Deletion


Created On 04/29/23 06:03 AM - Last Modified 07/28/23 05:53 AM


Symptom


  • Prisma Cloud Compute: Pods in the environment cannot be deleted and remain stuck in the Terminating state after deletion, failing with the following error:
"failed to get sandbox ip: check network namespace closed: remove netns:unlinkat /var/run/netns/<cni-name> device or resource busy"


Environment


  • Prisma Cloud Compute (22.06 and above)
  • CentOS 7 with Containerd


Cause


Note : This behavior is problematic and appears to be specific to CentOS 7 with Containerd, as reported by the community in GitHub #3667
  • The issue is not specific to the Defender itself, but to the fact that the Defender pod mounts the Host's "/var/run" (and/or "/run") directory
  • The same issue can be easily replicated with any pod mounting "/var/run"
Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: var-run-mounter
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        volumeMounts:
        - name: var-run-mount
          mountPath: "/var/run"
        ports:
        - containerPort: 80
      volumes:
      - name: var-run-mount
        hostPath:
          path: "/var/run"
  • When the 'var-run-mounter' YAML is deployed, the resulting container gets all of the special "/run/netns/cni-..." namespace mounts (representing the kernel network namespaces of all running Pods on the node) mounted into it:
$ ps auxf
...
root      4101  0.0  0.1 712432  6632 ?        Sl   07:24   0:00 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 0618101459a8b765d6c64a68b3f244a48e0afbf041daca8d0365aa92a1b1c897 -address /run/containerd/containerd.sock
root      4167  0.0  0.0  32620  3240 ?        Ss   07:24   0:00  \_ nginx: master process nginx -g daemon off;

$ sudo grep 'cni-' /proc/4167/mountinfo
...
  • These mounts are created with 'private' mount propagation (see mount_namespaces(7))
  • As a result, when any of the other running Pods is terminated, Containerd unmounts its kernel network namespace "/run/netns/cni-...", but the unmount event is not propagated to the "var-run-mounter" container above, which keeps holding the mount and causes the error
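A minimal way to confirm the propagation mode, reusing the nginx PID (4167) from the ps output above: private mounts carry neither a 'shared:N' nor a 'master:N' tag in mountinfo, so the command below is expected to return nothing.

$ sudo grep 'cni-' /proc/4167/mountinfo | grep -E 'shared:|master:'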


Resolution


  • We can't fix this platform issue (upgrading the OS may resolve it), but the following workaround is possible on the Defender side:
1. Modify the Defender's YAML as follows

From:

        - name: docker-sock-folder
          mountPath: "/var/run"
...
        - name: runc-proxy-sock-folder
          mountPath: "/run"

To:

        - name: docker-sock-folder
          mountPath: "/var/run"
          mountPropagation: HostToContainer
...
        - name: runc-proxy-sock-folder
          mountPath: "/run"
          mountPropagation: HostToContainer

Note : The 'HostToContainer' mountPropagation setting ensures that mount and unmount events are propagated from the Host to the Container (see the Kubernetes documentation on Volumes / mount propagation)

2. Redeploy the Defender
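A hedged example of rolling the change out, assuming the default 'twistlock' namespace and Defender DaemonSet name (adjust both, and the file name, to your deployment):

$ kubectl apply -f daemonset.yaml
$ kubectl -n twistlock rollout status ds/twistlock-defender-ds

# Verify the propagation setting landed on the Defender's volumeMounts
$ kubectl -n twistlock get ds twistlock-defender-ds -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}'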



Additional Information


  • Virtual filesystems at "/var/run/netns" represent kernel network namespaces
  • Our Defender / fsmon processes do not mount these filesystems themselves; they are inherited from the Host
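For reference, these namespace files can be listed on the host (a sketch; the cni-<uuid> names differ per environment):

$ sudo ls /var/run/netns        # one cni-<uuid> entry per pod sandbox on the node
$ sudo ip netns list            # iproute2 view of the same namespaces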

