Prisma Cloud Compute : Unable to delete Pods in the Environment as they are stuck in Terminating State after Deletion


Created On 04/29/23 06:03 AM - Last Modified 07/28/23 05:53 AM


Symptom


  • Prisma Cloud Compute: Pods in the environment cannot be deleted and remain stuck in the Terminating state after deletion, failing with the following error:
"failed to get sandbox ip: check network namespace closed: remove netns:unlinkat /var/run/netns/<cni-name> device or resource busy"


Environment


  • Prisma Cloud Compute (22.06 and above)
  • CentOS 7 with Containerd


Cause


Note : This behavior is problematic and appears to be specific to CentOS 7 with Containerd, as reported by the community in GitHub #3667
  • The issue is not specific to the Defender itself, but to the fact that the Defender pod mounts the Host's "/var/run" (and/or "/run") directory
  • The same issue can be easily replicated with any pod mounting "/var/run"
Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: var-run-mounter
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        volumeMounts:
        - name: var-run-mount
          mountPath: "/var/run"
        ports:
        - containerPort: 80
      volumes:
      - name: var-run-mount
        hostPath:
          path: "/var/run"
  • When the 'var-run-mounter' YAML is deployed, the resulting container gets all of the special "/run/netns/cni-..." namespace mounts (representing the kernel network namespaces of all running Pods on the node) mounted into it:
$ ps auxf
...
root      4101  0.0  0.1 712432  6632 ?        Sl   07:24   0:00 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 0618101459a8b765d6c64a68b3f244a48e0afbf041daca8d0365aa92a1b1c897 -address /run/containerd/containerd.sock
root      4167  0.0  0.0  32620  3240 ?        Ss   07:24   0:00  \_ nginx: master process nginx -g daemon off;

$ sudo grep 'cni-' /proc/4167/mountinfo
...
  • These mounts are created with 'private' mount propagation (see mount_namespaces(7))
  • As a result, when any of the other running Pods is terminated, Containerd unmounts its kernel network namespace "/run/netns/cni-...", but the unmount event is not propagated to the "var-run-mounter" container above, which keeps holding the mount and causes the error
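A minimal way to confirm the propagation mode, reusing the nginx PID (4167) from the ps output above: private mounts carry neither a 'shared:N' nor a 'master:N' tag in mountinfo, so the command below is expected to return nothing.

$ sudo grep 'cni-' /proc/4167/mountinfo | grep -E 'shared:|master:'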


Resolution


  • We can't fix this platform issue (upgrading the OS may resolve it), but the following workaround is possible on the Defender side:
1. Modify the Defender's YAML as follows

From:

        - name: docker-sock-folder
          mountPath: "/var/run"
...
        - name: runc-proxy-sock-folder
          mountPath: "/run"

To:

        - name: docker-sock-folder
          mountPath: "/var/run"
          mountPropagation: HostToContainer
...
        - name: runc-proxy-sock-folder
          mountPath: "/run"
          mountPropagation: HostToContainer

Note : The 'HostToContainer' mountPropagation setting ensures that mount and unmount events are propagated from the Host to the Container (see the Kubernetes documentation on Volumes / mount propagation)

2. Redeploy the Defender
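A hedged example of rolling the change out, assuming the default 'twistlock' namespace and Defender DaemonSet name (adjust both, and the file name, to your deployment):

$ kubectl apply -f daemonset.yaml
$ kubectl -n twistlock rollout status ds/twistlock-defender-ds

# Verify the propagation setting landed on the Defender's volumeMounts
$ kubectl -n twistlock get ds twistlock-defender-ds -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}'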



Additional Information


  • Virtual filesystems at "/var/run/netns" represent kernel network namespaces
  • Our Defender / fsmon processes do not mount these filesystems themselves; they are inherited from the Host
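For reference, these namespace files can be listed on the host (a sketch; the cni-<uuid> names differ per environment):

$ sudo ls /var/run/netns        # one cni-<uuid> entry per pod sandbox on the node
$ sudo ip netns list            # iproute2 view of the same namespaces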

