Prisma Cloud Compute: Unable to delete Pods in the environment as they are stuck in the Terminating state after deletion
Created On 04/29/23 06:03 AM - Last Modified 07/28/23 05:53 AM
Symptom
- Prisma Cloud Compute: Pods in the environment cannot be deleted; they remain stuck in the Terminating state after deletion, with the following error:
"failed to get sandbox ip: check network namespace closed: remove netns:unlinkat /var/run/netns/<cni-name> device or resource busy"
Environment
- Prisma Cloud Compute (22.06 and above)
- CentOS 7 with Containerd
Cause
Note: This behavior is problematic and appears to be specific to CentOS 7 with containerd, as reported by the community on GitHub (#3667)
- The issue is not specific to the Defender itself, but to the fact that it mounts the host's "/var/run" (and/or "/run") directory
- The same issue can easily be replicated with any pod that mounts "/var/run", for example the deployment below (reproduction steps follow the manifest)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: var-run-mounter
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        volumeMounts:
        - name: var-run-mount
          mountPath: "/var/run"
        ports:
        - containerPort: 80
      volumes:
      - name: var-run-mount
        hostPath:
          path: "/var/run"
- When the 'var-run-mounter' YAML is deployed, the resulting container gets all of the special procfs "/run/netns/cni-..." filesystems (each representing the kernel network namespace of a running pod) mounted into it:
$ ps auxf
...
root      4101  0.0  0.1 712432  6632 ?  Sl  07:24  0:00 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 0618101459a8b765d6c64a68b3f244a48e0afbf041daca8d0365aa92a1b1c897 -address /run/containerd/containerd.sock
root      4167  0.0  0.0  32620  3240 ?  Ss  07:24  0:00  \_ nginx: master process nginx -g daemon off;

$ sudo grep 'cni-' /proc/4167/mountinfo
...
- These mounts are created with 'private' propagation (see the Linux mount_namespaces documentation)
- As a result, when one of the other running Pods is terminated, containerd unmounts its kernel network namespace "/run/netns/cni-...", but the unmount event is not propagated to the 'var-run-mounter' container mentioned above. That container keeps the mount held open, which causes the error. This can be confirmed from the host as shown in the sketch below.
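A rough way to confirm the private, stale mounts from the host, reusing the PID from the output above (4167; the actual PID will differ in each environment):

$ sudo grep 'cni-' /proc/4167/mountinfo                        # stale netns mounts still held inside the container
$ sudo grep 'cni-' /proc/4167/mountinfo | grep -c 'shared:'    # 0 means the mounts are private, so host unmount events never reach them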
Resolution
- This is a platform issue that we can't fix (upgrading the OS may resolve it), but the following workaround is possible on our end:
1. In the Defender DaemonSet YAML, change the "/var/run" and "/run" volume mounts
From
- name: docker-sock-folder
  mountPath: "/var/run"
...
- name: runc-proxy-sock-folder
  mountPath: "/run"
to
- name: docker-sock-folder
  mountPath: "/var/run"
  mountPropagation: HostToContainer
...
- name: runc-proxy-sock-folder
  mountPath: "/run"
  mountPropagation: HostToContainer
Note: The 'HostToContainer' mount propagation ensures that mount/unmount events are propagated from the host to the container (see "Mount propagation" in the Kubernetes Volumes documentation). A sketch for verifying the setting after redeploying follows step 2 below.
2. Redeploy the Defender
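After redeploying, the new propagation setting can be checked with a rough sketch like the following, assuming the default 'twistlock' namespace and 'twistlock-defender' app label, and that 'cat' is available inside the Defender image (adjust the names to your environment):

$ POD=$(kubectl -n twistlock get pods -l app=twistlock-defender -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n twistlock exec "$POD" -- cat /proc/1/mountinfo | grep -E ' /var/run | /run '
  # The "/var/run" and "/run" entries now carry a 'master:<id>' (or 'shared:<id>') optional field,
  # meaning mount/unmount events propagate from the host; purely private mounts show neither flag.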
Additional Information
- Virtual filesystems at "/var/run/netns" represent kernel network namespaces
- Our Defender / fsmon processes do not mount these filesystems themselves; the mounts are inherited from the host, as illustrated below
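For reference, these namespace mounts can be inspected directly on the host (output varies per node):

$ ls /var/run/netns                   # one cni-... entry per running pod on the node
$ sudo mount | grep '/run/netns'      # each entry is a separate mount that holds a kernel network namespace open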