Prisma Cloud Compute: Out of Memory issue when recursive File Integrity Management rules are in place
Created On 09/09/22 14:01 PM - Last Modified 12/12/24 18:06 PM
Symptom
CPU usage increases steadily while Twistlock is enabled; when Twistlock is stopped, the CPU graphs return to normal.
1st symptom:
- Without Prisma Cloud Compute
  - Start a pod that has a low memory limit.
  - The pod works without issue.
  - Everything is OK.
- With Prisma Cloud Compute (Twistlock)
  - Start a pod that has a low memory limit.
  - The pod is killed because it runs out of memory.
  - The pod does not work.
2nd symptom:
- The kernel log shows that container creation is killed due to insufficient resources:
runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?)
3rd symptom:
4th symptom:
- If you list the processes, for example with "ps aufx", you will notice that the crio process is under heavy load and that a process named "fsmon" is consuming a lot of resources. If you look up the PID of that process and kill it, resource usage should return to normal.
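The process check described in this symptom can be scripted. The following is a minimal Python sketch, not an official Prisma Cloud tool: it parses the text output of "ps aux" and lists processes whose command contains "fsmon", highest CPU usage first. The function names and the assumption of the standard 11-column "ps aux" layout are illustrative.

```python
import subprocess


def read_ps_output() -> str:
    """Return the raw text of `ps aux` from the local node."""
    return subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout


def find_processes(ps_output: str, name: str = "fsmon"):
    """Return (pid, cpu_percent, mem_percent, command) tuples for matches."""
    matches = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        # Assumed columns:
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # not a complete process row
        command = fields[10].strip()
        if name in command:
            matches.append(
                (int(fields[1]), float(fields[2]), float(fields[3]), command)
            )
    # Highest CPU consumers first
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Example use on a node:
#   for pid, cpu, mem, cmd in find_processes(read_ps_output()):
#       print(pid, cpu, mem, cmd)
```

The sketch only reports the matching processes; whether to kill the reported PID remains a manual decision.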
5th symptom:
- Errors in the journalctl output indicate that CRI-O fails to start containers or starts them with significant delays:
Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded
- In particular, there are many "name is reserved" errors like the following:
Aug 19 06:40:49 muc9-4wtp8-worker-b-gen9-mbxcv crio[2252]: time="2022-08-19 06:40:49.510901803Z" level=warning msg="error reserving ctr name k8s_frontoffice-analytics-domain_frontoffice-analytics-deployment-1-zz5s8_reef-an-maxi-uat_ea9c7d33-8fea-4165-a534-106ce6c33e29_17 for id 90615636aa18fdc2a17bfef076500dfd14932a7109ca8e662c45b4708f0364bf: name is reserved"
Other symptomatic logs of interest:
Aug 19 10:30:46 muc9-4wtp8-worker-b-gen9-mbxcv crio[2252]: time="2022-08-19 10:30:45.691591825Z" level=error msg="Container creation error: time="2022-08-19T10:30:42Z" level=warning msg="unable to get oom kill count" error="open /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podff7a0d7a_17b4_4777_9146_94fc41681318.slice/crio-fd00abb205097e9244014a63265fb0f656dab4ebd9dfe1e800d0a10ec55fc42d.scope/memory.oom_control: no such file or directory"\ntime="2022-08-19T10:30:44Z" level=error msg="runc create failed: unable to start container process: unable to apply cgroup configuration: Timeout waiting for systemd to create crio-fd00abb205097e9244014a63265fb0f656dab4ebd9dfe1e800d0a10ec55fc42d.scope"\n" id=dff10a34-f28e-4370-9fff-8e895148d11d name=/runtime.v1.RuntimeService/CreateContainer
Aug 19 10:30:46 muc9-4wtp8-worker-b-gen9-mbxcv crio[2252]: time="2022-08-19 10:30:46.086315234Z" level=info msg="createCtr: deleting container ID fd00abb205097e9244014a63265fb0f656dab4ebd9dfe1e800d0a10ec55fc42d from idIndex" id=dff10a34-f28e-4370-9fff-8e895148d11d name=/runtime.v1.RuntimeService/CreateContainer
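To judge how often these messages occur on a node, the journal text can be scanned for the message fragments quoted in the symptoms. The following Python sketch is one possible way to do that; the function name count_signatures is hypothetical, and the list of fragments is taken only from the log excerpts in this article, so it is not an exhaustive list of related errors.

```python
from collections import Counter

# Message fragments copied from the log excerpts in this article.
SIGNATURES = (
    "name is reserved",
    "context deadline exceeded",
    "OOM-killed",
    "Timeout waiting for systemd",
)


def count_signatures(journal_text: str) -> Counter:
    """Count how many journal lines contain each known message fragment."""
    counts = Counter()
    for line in journal_text.splitlines():
        for signature in SIGNATURES:
            if signature in line:
                counts[signature] += 1
    return counts
```

The input could be, for example, the saved output of "journalctl -u crio", assuming the CRI-O service unit on the node is named crio. A steadily growing "name is reserved" count over time would be consistent with the retry behavior described in this article.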
Environment
- Prisma Cloud Compute
- Cluster Defender
- Openshift/CRIO
Cause
- The root cause of this issue is the fsmon process (the file system monitoring process) consuming a large amount of CPU, which severely degrades performance across the whole cluster environment.
- In particular, if many “File Integrity” rules are configured in the Console, especially recursive ones (i.e., rules that track an entire file-system tree recursively, such as "path": "/bin", "recursive": true, ...), this feature can have a severe performance impact, depending on how many directories the configured rules track recursively.
- In this case, the file system monitor (fsmon) needs to track and scan many files, which delays container creation (containers are created with runc). As these creations fail or time out, as seen above, the container runtime (crio) keeps trying to spawn the containers again and again, making the system extremely busy.
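To get a rough sense of how much more a recursive rule covers than a non-recursive one, the number of files under a path can be counted in both modes. The Python sketch below is only a proxy: how fsmon actually tracks files is not described in this article, and the function name count_tracked_files is hypothetical.

```python
import os


def count_tracked_files(path: str, recursive: bool) -> int:
    """Count files directly in `path`, or in its whole tree if recursive."""
    if not recursive:
        # Only the first os.walk() result, i.e. the top-level directory.
        _root, _dirs, files = next(os.walk(path), (path, [], []))
        return len(files)
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


# Example use on a node:
#   print(count_tracked_files("/bin", recursive=False))
#   print(count_tracked_files("/bin", recursive=True))
```

A large gap between the two counts for a given path suggests that the recursive rule for that path is a candidate for narrowing.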
Resolution
- Kill the fsmon process, or delete or reduce the recursive File Integrity Management rules in the host runtime rules in the Console.
- Increase CPU resources on the nodes
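When reviewing which rules to delete or reduce, it can help to list the recursive ones first. The sketch below assumes the File Integrity Management rules are available as a JSON list of objects carrying the "path" and "recursive" fields shown in the Cause section; the real Console export may be structured differently, and the function name recursive_rule_paths is hypothetical.

```python
import json


def recursive_rule_paths(rules_json: str):
    """Return the paths of recursive rules, broadest (shallowest) first."""
    # Assumed shape: [{"path": "/bin", "recursive": true, ...}, ...]
    rules = json.loads(rules_json)
    paths = [rule["path"] for rule in rules if rule.get("recursive") is True]
    # Fewer path components means a larger tree is likely covered.
    return sorted(paths, key=lambda p: p.rstrip("/").count("/"))


# Example use with a hypothetical exported file:
#   with open("fim_rules.json") as fh:
#       print(recursive_rule_paths(fh.read()))
```

Ordering by path depth is only a heuristic for spotting broad rules; it does not measure the number of files a rule actually covers.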
Additional Information
You can configure FIM to detect:
- Reads or writes to sensitive files, such as certificates, secrets, and configuration files.
- Binaries written to the file system.
- Abnormally installed software. For example, files written to a file system by programs other than apt-get.
For more information: Runtime defense for hosts