Prisma Cloud Compute: Out-of-memory issues when implementing recursive File Integrity Management rules
Created On 09/09/22 14:01 PM - Last Modified 12/27/24 08:34 AM
Symptom
When Twistlock is enabled, CPU usage rises continuously; when Twistlock is stopped, the CPU graph returns to normal.
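First symptom:
- Without Prisma Cloud Compute
  - Start a Pod with low resource limits.
  - The Pod runs without issues.
  - Everything works.
- With Prisma Cloud Compute (Twistlock)
  - Start a Pod with low resource limits.
  - The Pod is killed because it runs out of memory.
  - The Pod does not work.
Second symptom:
- The kernel log shows that the container being created was killed due to insufficient resources: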
runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?)
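Third symptom:
Fourth symptom:
- If you inspect the processes, for example with "ps aufx", you will see that the crio process is under heavy load and that a process named "fsmon" consumes a large amount of resources. If you grep for that process's PID and kill it, resource usage should return to normal.
Fifth symptom:
- Errors in the journalctl output show that CRI-O cannot start containers, or starts them with noticeable delay: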
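The process check described above can be scripted. The following is a minimal sketch, assuming the standard 11-column layout of `ps aux` output (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); the function names `parse_ps` and `list_fsmon` are illustrative and not part of any Prisma Cloud tooling.

```python
import subprocess


def parse_ps(ps_output, name="fsmon"):
    """Return (pid, cpu_percent, mem_percent, command) for every process
    whose command line contains `name`, given `ps aux`-style text."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND is the 11th column
        if len(fields) < 11:
            continue
        try:
            pid, cpu, mem = int(fields[1]), float(fields[2]), float(fields[3])
        except ValueError:
            continue  # not a process row
        command = fields[10].strip()
        if name in command:
            matches.append((pid, cpu, mem, command))
    return matches


def list_fsmon():
    """Run `ps aux` on the local node and return the fsmon entries."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    return parse_ps(out)

# Example use on a node:
#   for pid, cpu, mem, cmd in list_fsmon():
#       print(pid, cpu, mem, cmd)
```

Comparing the reported CPU percentage before and after stopping the process is one way to confirm that fsmon is the source of the load.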
Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded
- In particular, many "name is reserved" errors, such as the following:
Aug 19 06:40:49 muc9-4wtp8-worker-b-gen9-mbxcv crio[2252]: time="2022-08-19 06:40:49.510901803Z" level=warning msg="error reserving ctr name k8s_frontoffice-analytics-domain_frontoffice-analytics-deployment-1-zz5s8_reef-an-maxi-uat_ea9c7d33-8fea-4165-a534-106ce6c33e29_17 for id 90615636aa18fdc2a17bfef076500dfd14932a7109ca8e662c45b4708f0364bf: name is reserved"
Other noteworthy symptom logs:
Aug 19 10:30:46 muc9-4wtp8-worker-b-gen9-mbxcv crio[2252]: time="2022-08-19 10:30:45.691591825Z" level=error msg="Container creation error: time="2022-08-19T10:30:42Z" level=warning msg="unable to get oom kill count" error="open /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podff7a0d7a_17b4_4777_9146_94fc41681318.slice/crio-fd00abb205097e9244014a63265fb0f656dab4ebd9dfe1e800d0a10ec55fc42d.scope/memory.oom_control: no such file or directory"\ntime="2022-08-19T10:30:44Z" level=error msg="runc create failed: unable to start container process: unable to apply cgroup configuration: Timeout waiting for systemd to create crio-fd00abb205097e9244014a63265fb0f656dab4ebd9dfe1e800d0a10ec55fc42d.scope"\n" id=dff10a34-f28e-4370-9fff-8e895148d11d name=/runtime.v1.RuntimeService/CreateContainer
Aug 19 10:30:46 muc9-4wtp8-worker-b-gen9-mbxcv crio[2252]: time="2022-08-19 10:30:46.086315234Z" level=info msg="createCtr: deleting container ID fd00abb205097e9244014a63265fb0f656dab4ebd9dfe1e800d0a10ec55fc42d from idIndex" id=dff10a34-f28e-4370-9fff-8e895148d11d name=/runtime.v1.RuntimeService/CreateContainer
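To gauge how often a node hits the log symptoms quoted above, the messages can be counted. This is a minimal sketch: the search strings are taken from the log excerpts in this article, while the labels and the function name `count_symptoms` are illustrative assumptions.

```python
from collections import Counter

# Substrings copied from the log excerpts in this article.
PATTERNS = {
    "name_reserved": "name is reserved",
    "oom_killed": "container init was OOM-killed",
    "deadline_exceeded": "context deadline exceeded",
    "systemd_scope_timeout": "Timeout waiting for systemd to create",
}


def count_symptoms(log_lines):
    """Count how many log lines contain each known symptom message."""
    counts = Counter()
    for line in log_lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts

# Example use, assuming the log was first saved to a file, e.g. with
#   journalctl -u crio --no-pager > crio.log
# (the unit name "crio" is an assumption about the node setup):
#   with open("crio.log") as fh:
#       print(count_symptoms(fh))
```

A count that climbs steadily while Defender is running, and stops climbing once the recursive rules are reduced, is consistent with the cause described in this article.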
Environment
- Prisma Cloud Compute
- Cluster Defender
- OpenShift/CRI-O
Cause
- The root cause is the fsmon process (the file system monitoring process) consuming a large amount of CPU, which severely degrades performance across the whole cluster.
- In particular, if the customer has many File Integrity rules configured in the Console, especially recursive ones (that is, rules that track an entire file-system tree, such as "path": "/bin", "recursive": true, ...), this feature can have a severe performance impact. The impact depends on the configured rules and grows with the number of directories tracked recursively.
- In this case, fsmon has to track and scan many files, which delays the creation of containers (created with runc). When these creations fail or time out, as in the logs above, the container runtime (crio) keeps retrying to spawn the containers, which makes the system extremely busy.
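To see which rules are likely to be the most expensive, one can estimate how many file-system entries each rule covers. The sketch below is a rough estimate, not a description of how fsmon works: it assumes each rule is a dictionary with the "path" and "recursive" keys shown above, and the helper names `count_entries` and `rank_rules` are hypothetical.

```python
import os


def count_entries(path, recursive):
    """Count the files and directories that a rule on `path` would cover."""
    if not os.path.isdir(path):
        return 1 if os.path.exists(path) else 0
    if not recursive:
        try:
            return len(os.listdir(path))
        except OSError:
            return 0  # unreadable directory
    total = 0
    # os.walk skips unreadable directories by default.
    for _root, dirs, files in os.walk(path):
        total += len(dirs) + len(files)
    return total


def rank_rules(rules):
    """Return (entry_count, path) pairs, largest first."""
    sized = [
        (count_entries(rule["path"], rule.get("recursive", False)), rule["path"])
        for rule in rules
    ]
    return sorted(sized, reverse=True)

# Example use with the rule fragment quoted above:
#   print(rank_rules([{"path": "/bin", "recursive": True}]))
```

Rules at the top of the ranking are the first candidates to narrow or make non-recursive.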
Resolution
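- Either kill the fsmon process, or remove or reduce the recursive File Integrity Management rules in the host runtime rules in the Console.
- Increase the CPU resources on the nodes.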
Additional Information
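You can configure FIM to detect:
- Reads or writes to sensitive files, such as certificates, secrets, and configuration files.
- Binaries written to the file system.
- Abnormally installed software, for example files written to the file system by programs other than apt-get.
More information: Host runtime defense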