Node Problem Detector
What is Node Problem Detector (NPD)?
Node Problem Detector is an open-source daemon that monitors node health
and detects common node problems such as hardware, kernel, and container runtime issues.
It is usually deployed on Kubernetes as a DaemonSet.
Metrics layer collected
= system metrics
Reports "error information" that occurred on the host
Link to the list of collected metrics
Which metrics NPD collects varies considerably with the settings specified in its config
Checking the NPD configuration
--system-log-monitors=
- /config/kernel-monitor.json
- /config/kernel-monitor-filelog.json
- /config/docker-monitor.json
- /config/docker-monitor-filelog.json
--custom-plugin-monitors=
- /config/network-problem-monitor.json
--config.system-stats-monitor=
- /config/system-stats-monitor.json
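In the DaemonSet command later in this post, each of these flags receives its config files as one comma-separated value. The following is a minimal sketch of how such a command line could be split back into per-flag lists for inspection. The function name `monitor_configs` and the shortened sample command are assumptions for illustration, not part of NPD.

```python
import shlex

# Shortened sample command, modeled on the flags listed above.
CMD = (
    "/node-problem-detector --logtostderr "
    "--system-log-monitors=/config/kernel-monitor.json,/config/kernel-monitor-filelog.json "
    "--custom-plugin-monitors=/config/network-problem-monitor.json "
    "--config.system-stats-monitor=/config/system-stats-monitor.json"
)


def monitor_configs(cmd):
    """Return {flag name: [config paths]} for every monitor flag in cmd."""
    result = {}
    for arg in shlex.split(cmd)[1:]:          # skip the binary itself
        if not arg.startswith("--") or "=" not in arg:
            continue                          # boolean flags such as --logtostderr
        flag, value = arg[2:].split("=", 1)
        if "monitor" in flag:
            result[flag] = value.split(",")   # comma-separated list of files
    return result


if __name__ == "__main__":
    for flag, paths in monitor_configs(CMD).items():
        print(flag, paths)
```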
The problem metrics are listed below
- problem_counter{reason="AUFSUmountHung"} 0
- problem_counter{reason="ConntrackFull"} 0
- problem_counter{reason="DockerHung"} 0
- problem_counter{reason="Ext4Error"} 0
- problem_counter{reason="Ext4Warning"} 0
- problem_counter{reason="FilesystemIsReadOnly"} 0
- problem_counter{reason="IOError"} 0
- problem_counter{reason="KernelOops"} 0
- problem_counter{reason="MemoryReadError"} 0
- problem_counter{reason="OOMKilling"} 0
- problem_counter{reason="TaskHung"} 15
- problem_counter{reason="UnregisterNetDevice"} 0
- problem_gauge{reason="AUFSUmountHung",type="KernelDeadlock"} 0
- problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
- problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
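Lines like the ones above follow the Prometheus text format: a metric name, optional labels in braces, and a value. The following is a rough sketch of pulling out the problems whose counter is non-zero. The parser is deliberately simplified (no escaped quotes, no timestamps), and the helper names `parse_metrics` and `nonzero_problems` are made up for this example.

```python
import re

# A few sample lines copied from the list above.
SAMPLE = """\
problem_counter{reason="KernelOops"} 0
problem_counter{reason="TaskHung"} 15
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
"""

# name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and HELP/TYPE comments
            continue
        m = LINE_RE.match(line)
        if m is None:
            continue                           # ignore lines this sketch cannot parse
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def nonzero_problems(samples):
    """Map reason -> count for every problem_counter above zero."""
    return {
        labels.get("reason", ""): value
        for name, labels, value in samples
        if name == "problem_counter" and value > 0
    }


if __name__ == "__main__":
    print(nonzero_problems(parse_metrics(SAMPLE)))
```

With the sample above, only TaskHung is reported, since it is the only counter that is not 0.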
Description of all metrics
| Metric | Description |
|---|---|
| cpu_load_15m | CPU average load (15m) |
| cpu_load_1m | CPU average load (1m) |
| cpu_load_5m | CPU average load (5m) |
| cpu_runnable_task_count | The average number of runnable tasks in the run-queue during the last minute |
| cpu_usage_time | CPU usage, in seconds |
| disk_avg_queue_len | The average queue length on the disk |
| disk_bytes_used | Disk bytes used, in Bytes |
| disk_io_time | The IO time spent on the disk, in ms |
| disk_merged_operation_count | Disk merged operations count |
| disk_operation_bytes_count | Bytes transferred in disk operations |
| disk_operation_count | Disk operations count |
| disk_operation_time | Time spent in disk operations, in ms |
| disk_weighted_io | The weighted IO on the disk, in ms |
| host_uptime | The uptime of the operating system |
| memory_anonymous_used | Anonymous memory usage, in Bytes. Summing values of all states yields the total anonymous memory used. |
| memory_bytes_used | Memory usage by each memory state, in Bytes. Summing values of all states yields the total memory on the node. |
| memory_dirty_used | Dirty pages usage, in Bytes. Dirty means the memory is waiting to be written back to disk, and writeback means the memory is actively being written back to disk. |
| memory_page_cache_used | Page cache memory usage, in Bytes. Summing values of all states yields the total page cache memory used. |
| memory_unevictable_used | Unevictable memory usage, in Bytes |
| problem_counter | Number of times a specific type of problem has occurred. |
| problem_gauge | Whether a specific type of problem is affecting the node or not. |
| system_cpu_stat | Cumulative time each CPU spent in various stages. |
| system_interrupts_total | Total number of interrupts serviced (cumulative). |
| system_os_feature | OS features like GPU support, KTD kernel, third party modules as unknown modules. 1 if the feature is enabled, 0 if disabled. |
| system_processes_total | Number of forks since boot. |
| system_procs_blocked | Number of processes currently blocked. |
| system_procs_running | Number of processes currently running. |
The list above shows what kind of error information is collected.
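Difference between problem_gauge and problem_counter
- problem_gauge = permanent problems
- problem_counter = temporary problems
problem_gauge is set
when a detected problem matches one declared as a condition for that reason in the monitor config.
In that case problem_counter and problem_gauge both go up at the same time.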
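The relationship between the two metrics can be pictured with a toy model. This is only a sketch of the behavior described above, not NPD's actual implementation. The class and method names are hypothetical, and the `resolve` step (gauge returning to 0 once the condition clears) is an assumption based on the gauge's description as "whether a problem is affecting the node or not".

```python
from collections import defaultdict


class ProblemMetrics:
    """Toy model of problem_counter and problem_gauge (not NPD code)."""

    def __init__(self):
        self.counter = defaultdict(int)   # reason -> number of detections
        self.gauge = {}                   # (reason, type) -> 0 or 1

    def report_temporary(self, reason):
        # Temporary problem: only the counter moves.
        self.counter[reason] += 1

    def report_permanent(self, reason, condition_type):
        # Permanent problem: counter and gauge go up together.
        self.counter[reason] += 1
        self.gauge[(reason, condition_type)] = 1

    def resolve(self, reason, condition_type):
        # Assumption: the gauge drops back to 0, the counter never decreases.
        self.gauge[(reason, condition_type)] = 0


if __name__ == "__main__":
    m = ProblemMetrics()
    m.report_temporary("OOMKilling")
    m.report_permanent("FilesystemIsReadOnly", "ReadonlyFilesystem")
    print(dict(m.counter), m.gauge)
```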
How the collected metrics are delivered
= Pull method
NPD exposes an HTTP endpoint, /metrics, so that a metrics server such as Prometheus Server
can fetch the metric data NPD has collected over HTTP.
Because the exporter provides this endpoint, the server sends an HTTP GET request to it and collects the metric data by pulling.
default port
20257
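What a scraper does in the pull model can be sketched in a few lines. The example below uses only the Python standard library. The function names are made up for illustration, and the host defaults to localhost on the assumption that the script runs on a node where NPD listens on the default port.

```python
from urllib.request import urlopen

DEFAULT_PORT = 20257  # NPD's default Prometheus port, as noted above


def metrics_url(host="localhost", port=DEFAULT_PORT):
    """Build the URL of the /metrics endpoint."""
    return f"http://{host}:{port}/metrics"


def fetch_metrics(host="localhost", port=DEFAULT_PORT, timeout=5.0):
    """Send one HTTP GET to /metrics and return the body as text."""
    with urlopen(metrics_url(host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a reachable NPD instance; prints only the problem_* lines.
    for line in fetch_metrics().splitlines():
        if line.startswith("problem_"):
            print(line)
```

A real Prometheus server repeats this GET at every scrape interval; here a single request is enough to see what it would receive.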
How to deploy Node Problem Detector
Run
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: magnum:podsecuritypolicy:node-problem-detector
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: magnum:podsecuritypolicy:privileged
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: npd-binding
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npd
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: ${NODE_PROBLEM_DETECTOR_TAG}
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: ${NODE_PROBLEM_DETECTOR_TAG}
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: ${NODE_PROBLEM_DETECTOR_TAG}
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - name: node-problem-detector
        ports:
        - containerPort: 20257
          hostPort: 20257
          protocol: TCP
        image: ${_gcr_prefix}${NPD_REPO}:${NODE_PROBLEM_DETECTOR_TAG}
        command:
        - "/bin/sh"
        - "-c"
        # Pass both config to support both journald and syslog.
        - "exec /node-problem-detector --logtostderr --system-log-monitors=/config/kernel-monitor.json,/config/kernel-monitor-filelog.json,/config/docker-monitor.json,/config/docker-monitor-filelog.json --logtostderr --enable-k8s-exporter=false --prometheus-address=0.0.0.0 --prometheus-port=20257 --custom-plugin-monitors=/config/network-problem-monitor.json --config.system-stats-monitor=/config/system-stats-monitor.json 2>&1 | tee /var/log/node-problem-detector.log"
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: localtime
        hostPath:
          path: /etc/localtime
          type: "FileOrCreate"
      serviceAccountName: node-problem-detector
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - key: "CriticalAddonsOnly"
        operator: "Exists"
EOF
Checking the metrics NPD has collected
e.g. curl 10.200.230.1:20257/metrics
Verifying NPD detection
See the NPD test examples
Verifying detection of the KernelOops item of problem_counter
Manually append an error line to the kernel log as follows
sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
Then check whether the detection count has increased
curl localhost:20257/metrics
Verifying detection of the OOM (OOMKilling) item of problem_counter
Manually append an error line to the kernel log as follows
sudo sh -c "echo 'Killed process 4070 (iksoon-test) total-vm:8192780kB, anon-rss:7231748kB, file-rss:0kB, shmem-rss:0kB' >> /dev/kmsg"
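Then check whether the detection count has increased
Once a problem is detected, the count does not go back down;
it increases by one each time the problem is detected again.
Verifying detection of the FilesystemIsReadOnly item of problem_gauge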
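Since the count only ever goes up, one way to confirm a detection is to save the /metrics output before and after injecting the log line and compare the two. The following is a minimal sketch of that comparison. The helper names and the two sample snapshots are made up for illustration, and only plain `problem_counter{reason="..."}` lines are handled.

```python
import re

# Matches lines such as: problem_counter{reason="KernelOops"} 1
COUNTER_RE = re.compile(r'^problem_counter\{reason="([^"]+)"\}\s+(\S+)$')


def problem_counters(text):
    """Extract {reason: value} from /metrics output."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            counters[m.group(1)] = float(m.group(2))
    return counters


def increased(before_text, after_text):
    """Return {reason: delta} for every counter that grew between two scrapes."""
    before = problem_counters(before_text)
    after = problem_counters(after_text)
    return {
        reason: value - before.get(reason, 0.0)
        for reason, value in after.items()
        if value > before.get(reason, 0.0)
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after writing to /dev/kmsg.
    before = 'problem_counter{reason="KernelOops"} 0\nproblem_counter{reason="OOMKilling"} 0\n'
    after = 'problem_counter{reason="KernelOops"} 1\nproblem_counter{reason="OOMKilling"} 0\n'
    print(increased(before, after))
```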
sudo sh -c "echo 'Remounting filesystem read-only' >> /dev/kmsg"