이쿠의 슬기로운 개발생활 (Iku's Wise Developer Life)

Sharing a security developer's EverNote notes so we can grow together


Kubernetes Monitoring: What Is Node Problem Detector (NPD)?

이쿠우우 2024. 9. 22. 21:25

Node Problem Detector

 

What is Node Problem Detector (NPD)?

Node Problem Detector is an open-source daemon that monitors node health
and detects common node problems such as hardware, kernel, and container runtime issues.
It is typically deployed on Kubernetes as a DaemonSet.
 

Metrics layer collected

=  system metrics
Reports "error information" that occurred on the host
 
 

Link for checking the collected metrics

The set of metrics NPD collects varies widely depending on the settings in its config
 

Checking the NPD configuration

--system-log-monitors=
--custom-plugin-monitors=
--config.system-stats-monitor=
 
The problem metrics are listed below
  • problem_counter{reason="AUFSUmountHung"} 0
  • problem_counter{reason="ConntrackFull"} 0
  • problem_counter{reason="DockerHung"} 0
  • problem_counter{reason="Ext4Error"} 0
  • problem_counter{reason="Ext4Warning"} 0
  • problem_counter{reason="FilesystemIsReadOnly"} 0
  • problem_counter{reason="IOError"} 0
  • problem_counter{reason="KernelOops"} 0
  • problem_counter{reason="MemoryReadError"} 0
  • problem_counter{reason="OOMKilling"} 0
  • problem_counter{reason="TaskHung"} 15
  • problem_counter{reason="UnregisterNetDevice"} 0
  • problem_gauge{reason="AUFSUmountHung",type="KernelDeadlock"} 0
  • problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
  • problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
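With this many series it is easier to look only at the reasons that have actually fired. Below is a minimal sketch, assuming the metrics text has the format shown above. `nonzero_problems` is a hypothetical helper name, and the sample is a trimmed copy of the list above, not live output.

```shell
#!/bin/sh
# Trimmed copy of the metric list above; on a real node this text would
# come from the NPD /metrics endpoint instead.
sample_metrics='problem_counter{reason="KernelOops"} 0
problem_counter{reason="OOMKilling"} 0
problem_counter{reason="TaskHung"} 15
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0'

# nonzero_problems (hypothetical helper): read metrics text on stdin and
# print "reason value" for every problem_counter series whose value is not 0.
nonzero_problems() {
  awk 'index($0, "problem_counter{") == 1 && $NF != 0 {
         split($0, parts, "\"")   # parts[2] holds the reason label value
         print parts[2], $NF
       }'
}

printf '%s\n' "$sample_metrics" | nonzero_problems
```

For the sample above this leaves only the TaskHung line, the one reason with a non-zero count.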
 
Description of all metrics
  • cpu_load_15m: CPU average load (15m)
  • cpu_load_1m: CPU average load (1m)
  • cpu_load_5m: CPU average load (5m)
  • cpu_runnable_task_count: The average number of runnable tasks in the run-queue during the last minute
  • cpu_usage_time: CPU usage, in seconds
  • disk_avg_queue_len: The average queue length on the disk
  • disk_bytes_used: Disk bytes used, in Bytes
  • disk_io_time: The IO time spent on the disk, in ms
  • disk_merged_operation_count: Disk merged operations count
  • disk_operation_bytes_count: Bytes transferred in disk operations
  • disk_operation_count: Disk operations count
  • disk_operation_time: Time spent in disk operations, in ms
  • disk_weighted_io: The weighted IO on the disk, in ms
  • host_uptime: The uptime of the operating system
  • memory_anonymous_used: Anonymous memory usage, in Bytes. Summing values of all states yields the total anonymous memory used.
  • memory_bytes_used: Memory usage by each memory state, in Bytes. Summing values of all states yields the total memory on the node.
  • memory_dirty_used: Dirty pages usage, in Bytes. Dirty means the memory is waiting to be written back to disk, and writeback means the memory is actively being written back to disk.
  • memory_page_cache_used: Page cache memory usage, in Bytes. Summing values of all states yields the total page cache memory used.
  • memory_unevictable_used: Unevictable memory usage, in Bytes
  • problem_counter: Number of times a specific type of problem has occurred.
  • problem_gauge: Whether a specific type of problem is affecting the node or not.
  • system_cpu_stat: Cumulative time each cpu spent in various stages.
  • system_interrupts_total: Total number of interrupts serviced (cumulative).
  • system_os_feature: OS Features like GPU support, KTD kernel, third party modules as unknown modules. 1 if the feature is enabled, 0 if disabled.
  • system_processes_total: Number of forks since boot.
  • system_procs_blocked: Number of processes currently blocked.
  • system_procs_running: Number of processes currently running.
 
Which error information is collected is defined in the monitor configs (e.g. the files passed to --system-log-monitors).
 
 
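Difference between problem_gauge and problem_counter
  • problem_gauge = permanent problems (whether the problem is currently affecting the node)
  • problem_counter = temporary problems (how many times the problem has occurred)
 
problem_gauge changes when a log line matches a rule in the monitor config that specifies a condition.
The same match also increments problem_counter, so for those rules problem_counter and problem_gauge rise together.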
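One way to see this difference in the metrics themselves: a reason that also has a problem_gauge series is a permanent problem, while a reason that appears only under problem_counter is temporary. Below is a rough sketch under that assumption. `classify_reason` is a hypothetical helper, and the sample lines are copied from the metric list earlier in this post.

```shell
#!/bin/sh
# Sample lines copied from the metric list earlier in this post.
sample_metrics='problem_counter{reason="FilesystemIsReadOnly"} 0
problem_counter{reason="TaskHung"} 15
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0'

# classify_reason REASON (hypothetical helper): read metrics text on stdin
# and print "permanent" if REASON has a problem_gauge series, otherwise
# "temporary". A reason that is missing entirely also prints "temporary".
classify_reason() {
  awk -v reason="$1" '
    index($0, "problem_gauge{") == 1 && index($0, "reason=\"" reason "\"") { found = 1 }
    END { print (found ? "permanent" : "temporary") }'
}

printf '%s\n' "$sample_metrics" | classify_reason TaskHung
printf '%s\n' "$sample_metrics" | classify_reason FilesystemIsReadOnly
```

TaskHung has only a counter, so it is classified as temporary. FilesystemIsReadOnly shows up under both metrics, which is the "rise together" case.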

 

How the collected metrics are delivered

=  Pull
NPD exposes a /metrics HTTP endpoint so that a metrics server such as Prometheus Server
can fetch the metric data NPD has collected.
The server sends an HTTP GET request to that endpoint and collects the metric data by pulling it.
 
 

default port

20257
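Scraping by hand amounts to a single HTTP GET against that port. Below is a small sketch, assuming NPD listens on the default port and is reachable from where the script runs. `npd_metrics_url` and `scrape_npd` are hypothetical helper names, not part of NPD.

```shell
#!/bin/sh
# Default Prometheus port of NPD; can be overridden through the environment.
NPD_PORT="${NPD_PORT:-20257}"

# npd_metrics_url HOST [PORT]: print the scrape URL for one node.
npd_metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "${2:-$NPD_PORT}"
}

# scrape_npd HOST [PORT]: send the same plain HTTP GET a metrics server
# sends on each scrape. -f makes curl fail on HTTP errors and --max-time
# bounds how long it waits.
scrape_npd() {
  curl -fsS --max-time 5 "$(npd_metrics_url "$@")"
}

# Print the URL that would be scraped (no request is sent here).
npd_metrics_url localhost
```

On a node running NPD, `scrape_npd localhost` would print the raw metrics text.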

 

How to deploy Node Problem Detector

 
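Run (set the ${...} shell variables used in the manifest first; the heredoc expands them)
cat <<EOF | kubectl apply -f -
---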
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: magnum:podsecuritypolicy:node-problem-detector
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: magnum:podsecuritypolicy:privileged
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: npd-binding
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npd
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: ${NODE_PROBLEM_DETECTOR_TAG}
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: ${NODE_PROBLEM_DETECTOR_TAG}
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: ${NODE_PROBLEM_DETECTOR_TAG}
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - name: node-problem-detector
        ports:
        - containerPort: 20257
          hostPort: 20257
          protocol: TCP
        image: ${_gcr_prefix}${NPD_REPO}:${NODE_PROBLEM_DETECTOR_TAG}
        command:
        - "/bin/sh"
        - "-c"
        # Pass both config to support both journald and syslog.
        - "exec /node-problem-detector --logtostderr --system-log-monitors=/config/kernel-monitor.json,/config/kernel-monitor-filelog.json,/config/docker-monitor.json,/config/docker-monitor-filelog.json --enable-k8s-exporter=false --prometheus-address=0.0.0.0 --prometheus-port=20257 --custom-plugin-monitors=/config/network-problem-monitor.json --config.system-stats-monitor=/config/system-stats-monitor.json 2>&1 | tee /var/log/node-problem-detector.log"
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: localtime
        hostPath:
          path: /etc/localtime
          type: "FileOrCreate"
      serviceAccountName: node-problem-detector
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - key: "CriticalAddonsOnly"
        operator: "Exists"
EOF
 
 

Checking the metrics NPD has collected

 
e.g. curl 10.200.230.1:20257/metrics

 

Verifying NPD detection

 
See the NPD test examples
 

Verifying detection of the KernelOops item in problem_counter

Manually append an error log line to the kernel log as follows
sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
 
Then check whether the detection count has increased
curl localhost:20257/metrics

 

 

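Verifying detection of the OOM (OOMKilling) item in problem_counter

Manually append an error log line to the kernel log as follows
sudo sh -c "echo 'Killed process 4070 (iksoon-test) total-vm:8192780kB, anon-rss:7231748kB, file-rss:0kB, shmem-rss:0kB' >> /dev/kmsg"
 
Then check whether the detection count has increased

After a detection the count does not go back down; it keeps its value
and increases again each time the problem is detected.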
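The before/after check can be scripted instead of eyeballing the curl output. Below is a minimal sketch, assuming counter values are printed as plain integers as in the list earlier in this post. `counter_value` and `increased` are hypothetical helper names, and the sample lines are copied from that list.

```shell
#!/bin/sh
# counter_value REASON: read metrics text on stdin and print the value of
# problem_counter{reason="REASON"} (prints nothing if the series is missing).
counter_value() {
  awk -v series="problem_counter{reason=\"$1\"}" '$1 == series { print $2 }'
}

# increased BEFORE AFTER: succeed only if the counter went up.
# Assumes both values are plain integers.
increased() {
  [ "$2" -gt "$1" ]
}

# Sample lines copied from the metric list earlier in this post.
sample_metrics='problem_counter{reason="OOMKilling"} 0
problem_counter{reason="TaskHung"} 15'

printf '%s\n' "$sample_metrics" | counter_value TaskHung

# Intended use on a node running NPD (not executed here):
#   before=$(curl -s localhost:20257/metrics | counter_value OOMKilling)
#   ... write the test line to /dev/kmsg as shown above ...
#   after=$(curl -s localhost:20257/metrics | counter_value OOMKilling)
#   increased "$before" "$after" && echo "OOMKilling detected"
```

Because the counter only goes up, comparing two snapshots is enough; there is no need to catch the exact moment the log line is processed.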
 

Verifying detection of the FilesystemIsReadOnly item in problem_gauge

 
sudo sh -c "echo 'Remounting filesystem read-only' >> /dev/kmsg"


 

