Skip to main content Link Menu Expand (external link) Document Search Copy Copied

CloudWatch Container Insight

AWS 쿠버네티스에서 실행되는 컨테이너의 리소스 사용량 및 성능 데이터를 자동으로 수집하여 CloudWatch 대시보드에서 시각화하는 기능으로, AWS 콘솔에서 자동으로 대시보드를 생성해줍니다.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html

Container Insights - Amazon CloudWatch

Container Insights Use CloudWatch Container Insights to collect, aggregate, and summarize metrics and logs from your containerized applications and microservices. Container Insights is available for Amazon Elastic Container Service (Amazon ECS), Amazon Ela

docs.aws.amazon.com

CloudWatch Agent

EC2, EKS 노드 및 애플리케이션의 메트릭과 로그를 수집하여 CloudWatch로 전송하는 AWS 공식 에이전트입니다.

https://hellouz818.tistory.com/33

다음과 같은 환경을 구성하기 위해 Cloudwatch Container Observability를 설치하도록 하겠습니다.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html

Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart - Amazon CloudWatch

Any custom configuration that you provide using additional configuration settings overrides the default configuration used by the agent. Be cautious not to unintentionally disable functionality that is enabled by default, such as Container Insights with en

docs.aws.amazon.com

img

Cloudwatch에 전송할 수 있는 IAM 서비스 어카운트 권한을 먼저 생성한 후 addon을 배포합니다.

# IRSA 생성
❯ eksctl create iamserviceaccount \                                                            
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch --cluster $CLUSTER_NAME \
  --role-name $CLUSTER_NAME-cloudwatch-agent-role \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --role-only \
  --approve
  
# add on 확인
❯ aws eks list-addons --cluster-name myeks --output table
--------------------------
|       ListAddons       |
+------------------------+
||        addons        ||
|+----------------------+|
||  aws-ebs-csi-driver  ||
||  coredns             ||
||  kube-proxy          ||
||  metrics-server      ||
||  vpc-cni             ||
|+----------------------+|

❯ aws eks create-addon --addon-name amazon-cloudwatch-observability --cluster-name myeks --service-account-role-arn arn:aws:iam::xxx:role/myeks-cloudwatch-agent-role
{
    "addon": {
        "addonName": "amazon-cloudwatch-observability",
        "clusterName": "myeks",
        "status": "CREATING",
        "addonVersion": "v3.3.1-eksbuild.1",
        "health": {
            "issues": []
        },
        "addonArn": "arn:aws:eks:ap-northeast-2:xxx:addon/myeks/amazon-cloudwatch-observability/aecaa964-5132-4449-73b3-b47db792e3ed",
        "createdAt": "2025-03-02T02:37:15.576000+09:00",
        "modifiedAt": "2025-03-02T02:37:15.591000+09:00",
        "serviceAccountRoleArn": "arn:aws:iam::xxx:role/myeks-cloudwatch-agent-role",
        "tags": {}
    }
}

# add on 확인
❯ aws eks list-addons --cluster-name myeks --output table
---------------------------------------
|             ListAddons              |
+-------------------------------------+
||              addons               ||
|+-----------------------------------+|
||  amazon-cloudwatch-observability  || # 추가
||  aws-ebs-csi-driver               ||
||  coredns                          ||
||  kube-proxy                       ||
||  metrics-server                   ||
||  vpc-cni                          ||
|+-----------------------------------+|

# 설치 확인
❯ kubectl get crd | grep -i cloudwatch                                                     
amazoncloudwatchagents.cloudwatch.aws.amazon.com   2025-03-01T17:37:34Z
dcgmexporters.cloudwatch.aws.amazon.com            2025-03-01T17:37:35Z
instrumentations.cloudwatch.aws.amazon.com         2025-03-01T17:37:35Z
neuronmonitors.cloudwatch.aws.amazon.com           2025-03-01T17:37:36Z

❯ kubectl get ds,pod,cm,sa,amazoncloudwatchagent -n amazon-cloudwatch                      
NAME                                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
daemonset.apps/cloudwatch-agent                              3         3         3       3            3           kubernetes.io/os=linux     76s
daemonset.apps/cloudwatch-agent-windows                      0         0         0       0            0           kubernetes.io/os=windows   76s
daemonset.apps/cloudwatch-agent-windows-container-insights   0         0         0       0            0           kubernetes.io/os=windows   76s
daemonset.apps/dcgm-exporter                                 0         0         0       0            0           kubernetes.io/os=linux     76s
daemonset.apps/fluent-bit                                    3         3         3       3            3           kubernetes.io/os=linux     83s
daemonset.apps/fluent-bit-windows                            0         0         0       0            0           kubernetes.io/os=windows   83s
daemonset.apps/neuron-monitor                                0         0         0       0            0           <none>                     76s

NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-6f76854n2crw   1/1     Running   0          83s
pod/cloudwatch-agent-9b29g                                            1/1     Running   0          76s
pod/cloudwatch-agent-gjnkc                                            1/1     Running   0          76s
pod/cloudwatch-agent-w7dzp                                            1/1     Running   0          76s
pod/fluent-bit-drkgz                                                  1/1     Running   0          83s
pod/fluent-bit-f2xpd                                                  1/1     Running   0          83s
pod/fluent-bit-m5zh6                                                  1/1     Running   0          83s

NAME                                                    DATA   AGE
configmap/cloudwatch-agent                              1      76s
configmap/cloudwatch-agent-windows                      1      76s
configmap/cloudwatch-agent-windows-container-insights   1      76s
configmap/cwagent-clusterleader                         0      60s
configmap/dcgm-exporter-config-map                      2      76s
configmap/fluent-bit-config                             5      85s
configmap/fluent-bit-windows-config                     5      85s
configmap/kube-root-ca.crt                              1      87s
configmap/neuron-monitor-config-map                     1      76s

NAME                                                                SECRETS   AGE
serviceaccount/amazon-cloudwatch-observability-controller-manager   0         85s
serviceaccount/cloudwatch-agent                                     0         85s
serviceaccount/dcgm-exporter-service-acct                           0         76s
serviceaccount/default                                              0         87s
serviceaccount/neuron-monitor-service-acct                          0         76s

NAME                                                                                          MODE        VERSION   READY   AGE   IMAGE   MANAGEMENT
amazoncloudwatchagent.cloudwatch.aws.amazon.com/cloudwatch-agent                              daemonset   0.0.0             83s           managed
amazoncloudwatchagent.cloudwatch.aws.amazon.com/cloudwatch-agent-windows                      daemonset   0.0.0             82s           managed
amazoncloudwatchagent.cloudwatch.aws.amazon.com/cloudwatch-agent-windows-container-insights   daemonset   0.0.0             83s           managed

❯ kubectl describe clusterrole cloudwatch-agent-role amazon-cloudwatch-observability-manager-role
Name:         cloudwatch-agent-role # CloudWatch agent가 EKS 클러스터 메트릭 및 로그 수집을 위한 권한 - 메트릭 수집할 수 있도록 필요한 리소스에 대한 읽기 권한 부여
Labels:       app.kubernetes.io/instance=amazon-cloudwatch-observability
              app.kubernetes.io/managed-by=EKS
              app.kubernetes.io/name=amazon-cloudwatch-observability
              app.kubernetes.io/version=1.0.0
Annotations:  <none>
PolicyRule:
  Resources                        Non-Resource URLs  Resource Names  Verbs
  ---------                        -----------------  --------------  -----
  configmaps                       []                 []              [create get update]
  events                           []                 []              [create get]
  nodes/stats                      []                 []              [create get]
                                   [/metrics]         []              [get]
  endpoints                        []                 []              [list watch get]
  namespaces                       []                 []              [list watch get]
  nodes/proxy                      []                 []              [list watch get]
  nodes                            []                 []              [list watch get]
  pods/logs                        []                 []              [list watch get]
  pods                             []                 []              [list watch get]
  daemonsets.apps                  []                 []              [list watch get]
  deployments.apps                 []                 []              [list watch get]
  replicasets.apps                 []                 []              [list watch get]
  statefulsets.apps                []                 []              [list watch get]
  endpointslices.discovery.k8s.io  []                 []              [list watch get]
  services                         []                 []              [list watch]
  jobs.batch                       []                 []              [list watch]
                                   [/metrics]         []              [list]
                                   [/metrics]         []              [watch]


Name:         amazon-cloudwatch-observability-manager-role # Cloudwatch Observability Manager 역할 - Cloudwatch와 Kubernetes 통합 관리할 수 있도록 CloudWatch Agent 설정 자동관리 권한 부여
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                                                    Non-Resource URLs  Resource Names  Verbs
  ---------                                                    -----------------  --------------  -----
  configmaps                                                   []                 []              [create delete get list patch update watch]
  serviceaccounts                                              []                 []              [create delete get list patch update watch]
  services                                                     []                 []              [create delete get list patch update watch]
  daemonsets.apps                                              []                 []              [create delete get list patch update watch]
  deployments.apps                                             []                 []              [create delete get list patch update watch]
  statefulsets.apps                                            []                 []              [create delete get list patch update watch]
  ingresses.networking.k8s.io                                  []                 []              [create delete get list patch update watch]
  poddisruptionbudgets.policy                                  []                 []              [create delete get list patch update watch]
  routes.route.openshift.io/custom-host                        []                 []              [create delete get list patch update watch]
  routes.route.openshift.io                                    []                 []              [create delete get list patch update watch]
  leases.coordination.k8s.io                                   []                 []              [create get list update]
  events                                                       []                 []              [create patch]
  namespaces                                                   []                 []              [get list patch update watch]
  amazoncloudwatchagents.cloudwatch.aws.amazon.com             []                 []              [get list patch update watch]
  dcgmexporters.cloudwatch.aws.amazon.com                      []                 []              [get list patch update watch]
  instrumentations.cloudwatch.aws.amazon.com                   []                 []              [get list patch update watch]
  neuronmonitors.cloudwatch.aws.amazon.com                     []                 []              [get list patch update watch]
  replicasets.apps                                             []                 []              [get list watch]
  amazoncloudwatchagents.cloudwatch.aws.amazon.com/finalizers  []                 []              [get patch update]
  amazoncloudwatchagents.cloudwatch.aws.amazon.com/status      []                 []              [get patch update]
  dcgmexporters.cloudwatch.aws.amazon.com/finalizers           []                 []              [get patch update]
  dcgmexporters.cloudwatch.aws.amazon.com/status               []                 []              [get patch update]
  neuronmonitors.cloudwatch.aws.amazon.com/finalizers          []                 []              [get patch update]
  neuronmonitors.cloudwatch.aws.amazon.com/status              []                 []              [get patch update]

와우! Cloudwatch에 로그 수집도 잘 보이고, 이번에는 Fluent Bit만 활성화했을때와는 달리, Container Insights 탭도 활성화 되었습니다

img

img

img

img

테스트용 파드에 부하를 발생시켜 잘 감지해주는지 보도록 하겠습니다.

# NGINX 웹서버 배포
❯ helm repo add bitnami https://charts.bitnami.com/bitnami
❯ helm repo update

# 파라미터 파일 생성
❯ cat <<EOT > nginx-values.yaml
service:
  type: NodePort
  
networkPolicy:
  enabled: false
  
resourcesPreset: "nano"

ingress:
  enabled: true
  ingressClassName: alb
  hostname: nginx.$MyDomain
  pathType: Prefix
  path: /
  annotations: 
    alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
    alb.ingress.kubernetes.io/group.name: study
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
    alb.ingress.kubernetes.io/load-balancer-name: $CLUSTER_NAME-ingress-alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/success-codes: 200-399
    alb.ingress.kubernetes.io/target-type: ip
EOT

# 배포
❯ helm install nginx bitnami/nginx --version 19.0.0 -f nginx-values.yaml

# 접속주소 확인
❯ echo -e "Nginx WebServer URL = https://nginx.$MyDomain"
❯ https://nginx.hellouz818.com/
# 부하 발생
curl -s https://nginx.$MyDomain
yum install -y httpd
**ab** -c 500 -n 30000 https://nginx.$MyDomain/

# 파드 직접 로그 모니터링
kubectl logs deploy/nginx

img

img

kwatch

kwatch는 Kubernetes 클러스터의 모든 변경 사항을 모니터링하고, 실행 중인 앱의 Crash을 실시간으로 감지하고, 알람을 전송하는 오픈소스입니다.

https://github.com/abahmed/kwatch

GitHub - abahmed/kwatch: :eyes: monitor & detect crashes in your Kubernetes(K8s) cluster instantly

:eyes: monitor & detect crashes in your Kubernetes(K8s) cluster instantly - abahmed/kwatch

github.com

❯ NICK=hellouz818

❯ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: kwatch
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kwatch
  namespace: kwatch
data:
  config.yaml: |
    alert:
      slack:
        webhook: 'https://hooks.slack.com/services/xxx'
        title: $NICK-eks
    pvcMonitor:
      enabled: true
      interval: 5
      threshold: 70
EOF

❯ kubectl apply -f https://raw.githubusercontent.com/abahmed/kwatch/v0.8.5/deploy/deploy.yaml
namespace/kwatch unchanged
clusterrole.rbac.authorization.k8s.io/kwatch created
serviceaccount/kwatch created
clusterrolebinding.rbac.authorization.k8s.io/kwatch created
deployment.apps/kwatch created

이제 kwatch가 어떤 상황을 파악하고 알람을 보내주는지 예시 상황을 실습해보겠습니다.

우선 첫 번째로, 잘못된 이미지를 실행하는 파드를 배포 하고 kwatch에서 알람이 가는지 확인해보겠습니다.

❯ cat <<EOF | kubectl apply -f -                                                                                   
apiVersion: v1
kind: Pod
metadata:
  name: nginx-19
spec:
  containers:
  - name: nginx-pod
    image: nginx:1.19.19  # 존재하지 않는 이미지 버전
EOF
pod/nginx-19 created

슬랙으로 컨테이너가 imagepullback에러가 나는 것을 확인하고 알람이 온 것을 알 수 있습니다.

 ❯ ❯ kubectl logs kwatch-587f97fb7f-9829r -n kwatch                                                        ○ myeks 03:49:41
time="2025-03-01T18:45:39Z" level=info msg=":tada: kwatch@v0.8.5 just started!"
time="2025-03-01T18:45:39Z" level=info msg="initializing slack with webhook url: https://hooks.slack.com/services/xxx"
time="2025-03-01T18:45:39Z" level=info msg="sending message: :tada: kwatch@v0.8.5 just started!"
time="2025-03-01T18:45:40Z" level=info msg="starting pod-crash controller"
time="2025-03-01T18:45:40Z" level=info msg="sending message: :tada: A newer version <https://github.com/abahmed/kwatch/releases/tag/v0.10.1|v0.10.1> of Kwatch is available! Please update to the latest version."
time="2025-03-01T18:45:40Z" level=info msg="pod-crash controller synced and ready"
time="2025-03-01T18:48:12Z" level=warning msg="failed to get logs for container nginx-19 in pod nginx-pod@default: the server rejected our request for an unknown reason (get pods nginx-19)"
time="2025-03-01T18:48:12Z" level=info msg="sending event: {Name:nginx-19 Container:nginx-pod Namespace:default Reason:ErrImagePull Events:[2025-03-01 18:48:10 +0000 UTC] Scheduled Successfully assigned default/nginx-19 to ip-192-168-3-43.ap-northeast-2.compute.internal\n[2025-03-01 18:48:11 +0000 UTC] Pulling Pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Failed to pull image \"nginx:1.19.19\": rpc error: code = NotFound desc = failed to pull and unpack image \"docker.io/library/nginx:1.19.19\": failed to resolve reference \"docker.io/library/nginx:1.19.19\": docker.io/library/nginx:1.19.19: not found\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ErrImagePull\n[2025-03-01 18:48:12 +0000 UTC] BackOff Back-off pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ImagePullBackOff Logs:previous terminated container \"nginx-pod\" in pod \"nginx-19\" not found Labels:map[]}"
time="2025-03-01T18:48:12Z" level=info msg="sending to slack event: &{nginx-19 nginx-pod default ErrImagePull [2025-03-01 18:48:10 +0000 UTC] Scheduled Successfully assigned default/nginx-19 to ip-192-168-3-43.ap-northeast-2.compute.internal\n[2025-03-01 18:48:11 +0000 UTC] Pulling Pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Failed to pull image \"nginx:1.19.19\": rpc error: code = NotFound desc = failed to pull and unpack image \"docker.io/library/nginx:1.19.19\": failed to resolve reference \"docker.io/library/nginx:1.19.19\": docker.io/library/nginx:1.19.19: not found\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ErrImagePull\n[2025-03-01 18:48:12 +0000 UTC] BackOff Back-off pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ImagePullBackOff previous terminated container \"nginx-pod\" in pod \"nginx-19\" not found map[]}"
time="2025-03-01T18:48:46Z" level=info msg="pod default/nginx-19 does not exist anymore\n"

PVC와 pod을 배포하겠습니다.

# PVC, Pod 생성
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: test-container
      image: ubuntu
      command: ["/bin/bash", "-c", "sleep infinity"]
      volumeMounts:
        - mountPath: "/data"
          name: storage
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: test-pvc
        
# /data 용량 확인
root@test-pod:/# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    960M   40M  921M   5% /data

# Pod 접근 후 PVC 용량 채우기
❯ kubectl exec -it test-pod -- bash

# 데이터 생성
fallocate -l 700M /data/testfile.img

# /data 확인
df -h /data
root@test-pod:/# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    960M  740M  221M  77% /data

# 100M 추가 후 확인
root@test-pod:/data# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    960M  840M  121M  88% /data