CloudWatch Container Insight
AWS 쿠버네티스에서 실행되는 컨테이너의 리소스 사용량 및 성능 데이터를 자동으로 수집하여 CloudWatch 대시보드에서 시각화하는 기능으로, AWS 콘솔에서 자동으로 대시보드를 생성해줍니다.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html
Container Insights - Amazon CloudWatch
Container Insights Use CloudWatch Container Insights to collect, aggregate, and summarize metrics and logs from your containerized applications and microservices. Container Insights is available for Amazon Elastic Container Service (Amazon ECS), Amazon Ela
docs.aws.amazon.com
CloudWatch Agent
EC2, EKS 노드 및 애플리케이션의 메트릭과 로그를 수집하여 CloudWatch로 전송하는 AWS 공식 에이전트입니다.
https://hellouz818.tistory.com/33
다음과 같은 환경을 구성하기 위해 Cloudwatch Container Observability를 설치하도록 하겠습니다.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart - Amazon CloudWatch
Any custom configuration that you provide using additional configuration settings overrides the default configuration used by the agent. Be cautious not to unintentionally disable functionality that is enabled by default, such as Container Insights with en
docs.aws.amazon.com

Cloudwatch에 전송할 수 있는 IAM 서비스 어카운트 권한을 먼저 생성한 후 addon을 배포합니다.
# IRSA 생성
❯ eksctl create iamserviceaccount \
--name cloudwatch-agent \
--namespace amazon-cloudwatch --cluster $CLUSTER_NAME \
--role-name $CLUSTER_NAME-cloudwatch-agent-role \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--role-only \
--approve
# add on 확인
❯ aws eks list-addons --cluster-name myeks --output table
--------------------------
| ListAddons |
+------------------------+
|| addons ||
|+----------------------+|
|| aws-ebs-csi-driver ||
|| coredns ||
|| kube-proxy ||
|| metrics-server ||
|| vpc-cni ||
|+----------------------+|
❯ aws eks create-addon --addon-name amazon-cloudwatch-observability --cluster-name myeks --service-account-role-arn arn:aws:iam::xxx:role/myeks-cloudwatch-agent-role
{
"addon": {
"addonName": "amazon-cloudwatch-observability",
"clusterName": "myeks",
"status": "CREATING",
"addonVersion": "v3.3.1-eksbuild.1",
"health": {
"issues": []
},
"addonArn": "arn:aws:eks:ap-northeast-2:xxx:addon/myeks/amazon-cloudwatch-observability/aecaa964-5132-4449-73b3-b47db792e3ed",
"createdAt": "2025-03-02T02:37:15.576000+09:00",
"modifiedAt": "2025-03-02T02:37:15.591000+09:00",
"serviceAccountRoleArn": "arn:aws:iam::xxx:role/myeks-cloudwatch-agent-role",
"tags": {}
}
}
# add on 확인
❯ aws eks list-addons --cluster-name myeks --output table
---------------------------------------
| ListAddons |
+-------------------------------------+
|| addons ||
|+-----------------------------------+|
|| amazon-cloudwatch-observability || # 추가
|| aws-ebs-csi-driver ||
|| coredns ||
|| kube-proxy ||
|| metrics-server ||
|| vpc-cni ||
|+-----------------------------------+|
# 설치 확인
❯ kubectl get crd | grep -i cloudwatch
amazoncloudwatchagents.cloudwatch.aws.amazon.com 2025-03-01T17:37:34Z
dcgmexporters.cloudwatch.aws.amazon.com 2025-03-01T17:37:35Z
instrumentations.cloudwatch.aws.amazon.com 2025-03-01T17:37:35Z
neuronmonitors.cloudwatch.aws.amazon.com 2025-03-01T17:37:36Z
❯ kubectl get ds,pod,cm,sa,amazoncloudwatchagent -n amazon-cloudwatch
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/cloudwatch-agent 3 3 3 3 3 kubernetes.io/os=linux 76s
daemonset.apps/cloudwatch-agent-windows 0 0 0 0 0 kubernetes.io/os=windows 76s
daemonset.apps/cloudwatch-agent-windows-container-insights 0 0 0 0 0 kubernetes.io/os=windows 76s
daemonset.apps/dcgm-exporter 0 0 0 0 0 kubernetes.io/os=linux 76s
daemonset.apps/fluent-bit 3 3 3 3 3 kubernetes.io/os=linux 83s
daemonset.apps/fluent-bit-windows 0 0 0 0 0 kubernetes.io/os=windows 83s
daemonset.apps/neuron-monitor 0 0 0 0 0 <none> 76s
NAME READY STATUS RESTARTS AGE
pod/amazon-cloudwatch-observability-controller-manager-6f76854n2crw 1/1 Running 0 83s
pod/cloudwatch-agent-9b29g 1/1 Running 0 76s
pod/cloudwatch-agent-gjnkc 1/1 Running 0 76s
pod/cloudwatch-agent-w7dzp 1/1 Running 0 76s
pod/fluent-bit-drkgz 1/1 Running 0 83s
pod/fluent-bit-f2xpd 1/1 Running 0 83s
pod/fluent-bit-m5zh6 1/1 Running 0 83s
NAME DATA AGE
configmap/cloudwatch-agent 1 76s
configmap/cloudwatch-agent-windows 1 76s
configmap/cloudwatch-agent-windows-container-insights 1 76s
configmap/cwagent-clusterleader 0 60s
configmap/dcgm-exporter-config-map 2 76s
configmap/fluent-bit-config 5 85s
configmap/fluent-bit-windows-config 5 85s
configmap/kube-root-ca.crt 1 87s
configmap/neuron-monitor-config-map 1 76s
NAME SECRETS AGE
serviceaccount/amazon-cloudwatch-observability-controller-manager 0 85s
serviceaccount/cloudwatch-agent 0 85s
serviceaccount/dcgm-exporter-service-acct 0 76s
serviceaccount/default 0 87s
serviceaccount/neuron-monitor-service-acct 0 76s
NAME MODE VERSION READY AGE IMAGE MANAGEMENT
amazoncloudwatchagent.cloudwatch.aws.amazon.com/cloudwatch-agent daemonset 0.0.0 83s managed
amazoncloudwatchagent.cloudwatch.aws.amazon.com/cloudwatch-agent-windows daemonset 0.0.0 82s managed
amazoncloudwatchagent.cloudwatch.aws.amazon.com/cloudwatch-agent-windows-container-insights daemonset 0.0.0 83s managed
❯ kubectl describe clusterrole cloudwatch-agent-role amazon-cloudwatch-observability-manager-role
Name: cloudwatch-agent-role # CloudWatch agent가 EKS 클러스터 메트릭 및 로그 수집을 위한 권한 - 메트릭 수집할 수 있도록 필요한 리소스에 대한 읽기 권한 부여
Labels: app.kubernetes.io/instance=amazon-cloudwatch-observability
app.kubernetes.io/managed-by=EKS
app.kubernetes.io/name=amazon-cloudwatch-observability
app.kubernetes.io/version=1.0.0
Annotations: <none>
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
configmaps [] [] [create get update]
events [] [] [create get]
nodes/stats [] [] [create get]
[/metrics] [] [get]
endpoints [] [] [list watch get]
namespaces [] [] [list watch get]
nodes/proxy [] [] [list watch get]
nodes [] [] [list watch get]
pods/logs [] [] [list watch get]
pods [] [] [list watch get]
daemonsets.apps [] [] [list watch get]
deployments.apps [] [] [list watch get]
replicasets.apps [] [] [list watch get]
statefulsets.apps [] [] [list watch get]
endpointslices.discovery.k8s.io [] [] [list watch get]
services [] [] [list watch]
jobs.batch [] [] [list watch]
[/metrics] [] [list]
[/metrics] [] [watch]
Name: amazon-cloudwatch-observability-manager-role # Cloudwatch Observability Manager 역할 - Cloudwatch와 Kubernetes 통합 관리할 수 있도록 CloudWatch Agent 설정 자동관리 권한 부여
Labels: <none>
Annotations: <none>
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
configmaps [] [] [create delete get list patch update watch]
serviceaccounts [] [] [create delete get list patch update watch]
services [] [] [create delete get list patch update watch]
daemonsets.apps [] [] [create delete get list patch update watch]
deployments.apps [] [] [create delete get list patch update watch]
statefulsets.apps [] [] [create delete get list patch update watch]
ingresses.networking.k8s.io [] [] [create delete get list patch update watch]
poddisruptionbudgets.policy [] [] [create delete get list patch update watch]
routes.route.openshift.io/custom-host [] [] [create delete get list patch update watch]
routes.route.openshift.io [] [] [create delete get list patch update watch]
leases.coordination.k8s.io [] [] [create get list update]
events [] [] [create patch]
namespaces [] [] [get list patch update watch]
amazoncloudwatchagents.cloudwatch.aws.amazon.com [] [] [get list patch update watch]
dcgmexporters.cloudwatch.aws.amazon.com [] [] [get list patch update watch]
instrumentations.cloudwatch.aws.amazon.com [] [] [get list patch update watch]
neuronmonitors.cloudwatch.aws.amazon.com [] [] [get list patch update watch]
replicasets.apps [] [] [get list watch]
amazoncloudwatchagents.cloudwatch.aws.amazon.com/finalizers [] [] [get patch update]
amazoncloudwatchagents.cloudwatch.aws.amazon.com/status [] [] [get patch update]
dcgmexporters.cloudwatch.aws.amazon.com/finalizers [] [] [get patch update]
dcgmexporters.cloudwatch.aws.amazon.com/status [] [] [get patch update]
neuronmonitors.cloudwatch.aws.amazon.com/finalizers [] [] [get patch update]
neuronmonitors.cloudwatch.aws.amazon.com/status [] [] [get patch update]
와우! Cloudwatch에 로그 수집도 잘 보이고, 이번에는 Fluent Bit만 활성화했을때와는 달리, Container Insights 탭도 활성화 되었습니다




테스트용 파드에 부하를 발생시켜 잘 감지해주는지 보도록 하겠습니다.
# NGINX 웹서버 배포
❯ helm repo add bitnami https://charts.bitnami.com/bitnami
❯ helm repo update
# 파라미터 파일 생성
❯ cat <<EOT > nginx-values.yaml
service:
type: NodePort
networkPolicy:
enabled: false
resourcesPreset: "nano"
ingress:
enabled: true
ingressClassName: alb
hostname: nginx.$MyDomain
pathType: Prefix
path: /
annotations:
alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
alb.ingress.kubernetes.io/group.name: study
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
alb.ingress.kubernetes.io/load-balancer-name: $CLUSTER_NAME-ingress-alb
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/ssl-redirect: "443"
alb.ingress.kubernetes.io/success-codes: 200-399
alb.ingress.kubernetes.io/target-type: ip
EOT
# 배포
❯ helm install nginx bitnami/nginx --version 19.0.0 -f nginx-values.yaml
# 접속주소 확인
❯ echo -e "Nginx WebServer URL = https://nginx.$MyDomain"
❯ https://nginx.hellouz818.com/
# 부하 발생
curl -s https://nginx.$MyDomain
yum install -y httpd
**ab** -c 500 -n 30000 https://nginx.$MyDomain/
# 파드 직접 로그 모니터링
kubectl logs deploy/nginx


kwatch
kwatch는 Kubernetes 클러스터의 모든 변경 사항을 모니터링하고, 실행 중인 앱의 Crash을 실시간으로 감지하고, 알람을 전송하는 오픈소스입니다.
https://github.com/abahmed/kwatch
GitHub - abahmed/kwatch:
monitor & detect crashes in your Kubernetes(K8s) cluster instantly
monitor & detect crashes in your Kubernetes(K8s) cluster instantly - abahmed/kwatch
github.com
❯ NICK=hellouz818
❯ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
name: kwatch
---
apiVersion: v1
kind: ConfigMap
metadata:
name: kwatch
namespace: kwatch
data:
config.yaml: |
alert:
slack:
webhook: 'https://hooks.slack.com/services/xxx'
title: $NICK-eks
pvcMonitor:
enabled: true
interval: 5
threshold: 70
EOF
❯ kubectl apply -f https://raw.githubusercontent.com/abahmed/kwatch/v0.8.5/deploy/deploy.yaml
namespace/kwatch unchanged
clusterrole.rbac.authorization.k8s.io/kwatch created
serviceaccount/kwatch created
clusterrolebinding.rbac.authorization.k8s.io/kwatch created
deployment.apps/kwatch created
이제 kwatch가 어떤 상황을 파악하고 알람을 보내주는지 예시 상황을 실습해보겠습니다.
우선 첫 번째로, 잘못된 이미지를 실행하는 파드를 배포 하고 kwatch에서 알람이 가는지 확인해보겠습니다.
❯ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: nginx-19
spec:
containers:
- name: nginx-pod
image: nginx:1.19.19 # 존재하지 않는 이미지 버전
EOF
pod/nginx-19 created
슬랙으로 컨테이너가 imagepullback에러가 나는 것을 확인하고 알람이 온 것을 알 수 있습니다.
❯ ❯ kubectl logs kwatch-587f97fb7f-9829r -n kwatch ○ myeks 03:49:41
time="2025-03-01T18:45:39Z" level=info msg=":tada: kwatch@v0.8.5 just started!"
time="2025-03-01T18:45:39Z" level=info msg="initializing slack with webhook url: https://hooks.slack.com/services/xxx"
time="2025-03-01T18:45:39Z" level=info msg="sending message: :tada: kwatch@v0.8.5 just started!"
time="2025-03-01T18:45:40Z" level=info msg="starting pod-crash controller"
time="2025-03-01T18:45:40Z" level=info msg="sending message: :tada: A newer version <https://github.com/abahmed/kwatch/releases/tag/v0.10.1|v0.10.1> of Kwatch is available! Please update to the latest version."
time="2025-03-01T18:45:40Z" level=info msg="pod-crash controller synced and ready"
time="2025-03-01T18:48:12Z" level=warning msg="failed to get logs for container nginx-19 in pod nginx-pod@default: the server rejected our request for an unknown reason (get pods nginx-19)"
time="2025-03-01T18:48:12Z" level=info msg="sending event: {Name:nginx-19 Container:nginx-pod Namespace:default Reason:ErrImagePull Events:[2025-03-01 18:48:10 +0000 UTC] Scheduled Successfully assigned default/nginx-19 to ip-192-168-3-43.ap-northeast-2.compute.internal\n[2025-03-01 18:48:11 +0000 UTC] Pulling Pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Failed to pull image \"nginx:1.19.19\": rpc error: code = NotFound desc = failed to pull and unpack image \"docker.io/library/nginx:1.19.19\": failed to resolve reference \"docker.io/library/nginx:1.19.19\": docker.io/library/nginx:1.19.19: not found\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ErrImagePull\n[2025-03-01 18:48:12 +0000 UTC] BackOff Back-off pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ImagePullBackOff Logs:previous terminated container \"nginx-pod\" in pod \"nginx-19\" not found Labels:map[]}"
time="2025-03-01T18:48:12Z" level=info msg="sending to slack event: &{nginx-19 nginx-pod default ErrImagePull [2025-03-01 18:48:10 +0000 UTC] Scheduled Successfully assigned default/nginx-19 to ip-192-168-3-43.ap-northeast-2.compute.internal\n[2025-03-01 18:48:11 +0000 UTC] Pulling Pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Failed to pull image \"nginx:1.19.19\": rpc error: code = NotFound desc = failed to pull and unpack image \"docker.io/library/nginx:1.19.19\": failed to resolve reference \"docker.io/library/nginx:1.19.19\": docker.io/library/nginx:1.19.19: not found\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ErrImagePull\n[2025-03-01 18:48:12 +0000 UTC] BackOff Back-off pulling image \"nginx:1.19.19\"\n[2025-03-01 18:48:12 +0000 UTC] Failed Error: ImagePullBackOff previous terminated container \"nginx-pod\" in pod \"nginx-19\" not found map[]}"
time="2025-03-01T18:48:46Z" level=info msg="pod default/nginx-19 does not exist anymore\n"
PVC와 pod을 배포하겠습니다.
# PVC, Pod 생성
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: test-container
image: ubuntu
command: ["/bin/bash", "-c", "sleep infinity"]
volumeMounts:
- mountPath: "/data"
name: storage
volumes:
- name: storage
persistentVolumeClaim:
claimName: test-pvc
# /data 용량 확인
root@test-pod:/# df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 960M 40M 921M 5% /data
# Pod 접근 후 PVC 용량 채우기
❯ kubectl exec -it test-pod -- bash
# 데이터 생성
fallocate -l 700M /data/testfile.img
# /data 확인
df -h /data
root@test-pod:/# df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 960M 740M 221M 77% /data
# 100M 추가 후 확인
root@test-pod:/data# df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 960M 840M 121M 88% /data