Kubernetes Cluster Monitoring and Logging
📅 2026-02-22 | ⏱️ 18 min read
Introduction
A production Kubernetes cluster needs solid monitoring and logging. This article walks through metrics monitoring with Prometheus + Grafana and log collection with EFK (Elasticsearch + Fluentd + Kibana).
I. Monitoring Architecture Design
┌─────────────┐
│ Kubernetes  │
│   Cluster   │
└──────┬──────┘
       │
       ├─────> Prometheus (metrics collection)
       │         └─────> Grafana (visualization)
       │
       └─────> Fluentd (log collection)
                 └─────> Elasticsearch (storage)
                           └─────> Kibana (querying)
II. Deploying Prometheus
1. Install with Helm
# Add the Prometheus community chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
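Once you need more settings than two --set flags, the same values fit better in a values file. A minimal sketch; the key paths follow the kube-prometheus-stack chart's values layout:

# values.yaml -- equivalent to the --set flags above
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi

# Install with the values file instead of --set flags
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml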
2. Configure a ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    release: prometheus   # by default kube-prometheus-stack only discovers ServiceMonitors carrying the release label
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
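spec.selector matches labels on a Service, and endpoints[].port refers to a named Service port, so the scraped Service must look roughly like the sketch below (the port number is an assumption; note that without a spec.namespaceSelector the ServiceMonitor only discovers Services in its own namespace):

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: monitoring    # same namespace as the ServiceMonitor, since no namespaceSelector is set
  labels:
    app: myapp             # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: myapp
  ports:
  - name: metrics          # matched by endpoints[].port above
    port: 8080             # assumed metrics port of the application
    targetPort: 8080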
3. Custom Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: prometheus   # same discovery requirement as the ServiceMonitor above
spec:
  groups:
  - name: app
    interval: 30s
    rules:
    - alert: HighErrorRate
      # Ratio of 5xx requests to all requests, so the 0.05 threshold really means 5%
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is above 5% for 5 minutes"
III. Configuring Grafana Dashboards
1. Access Grafana
# Port-forward the Grafana Service
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Default username: admin
# Retrieve the generated password
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
2. Import Common Dashboards
Recommended dashboard IDs from grafana.com:
• 3119 - Kubernetes cluster monitoring
• 7249 - Kubernetes cluster
• 6417 - Kubernetes Pod
3. Custom Panels
# CPU usage
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage (working set)
container_memory_working_set_bytes{namespace="production"}

# Pod restart count
kube_pod_container_status_restarts_total

# API server latency (P99 by verb)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
IV. Deploying the EFK Logging Stack
1. Deploy Elasticsearch
Create the namespace first (kubectl create namespace logging), then apply the manifests below.
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  clusterIP: None          # headless Service so StatefulSet Pods get stable DNS names
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
  selector:
    app: elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # With three replicas the nodes must discover each other;
        # discovery.type=single-node would leave each Pod as its own one-node cluster.
        - name: discovery.seed_hosts
          value: "elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms2g -Xmx2g"
        - name: xpack.security.enabled
          value: "false"   # keeps the example simple; enable security and TLS in production
        ports:
        - name: http
          containerPort: 9200
        - name: transport
          containerPort: 9300
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
2. Deploy Fluentd
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
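The DaemonSet references serviceAccountName: fluentd, which is never defined above. Fluentd reads Pod and namespace metadata from the API server to enrich each log record, so it needs RBAC; a minimal sketch, with names chosen to match the DaemonSet:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging

Note that the /var/lib/docker/containers mount only matters on Docker-runtime nodes; on containerd clusters the container logs live under /var/log/pods, which the /var/log mount already covers.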
3. Deploy Kibana
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:8.11.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
        ports:
        - containerPort: 5601
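The Deployment alone is not reachable from a browser; a ClusterIP Service plus a port-forward is the quickest way in (an Ingress would be the production-grade route):

apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  ports:
  - port: 5601
  selector:
    app: kibana

# Then open http://localhost:5601
kubectl port-forward -n logging svc/kibana 5601:5601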
V. Log Query Best Practices
1. Kibana Query Syntax
The examples below use Lucene query syntax; switch the query bar from KQL to Lucene, since the timestamp range form is Lucene-only.
# Logs from specific Pods
kubernetes.pod_name:myapp-* AND kubernetes.namespace_name:"production"

# Error logs
log:"error" OR log:"exception" OR log:"fatal"

# A specific time range
@timestamp:[now-1h TO now]

# HTTP 500 errors
status:500 AND kubernetes.labels.app:"api"
2. Create Alerts
# Shape of an alert rule created in Kibana -- illustrative sketch only;
# the real Kibana Alerting API payload uses different field names
{
  "name": "High Error Rate",
  "schedule": {
    "interval": "5m"
  },
  "conditions": {
    "query": "log:error",
    "threshold": 100
  },
  "actions": [
    {
      "type": "email",
      "to": "ops@example.com"
    }
  ]
}
VI. Monitoring Metrics in Detail
Key Metrics
Cluster level:
• CPU/memory utilization
• Node health
• API server response time
• etcd performance
Application level (turned into recording rules in the sketch after this list):
• Pod restart count
• Container resource usage
• HTTP request QPS
• Error rate
• Latency percentiles (P50, P95, P99)
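The application-level metrics map naturally onto Prometheus recording rules, so dashboards and alerts reuse precomputed series instead of re-evaluating heavy queries on every refresh. A minimal sketch; the metric names (http_requests_total, http_request_duration_seconds_bucket) and the app label are assumptions about what your application exposes:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: app-sli
    rules:
    - record: app:http_requests:rate5m       # QPS
      expr: sum(rate(http_requests_total[5m])) by (app)
    - record: app:http_errors:ratio5m        # error rate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (app) / sum(rate(http_requests_total[5m])) by (app)
    - record: app:http_latency:p99_5m        # P99 latency
      expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, app))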
VII. Troubleshooting Workflow
1. Spot the anomaly on a Grafana dashboard
   ↓
2. Check the firing alert's details in Prometheus
   ↓
3. Inspect Pod state with kubectl (a fuller triage sequence follows this list)
   kubectl get pods -n production
   kubectl describe pod <pod-name>
   ↓
4. Read the application logs in Kibana
   ↓
5. Correlate metrics and logs to locate the root cause
   ↓
6. Fix the issue and verify recovery
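A slightly fuller version of the kubectl commands in step 3; the pod name is a placeholder:

kubectl get pods -n production
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --previous      # logs from the last crashed container, if any
kubectl get events -n production --sort-by=.lastTimestamp
kubectl top pod -n production                         # requires metrics-server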
VIII. Cost Optimization Tips
• Log retention tiers: hot data 7 days, warm 30 days, cold 90 days (see the ILM sketch below)
• Sample high-frequency logs: keep roughly 10% of high-volume entries
• Prometheus compacts its TSDB automatically; for long-term downsampling pair it with Thanos or VictoriaMetrics
• Set sensible resource requests and limits
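The retention tiers in the first bullet map onto an Elasticsearch ILM policy. A minimal sketch; the policy name and rollover thresholds are assumptions, and the delete phase at 90 days marks the end of the cold tier:

PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "shrink": { "number_of_shards": 1 } }
      },
      "cold": {
        "min_age": "30d",
        "actions": { "set_priority": { "priority": 0 } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}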
Summary
Solid monitoring and logging are foundational infrastructure for any production environment. Prometheus + Grafana delivers powerful metrics monitoring, and EFK provides flexible log search. Start small and build out the monitoring stack incrementally.