A Monitoring and Logging Solution for Kubernetes Clusters

Introduction

A production Kubernetes cluster needs a solid monitoring and logging setup. This article walks through metrics monitoring with Prometheus + Grafana, and log collection with the EFK stack (Elasticsearch + Fluentd + Kibana).

I. Monitoring Architecture Design

┌─────────────┐
│ Kubernetes  │
│   Cluster   │
└──────┬──────┘
       │
       ├─────> Prometheus (metrics collection)
       │       └─────> Grafana (visualization)
       │
       └─────> Fluentd (log collection)
               └─────> Elasticsearch (storage)
                       └─────> Kibana (queries)

II. Deploying Prometheus

1. Install with Helm

# Add the Prometheus chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
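
Once the chart is installed, a quick sanity check confirms the operator, Prometheus, and Grafana pods came up (resource names below assume the release name prometheus used above):

# Verify the stack is running
kubectl get pods -n monitoring
kubectl get prometheus,alertmanager -n monitoring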

2. Configure a ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    release: prometheus   # by default, kube-prometheus-stack only picks up monitors carrying the Helm release label
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
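
A ServiceMonitor selects Services, not Pods, and without a namespaceSelector it only looks in its own namespace. A minimal sketch of a Service this monitor would match (the myapp name and port 9090 are assumptions for illustration):

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: monitoring
  labels:
    app: myapp        # matched by the ServiceMonitor's selector
spec:
  ports:
  - name: metrics     # must match the port name in the ServiceMonitor endpoint
    port: 9090
    targetPort: 9090
  selector:
    app: myapp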

3. Custom Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: prometheus   # same release-label requirement as the ServiceMonitor above
spec:
  groups:
  - name: app
    interval: 30s
    rules:
    - alert: HighErrorRate
      # Ratio of 5xx responses to all responses, so the 5% threshold matches the description
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is above 5% for 5 minutes"
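
Once applied, the rule should show up in the Prometheus UI under Status → Rules; from the CLI you can at least confirm the object was admitted:

kubectl get prometheusrule -n monitoring app-alerts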

III. Configuring Grafana Dashboards

1. Access Grafana

# Port-forward the Grafana service
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Default username: admin
# Retrieve the password
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode

2. Import Common Dashboards

Recommended dashboard IDs:

• 3119 - Kubernetes cluster monitoring

• 7249 - Kubernetes cluster

• 6417 - Kubernetes Pod

3. Custom Panels

# CPU usage
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage (working set)
container_memory_working_set_bytes{namespace="production"}

# Pod restart count
kube_pod_container_status_restarts_total

# API server latency (P99)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
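
If the application exposes a request counter such as http_requests_total (an assumption; substitute your own metric), an overall error-ratio panel pairs well with the alert rule defined earlier:

# Error ratio across all pods (5xx / total)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))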

IV. Deploying the EFK Logging Stack

1. Deploy Elasticsearch (create the namespace first: kubectl create namespace logging)

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  clusterIP: None   # headless, so StatefulSet pods get stable per-pod DNS names
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
  selector:
    app: elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # Three-node discovery: seed via the headless service, bootstrap the initial masters
        - name: discovery.seed_hosts
          value: "elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        # Demo only: allows plain-HTTP access from Fluentd/Kibana; enable security and TLS in production
        - name: xpack.security.enabled
          value: "false"
        - name: ES_JAVA_OPTS
          value: "-Xms2g -Xmx2g"
        ports:
        - containerPort: 9200
        - containerPort: 9300
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
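
Once all three pods are Running, the cluster should report green (or yellow briefly while replicas settle). A quick check from inside a pod (recent official images ship curl):

kubectl exec -n logging elasticsearch-0 -- curl -s "http://localhost:9200/_cluster/health?pretty"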

2. Deploy Fluentd

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
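        # Needed only with the Docker runtime; with containerd the log files
        # live under /var/log/pods, which the /var/log mount above already covers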
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
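
The DaemonSet above references a fluentd ServiceAccount that still needs to exist, since the image enriches log records with pod metadata from the API server. A minimal RBAC sketch (read-only access to pods and namespaces; tighten further for production):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging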

3. Deploy Kibana

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:8.11.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
        ports:
        - containerPort: 5601
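
For a first look, a port-forward straight to the Deployment is enough (add a Service or Ingress for regular access):

kubectl port-forward -n logging deploy/kibana 5601:5601
# Then open http://localhost:5601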

V. Log Query Best Practices

1. Kibana Query Syntax (Lucene)

# Logs from a specific pod (note: wildcards are not interpreted inside quotes)
kubernetes.pod_name:myapp-* AND kubernetes.namespace_name:"production"

# Error logs
log:"error" OR log:"exception" OR log:"fatal"

# A specific time range
@timestamp:[now-1h TO now]

# HTTP 500 errors
status:500 AND kubernetes.labels.app:"api"

2. Create Alerts

# Illustrative alert-rule definition (pseudo-config; the exact fields depend on the Kibana version and rule type)
{
  "name": "High Error Rate",
  "schedule": {
    "interval": "5m"
  },
  "conditions": {
    "query": "log:error",
    "threshold": 100
  },
  "actions": [
    {
      "type": "email",
      "to": "ops@example.com"
    }
  ]
}

VI. Key Metrics Explained

Key Metrics

Cluster level:

• CPU/memory utilization

• Node health

• API server response time

• etcd performance

Application level:

• Pod restart count

• Container resource usage

• HTTP request QPS

• Error rate

• Response-time percentiles (P50, P95, P99); see the example query after this list
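
For the percentile bullet above, a typical Grafana query against a Prometheus histogram looks like this (http_request_duration_seconds_bucket is an assumed metric name; use whatever your app exports):

# Application P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le))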

VII. Troubleshooting Workflow

1. Spot an anomaly on a Grafana dashboard
   ↓
2. Check the Prometheus alert details
   ↓
3. Inspect pod status with kubectl
   kubectl get pods -n production
   kubectl describe pod <pod-name>
   ↓
4. Review application logs in Kibana
   ↓
5. Correlate metrics and logs to locate the root cause
   ↓
6. Fix the issue and verify recovery
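
A few commands that often help at steps 3-4 (pod and namespace names are placeholders):

kubectl logs <pod-name> -n production --tail=100
kubectl logs <pod-name> -n production --previous   # logs from the last crashed container
kubectl get events -n production --sort-by=.lastTimestamp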

VIII. Cost Optimization Tips

• Log retention tiers: 7 days hot, 30 days warm, 90 days cold (see the ILM sketch after this list)

• Log sampling: keep only ~10% of high-frequency log entries

• Prometheus data compaction and downsampling

• Set sensible resource requests and limits
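
For the retention tiers above, Elasticsearch ILM can handle the lifecycle automatically. A sketch of a matching policy, as sent from Kibana Dev Tools (the k8s-logs name and phase timings follow the 7/30/90-day split; adjust to taste):

PUT _ilm/policy/k8s-logs
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" } } },
      "warm":   { "min_age": "7d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "cold":   { "min_age": "30d", "actions": {} },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}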

Summary

A solid monitoring and logging stack is foundational infrastructure for any production environment. Prometheus + Grafana delivers powerful metrics monitoring, while EFK provides flexible log search. Start small and grow the monitoring system incrementally.


💬 Comments (11)

James Wilson
2026-02-22
Comprehensive guide! We're implementing this exact stack for our production cluster. The alert rules are particularly helpful.
李军
2026-02-22
Extremely detailed! My company is about to build a K8s monitoring system, and this article answers most of our questions.
Maria Garcia
2026-02-22
Quick question: What's the recommended retention period for Prometheus data in a medium-sized cluster?
张伟 (author)
2026-02-22
@Maria For most cases, 15-30 days is good. Use remote storage (like Thanos) for long-term retention if needed.
王强
2026-02-23
How should we plan the Elasticsearch cluster size? We generate roughly 50GB of logs per day.
张伟 (author)
2026-02-23
@王强 At 50GB/day, start with 3 nodes and 500GB of storage per node, retaining 30 days. Be sure to configure an ILM policy to delete old data automatically.
Raj Kumar
2026-02-23
Have you considered Loki as an alternative to EFK? It's supposed to be more lightweight and cost-effective.
赵丽
2026-02-24
Could you share your Grafana dashboard configs? Setting them up from scratch takes a lot of time.
Sarah Johnson
2026-02-24
Excellent article! The cost optimization section is really valuable. We cut our logging costs by 40% following these tips.
陈浩
2026-02-25
Bookmarked! The troubleshooting workflow section is especially practical; I've printed it out and pinned it above my desk 😄
David Lee
2026-02-26
@Raj I second that! We switched to Loki and it's been great. Much simpler to maintain than Elasticsearch.
