07 - Observability and SRE
A complete Kubernetes stack for monitoring, logging, and distributed tracing
Learning Objectives
By the end of this module you will have mastered:
- The three pillars of Kubernetes observability (metrics, logs, traces)
- The Prometheus + Grafana monitoring stack
- Log collection and aggregation
- Distributed tracing
- Event analysis and audit logging
- Building an SLI/SLO/SLA framework
1. Observability Architecture Overview
The Three Pillars of Observability
┌─────────────────────────────────────────────────────────────┐
│                     Observability Stack                     │
├─────────────────────────────────────────────────────────────┤
│ Metrics                                                     │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Prometheus + Grafana + AlertManager                     │ │
│ │ kube-state-metrics + node-exporter                      │ │
│ └─────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Logs                                                        │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Fluent Bit / Vector + Loki / Elasticsearch              │ │
│ │ Application + audit + system logs                       │ │
│ └─────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Traces                                                      │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry + Jaeger / Tempo                          │ │
│ │ Distributed call-chain analysis                         │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Observation Targets by Layer
Layer | Targets | Key metrics | Tooling |
---|---|---|---|
Application | Pods, containers, apps | QPS, latency, error rate | Prometheus, APM |
Platform | Deployments, Services | Availability, replica counts | kube-state-metrics |
Infrastructure | Nodes, network, storage | CPU, memory, disk, network | node-exporter, cAdvisor |
Control plane | API server, etcd, Scheduler | Latency, throughput, queue depth | built-in component metrics |
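To make the table concrete, here are sample PromQL queries for each layer. The metric names come from the exporters deployed later in this module (cAdvisor, kube-state-metrics, node-exporter, and the API server's built-in metrics):
# Application layer: per-Pod container CPU usage (cAdvisor)
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
# Platform layer: Deployments with unavailable replicas (kube-state-metrics)
kube_deployment_status_replicas_unavailable > 0
# Infrastructure layer: node memory utilization (node-exporter)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Control-plane layer: API server P99 request latency
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))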
2. The Prometheus Monitoring Stack
2.1 Prometheus Architecture
graph TD
    A[Apps / Services] -->|expose metrics| B[Prometheus Server]
    C[kube-state-metrics] -->|K8s object metrics| B
    D[node-exporter] -->|node metrics| B
    E[cAdvisor] -->|container metrics| B
    B -->|stores to| F[TSDB]
    B -->|queried by| G[Grafana]
    B -->|fires alerts to| H[AlertManager]
    H -->|notifies| I[Email / DingTalk / Slack]
2.2 Deploying and Configuring Prometheus
Deploying Prometheus with plain manifests (the Prometheus Operator is a common production alternative, but the examples below use a raw Deployment plus ConfigMap)
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
# ServiceAccount and RBAC for Prometheus service discovery
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- configmaps
verbs: ["get"]
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- "/etc/prometheus/rules/*.yml"
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.45.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: rules
          mountPath: /etc/prometheus/rules   # matched by rule_files in prometheus.yml
        - name: storage
          mountPath: /prometheus
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
      volumes:
      - name: config
        configMap:
          name: prometheus-config
      - name: rules
        configMap:
          name: prometheus-rules   # defined in section 6.2
      - name: storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
selector:
app: prometheus
ports:
- port: 9090
targetPort: 9090
type: ClusterIP
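The kubernetes-pods scrape job above only keeps Pods that carry the prometheus.io/scrape annotation. A minimal sketch of a workload that opts in; the app name, image, and port 8080 are hypothetical:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical example app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        prometheus.io/scrape: "true"    # picked up by the relabel rule above
        prometheus.io/path: "/metrics"  # optional; defaults to /metrics
        prometheus.io/port: "8080"      # port that serves metrics
    spec:
      containers:
      - name: my-app
        image: my-app:latest            # hypothetical image
        ports:
        - containerPort: 8080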
2.3 Deploying kube-state-metrics
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
- apiGroups: ["apps"]
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
- apiGroups: ["batch"]
resources:
- cronjobs
- jobs
verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
resources:
- horizontalpodautoscalers
verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: monitoring
labels:
app: kube-state-metrics
spec:
selector:
app: kube-state-metrics
ports:
- name: http-metrics
port: 8080
targetPort: 8080
- name: telemetry
port: 8081
targetPort: 8081
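With kube-state-metrics scraped, Kubernetes object state becomes queryable in PromQL. A few examples using its documented metric set:
# Pods not in the Running phase, by namespace
sum(kube_pod_status_phase{phase!="Running"}) by (namespace, phase)
# Gap between desired and available replicas per Deployment
kube_deployment_spec_replicas - kube_deployment_status_replicas_available
# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0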
2.4 Deploying node-exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.6.1
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host/root'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- containerPort: 9100
name: metrics
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
readOnly: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
spec:
selector:
app: node-exporter
ports:
- name: metrics
port: 9100
targetPort: 9100
2.5 Deploying Grafana
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:10.1.0
ports:
- containerPort: 3000
env:
- name: GF_SECURITY_ADMIN_PASSWORD
value: "admin123"
- name: GF_SERVER_ROOT_URL
value: "http://grafana.example.com"
volumeMounts:
- name: storage
mountPath: /var/lib/grafana
- name: datasources
mountPath: /etc/grafana/provisioning/datasources
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
volumes:
- name: storage
emptyDir: {}
- name: datasources
configMap:
name: grafana-datasources
---
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: monitoring
data:
datasources.yaml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
spec:
selector:
app: grafana
ports:
- port: 3000
targetPort: 3000
type: LoadBalancer
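The manifest above embeds the admin password as a plain-text env var for brevity; in practice you would source it from a Secret. A sketch, where the Secret name and key are assumptions:
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin          # hypothetical Secret
  namespace: monitoring
type: Opaque
stringData:
  admin-password: "change-me"
---
# In the Grafana container spec, replace the plain-text env var with:
# env:
# - name: GF_SECURITY_ADMIN_PASSWORD
#   valueFrom:
#     secretKeyRef:
#       name: grafana-admin
#       key: admin-password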
3. Log Collection and Aggregation
3.1 Log Collection with Fluent Bit
apiVersion: v1
kind: ServiceAccount
metadata:
name: fluent-bit
namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: fluent-bit
rules:
- apiGroups: [""]
resources:
- namespaces
- pods
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: fluent-bit
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: fluent-bit
subjects:
- kind: ServiceAccount
name: fluent-bit
namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Daemon Off
Log_Level info
Parsers_File parsers.conf
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser docker
Tag kube.*
Refresh_Interval 5
Mem_Buf_Limit 50MB
Skip_Long_Lines On
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[OUTPUT]
Name loki
Match *
Host loki
Port 3100
Labels job=fluentbit
parsers.conf: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: logging
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
serviceAccountName: fluent-bit
containers:
- name: fluent-bit
image: fluent/fluent-bit:2.1
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: config
mountPath: /fluent-bit/etc/
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: config
configMap:
name: fluent-bit-config
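Note that the tail input above uses the docker JSON parser and the DaemonSet mounts /var/lib/docker/containers, which assumes a Docker runtime. On containerd or CRI-O clusters (the default since Kubernetes 1.24), container log lines use the CRI format instead, so you need a CRI parser; /var/log/containers symlinks into /var/log/pods, so the varlog mount already covers those paths. A sketch of the additions:
# parsers.conf: add a CRI parser for containerd / CRI-O log lines
[PARSER]
    Name        cri
    Format      regex
    Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

# fluent-bit.conf: point the tail input at the CRI parser instead of docker
# [INPUT]
#     Name    tail
#     Parser  cri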
3.2 Log Storage with Loki
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: loki
namespace: logging
spec:
serviceName: loki
replicas: 1
selector:
matchLabels:
app: loki
template:
metadata:
labels:
app: loki
spec:
containers:
- name: loki
image: grafana/loki:2.9.0
args:
- -config.file=/etc/loki/loki.yaml
ports:
- containerPort: 3100
name: http
volumeMounts:
- name: config
mountPath: /etc/loki
- name: storage
mountPath: /loki
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
volumes:
- name: config
configMap:
name: loki-config
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
name: loki-config
namespace: logging
data:
loki.yaml: |
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
chunk_idle_period: 5m
chunk_retain_period: 30s
schema_config:
configs:
- from: 2023-01-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/index_cache
shared_store: filesystem
filesystem:
directory: /loki/chunks
limits_config:
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h
chunk_store_config:
max_look_back_period: 0s
table_manager:
retention_deletes_enabled: false
retention_period: 0s
---
apiVersion: v1
kind: Service
metadata:
name: loki
namespace: logging
spec:
selector:
app: loki
ports:
- port: 3100
targetPort: 3100
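Once Fluent Bit is shipping into Loki, logs are queried with LogQL, for example from Grafana's Explore view with a Loki data source. A few examples against the job=fluentbit label configured above:
# All log lines shipped by Fluent Bit
{job="fluentbit"}
# Only lines containing "error"
{job="fluentbit"} |= "error"
# Aggregate log throughput over the last 5 minutes, as a metric
sum(rate({job="fluentbit"}[5m]))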
4. Distributed Tracing
4.1 Deploying the OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: tracing
data:
otel-collector-config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
    exporters:
      # the dedicated jaeger exporter was removed from the Collector;
      # Jaeger 1.35+ ingests OTLP natively, so export over OTLP instead
      otlp:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      logging:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: tracing
spec:
replicas: 1
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector:0.88.0
args:
- --config=/etc/otel/otel-collector-config.yaml
ports:
- containerPort: 4317
name: otlp-grpc
- containerPort: 4318
name: otlp-http
volumeMounts:
- name: config
mountPath: /etc/otel
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: tracing
spec:
selector:
app: otel-collector
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
- name: otlp-http
port: 4318
targetPort: 4318
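Applications instrumented with an OpenTelemetry SDK just need to be pointed at the Collector Service. A sketch of the container environment for an OTLP/gRPC SDK, using the standard OTEL_* variables; my-app is a hypothetical service name:
# in the application container spec:
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://otel-collector.tracing.svc:4317"
- name: OTEL_SERVICE_NAME
  value: "my-app"                     # hypothetical service name
- name: OTEL_TRACES_SAMPLER
  value: "parentbased_traceidratio"
- name: OTEL_TRACES_SAMPLER_ARG
  value: "0.1"                        # sample 10% of root traces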
4.2 Deploying Jaeger
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: tracing
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
- name: COLLECTOR_OTLP_ENABLED
value: "true"
        ports:
        - containerPort: 4317    # OTLP gRPC (enabled via COLLECTOR_OTLP_ENABLED)
          protocol: TCP
        - containerPort: 4318    # OTLP HTTP
          protocol: TCP
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP
        - containerPort: 16686   # Jaeger UI
          protocol: TCP
        - containerPort: 14250   # Jaeger gRPC collector
          protocol: TCP
        - containerPort: 14268   # Jaeger HTTP collector
          protocol: TCP
        - containerPort: 14269
          protocol: TCP
        - containerPort: 9411    # Zipkin
          protocol: TCP
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-collector
namespace: tracing
spec:
selector:
app: jaeger
  ports:
  - name: otlp-grpc              # used by the Collector's otlp exporter above
    port: 4317
    targetPort: 4317
  - name: jaeger-collector-grpc
    port: 14250
    targetPort: 14250
  - name: jaeger-collector-http
    port: 14268
    targetPort: 14268
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-query
namespace: tracing
spec:
selector:
app: jaeger
ports:
- name: query-http
port: 16686
targetPort: 16686
type: LoadBalancer
5. Event Analysis and Auditing
5.1 Monitoring Kubernetes Events
# List all events, sorted by time
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# List events in a specific namespace
kubectl get events -n production --sort-by='.lastTimestamp'
# List events for a specific object
kubectl get events --field-selector involvedObject.name=my-pod
# Watch events in real time
kubectl get events -w
# Filter events by type
kubectl get events --field-selector type=Warning
# Filter events by reason
kubectl get events --field-selector reason=Failed
5.2 API Server Audit Log Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: audit-policy
namespace: kube-system
data:
audit-policy.yaml: |
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
    # Do not log read-only requests
- level: None
verbs: ["get", "list", "watch"]
    # Log Secrets at the Metadata level only
- level: Metadata
resources:
- group: ""
resources: ["secrets"]
    # Log request and response bodies for resources in the core, apps, and batch groups
- level: RequestResponse
resources:
- group: ""
- group: "apps"
- group: "batch"
    # Default: log nothing else
- level: None
---
# Add the following to the kube-apiserver startup flags:
# --audit-policy-file=/etc/kubernetes/audit-policy.yaml
# --audit-log-path=/var/log/kubernetes/audit.log
# --audit-log-maxage=30
# --audit-log-maxbackup=10
# --audit-log-maxsize=100
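On kubeadm clusters the kube-apiserver runs as a static Pod, so the policy file and the log directory must also be mounted into its manifest (/etc/kubernetes/manifests/kube-apiserver.yaml). A sketch matching the flag paths above:
# under the kube-apiserver container:
volumeMounts:
- name: audit-policy
  mountPath: /etc/kubernetes/audit-policy.yaml
  readOnly: true
- name: audit-log
  mountPath: /var/log/kubernetes
# under the Pod spec:
volumes:
- name: audit-policy
  hostPath:
    path: /etc/kubernetes/audit-policy.yaml
    type: File
- name: audit-log
  hostPath:
    path: /var/log/kubernetes
    type: DirectoryOrCreate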
6. AlertManager Alerting Configuration
6.1 Deploying AlertManager
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager@example.com'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical'
- match:
severity: warning
receiver: 'warning'
receivers:
- name: 'default'
email_configs:
- to: 'team@example.com'
- name: 'critical'
email_configs:
- to: 'oncall@example.com'
webhook_configs:
- url: 'http://webhook.example.com/alert'
- name: 'warning'
email_configs:
- to: 'team@example.com'
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.26.0
args:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
ports:
- containerPort: 9093
volumeMounts:
- name: config
mountPath: /etc/alertmanager
- name: storage
mountPath: /alertmanager
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: config
configMap:
name: alertmanager-config
- name: storage
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
spec:
selector:
app: alertmanager
ports:
- port: 9093
targetPort: 9093
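To verify routing and receivers without waiting for Prometheus to fire, you can push a synthetic alert straight into AlertManager's v2 API. A sketch using port-forward:
# Forward the AlertManager port locally
kubectl port-forward -n monitoring svc/alertmanager 9093:9093 &
# Post a test alert with severity=critical to exercise the 'critical' route
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "critical"},
        "annotations": {"summary": "Synthetic test alert"}}]'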
6.2 Prometheus Alerting Rules
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
alert-rules.yml: |
groups:
- name: kubernetes-cluster
interval: 30s
rules:
      # Node not ready
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is not ready"
description: "Node {{ $labels.node }} has been not ready for more than 5 minutes."
      # Pod crash looping
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times within 15 minutes."
      # High node memory usage
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% on {{ $labels.instance }}."
      # High node CPU usage
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% on {{ $labels.instance }}."
      # High disk usage
      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is above 90% on {{ $labels.instance }} ({{ $labels.mountpoint }})."
      # Pod stuck in Pending
- alert: PodPending
expr: kube_pod_status_phase{phase="Pending"} == 1
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is pending"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been pending for more than 10 minutes."
      # Deployment replica shortfall
      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch"
          description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }}: available replicas have differed from the desired count ({{ $value }}) for 5 minutes."
7. Command Cheat Sheet
Monitoring commands
# Show node resource usage
kubectl top nodes
# Show Pod resource usage across all namespaces
kubectl top pods -A
# Fetch the API server's /metrics endpoint
kubectl get --raw /metrics
# Filter for apiserver_* metrics
kubectl get --raw /metrics | grep apiserver
# List events sorted by time
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# List Prometheus scrape targets
curl http://prometheus:9090/api/v1/targets
# List alerting rules
curl http://prometheus:9090/api/v1/rules
# List active alerts
curl http://prometheus:9090/api/v1/alerts
Logging commands
# View Pod logs
kubectl logs <pod-name>
# View logs of a specific container in a Pod
kubectl logs <pod-name> -c <container-name>
# View logs from the previous container instance
kubectl logs <pod-name> --previous
# Follow logs in real time
kubectl logs -f <pod-name>
# View the last N lines
kubectl logs <pod-name> --tail=100
# View logs from a time window
kubectl logs <pod-name> --since=1h
# View logs from all containers in the Pod
kubectl logs <pod-name> --all-containers
Tracing commands
# Open the Jaeger UI via port-forward
kubectl port-forward -n tracing svc/jaeger-query 16686:16686
# Check OpenTelemetry Collector status
kubectl get pods -n tracing -l app=otel-collector
# View Collector logs
kubectl logs -n tracing -l app=otel-collector
8. Core Interview Q&A
Q1: What are the three pillars of Kubernetes observability?
Key points:
- Metrics: Prometheus + Grafana; track system and application state
- Logs: Fluent Bit/Loki; collect and analyze log data
- Traces: OpenTelemetry + Jaeger; analyze distributed call chains
- Together the three provide complete observability of the system
Q2: How does Prometheus work?
Key points:
- Pull model: scrapes metrics from targets on a schedule
- Time-series database (TSDB) for storage
- PromQL: a powerful query language
- Service discovery: automatically finds Kubernetes resources
- Alerting rules: fire alerts from metric expressions
Q3: How do you collect logs from a Kubernetes cluster?
Key points:
- Deploy a log collector (Fluent Bit/Vector) as a DaemonSet
- Tail container logs from /var/log/containers on each node
- Enrich records with metadata from the Kubernetes API
- Ship to centralized storage (Loki/Elasticsearch)
- Configure log filtering and parsing rules
Q4: What is distributed tracing?
Key points:
- Follows a request across its entire call chain through microservices
- A Trace ID correlates all related Spans
- OpenTelemetry provides a unified, vendor-neutral tracing standard
- Helps locate performance bottlenecks and failure points
- Visualizes service dependencies
Q5: How do you design SLI/SLO/SLA?
Key points:
- SLI (Service Level Indicator): availability, latency, error rate
- SLO (Service Level Objective): e.g. 99.9% availability
- SLA (Service Level Agreement): the service level promised externally
- Set realistic targets based on business requirements
- Validate and refine them with monitoring data
9. Troubleshooting
Common monitoring issues
1. Prometheus cannot scrape metrics
# Check the Prometheus configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
# View Prometheus logs
kubectl logs -n monitoring -l app=prometheus
# Check target status
curl http://prometheus:9090/api/v1/targets
# Check the scrape annotations on a target Pod
kubectl get pod <pod-name> -o yaml | grep prometheus.io
2. Incomplete log collection
# Check Fluent Bit status
kubectl get pods -n logging -l app=fluent-bit
# View Fluent Bit logs
kubectl logs -n logging -l app=fluent-bit
# Check the log path on the node
kubectl exec -n logging <fluent-bit-pod> -- ls -la /var/log/containers
# Verify the configuration
kubectl get configmap fluent-bit-config -n logging -o yaml
3. Alerts not firing
# Check the alerting rules
curl http://prometheus:9090/api/v1/rules
# List active alerts
curl http://prometheus:9090/api/v1/alerts
# Check the AlertManager configuration
kubectl get configmap alertmanager-config -n monitoring -o yaml
# View AlertManager logs
kubectl logs -n monitoring -l app=alertmanager
4. No data in Grafana
# Check the data source configuration
kubectl get configmap grafana-datasources -n monitoring -o yaml
# Test the Prometheus connection from inside the Grafana Pod
kubectl exec -n monitoring <grafana-pod> -- curl 'http://prometheus:9090/api/v1/query?query=up'
# View Grafana logs
kubectl logs -n monitoring -l app=grafana
10. Best Practices
Monitoring design recommendations
Metrics collection
- Keep the scrape interval moderate (15-30s is usually enough)
- Set a sensible data retention period
- Trim labels and series via relabeling (see the sketch after these lists)
- Monitor the control-plane components
Log management
- Emit structured (e.g. JSON) logs
- Set appropriate log levels
- Configure log rotation
- Redact sensitive information
Alerting strategy
- Avoid alert fatigue
- Set realistic thresholds
- Tier alerts by severity
- Review and tune alerts regularly
Performance tuning
- Keep series cardinality under control
- Use fast local storage for the TSDB
- Set resource requests and limits
- Clean up stale data regularly
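For the label-trimming and cardinality items above, Prometheus supports metric_relabel_configs on each scrape job, applied after the scrape but before storage. A sketch, where the metric and label names are hypothetical:
scrape_configs:
- job_name: 'kubernetes-pods'
  # ... kubernetes_sd_configs and relabel_configs as in section 2.2 ...
  metric_relabel_configs:
  # drop an expensive histogram outright (hypothetical metric name)
  - source_labels: [__name__]
    regex: 'http_request_duration_seconds_bucket'
    action: drop
  # strip a high-cardinality label but keep the series (hypothetical label)
  - regex: 'pod_template_hash'
    action: labeldrop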
SRE practice recommendations
SLI definitions (see the recording-rule sketch after these lists)
- Availability: 99.9%
- Latency: P95 < 500ms
- Error rate: < 0.1%
SLO setting
- Derive targets from business requirements
- Keep an error budget in reserve
- Review and adjust periodically
Incident response
- Maintain runbooks
- Automate remediation where possible
- Hold blameless post-incident reviews
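As a sketch of how the SLIs above can be computed, the following Prometheus recording rules assume the application exposes the conventional http_requests_total counter and http_request_duration_seconds histogram; the exact names depend on your instrumentation:
groups:
- name: sli-recording-rules
  rules:
  # Error-rate SLI: share of 5xx responses over 5 minutes
  - record: sli:http_error_rate:ratio5m
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  # Latency SLI: P95 request duration
  - record: sli:http_request_duration:p95_5m
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  # Availability SLI: fraction of successful (non-5xx) requests
  - record: sli:availability:ratio5m
    expr: 1 - sli:http_error_rate:ratio5m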
11. Summary
Having worked through this module, you have covered:
- The three pillars of Kubernetes observability
- The Prometheus + Grafana monitoring stack
- Log collection with Fluent Bit/Loki
- Distributed tracing with OpenTelemetry
- Event analysis and audit logging
- AlertManager alerting configuration
- Building an SLI/SLO/SLA framework
- Monitoring troubleshooting skills
Next step: continue with 08 - Reliability Operations to dig into Kubernetes backup, upgrade, and disaster-recovery strategies.