15 - 监控系统设计
> 面试频率:
| 需求类型 | 目标 |
|---|---|
| 采集频率 | 10-60 秒 |
| 数据保留 | 原始数据 7 天,聚合数据 90 天 |
| 查询延迟 | < 1 秒 |
| 告警延迟 | < 1 分钟 |
| 可用性 | 99.9% |
| 扩展性 | 支持 10 万+ 指标 |
1.3 面试官可能的追问
Q1: 监控系统和日志系统有什么区别?
A1:
- 监控:关注指标、趋势、告警(Metrics)
- 日志:关注事件、上下文、问题排查(Logs)
- 追踪:关注调用链、性能分析(Traces)
Q2: Prometheus 和 Zabbix 有什么区别?
A2:
- Prometheus:Pull 模型,时序数据库,云原生
- Zabbix:Agent 主动(Push)与被动(Pull)模式均支持,基于关系型数据库,偏传统运维
- 选择:云原生选 Prometheus,传统基础设施选 Zabbix
2. 容量估算
2.1 场景假设
假设监控 1000 台服务器 + 500 个应用:
- 服务器:1000 台
- 每台指标数:100 个(CPU、内存、磁盘等)
- 应用实例:5000 个
- 每实例指标数:50 个
- 采集频率:15 秒
2.2 数据量估算
服务器指标:
1000 台 × 100 指标 × 4 采样/分钟 = 40 万 数据点/分钟
应用指标:
5000 实例 × 50 指标 × 4 采样/分钟 = 100 万 数据点/分钟
总计:140 万 数据点/分钟 ≈ 23,333 数据点/秒
2.3 存储估算
单个数据点 = 时间戳(8字节) + 值(8字节) = 16 字节
原始数据(7天):
140 万/分钟 × 60 × 24 × 7 ≈ 141 亿样本 × 16 字节 ≈ 226 GB
实际 Prometheus 压缩后约 1-2 字节/样本:
141 亿样本 × 1.5 字节 ≈ 21 GB(约 15~30 GB)
2.4 查询 QPS 估算
Dashboard 数量 = 200 个
平均刷新间隔 = 30 秒
每个 Dashboard 查询数 = 10
查询 QPS = 200 × 10 / 30 ≈ 67 QPS
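为了方便调整假设后快速重新估算,可以把上面的计算写成一小段 Go 脚本(仅为示意,常量取自上文假设):
package main

import "fmt"

func main() {
    const (
        servers          = 1000 // 服务器数量
        metricsPerServer = 100  // 每台服务器指标数
        appInstances     = 5000 // 应用实例数
        metricsPerApp    = 50   // 每实例指标数
        scrapePerMinute  = 4    // 15 秒采集一次
        bytesPerSample   = 16.0 // 时间戳 8 字节 + 值 8 字节
        compressedBytes  = 1.5  // Prometheus 压缩后约 1-2 字节/样本,取中间值
        retentionDays    = 7    // 原始数据保留天数
    )

    perMinute := (servers*metricsPerServer + appInstances*metricsPerApp) * scrapePerMinute
    totalSamples := float64(perMinute) * 60 * 24 * retentionDays

    fmt.Printf("数据点: %d/分钟, %.0f/秒\n", perMinute, float64(perMinute)/60)
    fmt.Printf("原始存储(%d天): %.0f GB\n", retentionDays, totalSamples*bytesPerSample/1e9)
    fmt.Printf("压缩后存储: %.0f GB\n", totalSamples*compressedBytes/1e9)

    // 查询 QPS:200 个 Dashboard × 10 个查询 / 30 秒刷新
    fmt.Printf("查询 QPS: %.0f\n", 200.0*10/30)
}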
3. 架构设计
3.1 整体架构
┌─────────────────────────────────────────────────────────────┐
│ 数据源 │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ 服务器 │ │ 应用 │ │ 数据库 │ │ 中间件 │ │
│ │(Node) │ │(App) │ │(MySQL) │ │(Redis) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼─────────────┼─────────────┼─────────────┼─────────┘
│ │ │ │
┌─────────────────────────────────────────────────────────────┐
│ 采集层 │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node │ │ App │ │ MySQL │ │ Filebeat │ │
│ │ Exporter │ │ Exporter │ │ Exporter │ │(日志) │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└─────────────┬───────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ 存储层 │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Prometheus │ │ Elasticsearch│ │ Jaeger │ │
│ │ (指标) │ │ (日志) │ │ (链路) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────┬───────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ 查询 & 告警层 │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ AlertManager │ │ Grafana │ │
│ │ (告警) │ │ (可视化) │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ 通知层 │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ 邮件 │ │ 短信 │ │ 钉钉 │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
3.2 V1: 基于 Prometheus + Grafana
适用场景:中小型公司、微服务架构
package main
import (
"fmt"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// ==================== 自定义 Exporter ====================
// Metrics 指标定义
var (
// Counter: 单调递增
requestTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
// Gauge: 可增可减
activeConnections = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
// Histogram: 分布统计
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request latencies",
Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1, 2, 5},
},
[]string{"method", "endpoint"},
)
// Summary: 分位数统计
requestSize = prometheus.NewSummaryVec(
prometheus.SummaryOpts{
Name: "http_request_size_bytes",
Help: "HTTP request sizes",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"method", "endpoint"},
)
)
func init() {
// 注册指标
prometheus.MustRegister(requestTotal)
prometheus.MustRegister(activeConnections)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(requestSize)
}
// MetricsMiddleware 指标采集中间件
func MetricsMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// 增加活跃连接数
activeConnections.Inc()
defer activeConnections.Dec()
// 包装 ResponseWriter 以获取状态码
rw := &responseWriter{ResponseWriter: w, statusCode: 200}
// 处理请求
next.ServeHTTP(rw, r)
// 记录指标
duration := time.Since(start).Seconds()
requestTotal.WithLabelValues(
r.Method,
r.URL.Path,
fmt.Sprintf("%d", rw.statusCode),
).Inc()
requestDuration.WithLabelValues(
r.Method,
r.URL.Path,
).Observe(duration)
requestSize.WithLabelValues(
r.Method,
r.URL.Path,
).Observe(float64(r.ContentLength))
})
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
// ==================== 业务指标 ====================
// BusinessMetrics 业务指标
type BusinessMetrics struct {
orderTotal prometheus.Counter
orderAmount prometheus.Gauge
userRegister prometheus.Counter
}
func NewBusinessMetrics() *BusinessMetrics {
return &BusinessMetrics{
orderTotal: prometheus.NewCounter(
prometheus.CounterOpts{
Name: "business_order_total",
Help: "Total number of orders",
},
),
orderAmount: prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "business_order_amount_yuan",
Help: "Total order amount in yuan",
},
),
userRegister: prometheus.NewCounter(
prometheus.CounterOpts{
Name: "business_user_register_total",
Help: "Total number of user registrations",
},
),
}
}
func (m *BusinessMetrics) RecordOrder(amount float64) {
m.orderTotal.Inc()
m.orderAmount.Add(amount)
}
func (m *BusinessMetrics) RecordUserRegister() {
m.userRegister.Inc()
}
// ==================== 自定义 Collector ====================
// CustomCollector 自定义采集器
type CustomCollector struct {
queueSizeDesc *prometheus.Desc
queueLatency *prometheus.Desc
}
func NewCustomCollector() *CustomCollector {
return &CustomCollector{
queueSizeDesc: prometheus.NewDesc(
"queue_size",
"Current queue size",
[]string{"queue_name"},
nil,
),
queueLatency: prometheus.NewDesc(
"queue_latency_seconds",
"Queue processing latency",
[]string{"queue_name"},
nil,
),
}
}
// Describe 实现 Collector 接口
func (c *CustomCollector) Describe(ch chan<- *prometheus.Desc) {
ch <- c.queueSizeDesc
ch <- c.queueLatency
}
// Collect 实现 Collector 接口
func (c *CustomCollector) Collect(ch chan<- prometheus.Metric) {
// 模拟从队列获取指标
queueSize := c.getQueueSize("order_queue")
queueLatency := c.getQueueLatency("order_queue")
ch <- prometheus.MustNewConstMetric(
c.queueSizeDesc,
prometheus.GaugeValue,
float64(queueSize),
"order_queue",
)
ch <- prometheus.MustNewConstMetric(
c.queueLatency,
prometheus.GaugeValue,
queueLatency,
"order_queue",
)
}
func (c *CustomCollector) getQueueSize(queueName string) int {
// 实际实现:从 Redis、RabbitMQ 等获取队列长度
return 100
}
func (c *CustomCollector) getQueueLatency(queueName string) float64 {
// 实际实现:计算队列处理延迟
return 0.5
}
// ==================== Push Gateway ====================
// 说明:push 包已在文件顶部统一 import
// PushMetrics 推送指标到 Push Gateway
func PushMetrics() {
// 创建临时指标
jobDuration := prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "job_duration_seconds",
Help: "Job duration",
},
)
// 记录指标
start := time.Now()
doJob()
duration := time.Since(start).Seconds()
jobDuration.Set(duration)
// 推送到 Push Gateway
if err := push.New("http://localhost:9091", "batch_job").
Collector(jobDuration).
Grouping("instance", "batch-server-1").
Push(); err != nil {
fmt.Printf("Could not push to Pushgateway: %v\n", err)
}
}
func doJob() {
time.Sleep(2 * time.Second)
}
// ==================== Main ====================
func main() {
// 注册自定义 Collector
customCollector := NewCustomCollector()
prometheus.MustRegister(customCollector)
// 业务指标
businessMetrics := NewBusinessMetrics()
prometheus.MustRegister(
businessMetrics.orderTotal,
businessMetrics.orderAmount,
businessMetrics.userRegister,
)
// HTTP 服务
mux := http.NewServeMux()
mux.HandleFunc("/api/order", func(w http.ResponseWriter, r *http.Request) {
// 模拟下单
businessMetrics.RecordOrder(99.99)
w.Write([]byte("Order created"))
})
// Prometheus metrics endpoint
mux.Handle("/metrics", promhttp.Handler())
// 应用中间件
handler := MetricsMiddleware(mux)
fmt.Println("Server started at :8080")
fmt.Println("Metrics available at http://localhost:8080/metrics")
http.ListenAndServe(":8080", handler)
}
Prometheus 配置:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
# 告警规则文件
rule_files:
- "alert_rules.yml"
# AlertManager 配置
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
# 抓取配置
scrape_configs:
# Prometheus 自身
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# 应用服务
- job_name: 'app'
static_configs:
- targets:
- 'app-server-1:8080'
- 'app-server-2:8080'
labels:
env: 'production'
region: 'us-west'
# Node Exporter(系统指标)
- job_name: 'node'
static_configs:
- targets:
- 'node-1:9100'
- 'node-2:9100'
# MySQL Exporter
- job_name: 'mysql'
static_configs:
- targets: ['mysql-exporter:9104']
# Redis Exporter
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
# 服务发现(Kubernetes)
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
告警规则:
# alert_rules.yml
groups:
- name: app_alerts
interval: 30s
rules:
# API 错误率告警
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint) /
sum(rate(http_requests_total[5m])) by (endpoint) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.endpoint }}"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
# API 延迟告警
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency on {{ $labels.endpoint }}"
description: "P99 latency is {{ $value }}s (threshold: 1s)"
# CPU 使用率告警
- alert: HighCPU
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% (threshold: 80%)"
# 内存使用率告警
- alert: HighMemory
expr: |
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}% (threshold: 85%)"
# 磁盘使用率告警
- alert: HighDiskUsage
expr: |
(node_filesystem_size_bytes - node_filesystem_free_bytes) /
node_filesystem_size_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High disk usage on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}% on {{ $labels.mountpoint }}"
# 服务宕机告警
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 1 minute"
# 业务指标告警
- alert: OrderDropRate
expr: |
rate(business_order_total[5m]) < 10
for: 10m
labels:
severity: warning
annotations:
summary: "Order rate is too low"
description: "Order rate is {{ $value }} orders/sec (threshold: 10)"
AlertManager 配置:
# alertmanager.yml
global:
resolve_timeout: 5m
# 告警路由
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# 严重告警立即发送
- match:
severity: critical
receiver: 'critical'
group_wait: 0s
repeat_interval: 5m
# 警告告警
- match:
severity: warning
receiver: 'warning'
repeat_interval: 1h
# 抑制规则(避免告警风暴)
inhibit_rules:
# 如果实例宕机,抑制该实例的其他告警
- source_match:
alertname: 'InstanceDown'
target_match_re:
instance: '.*'
equal: ['instance']
# 接收器配置
receivers:
- name: 'default'
webhook_configs:
- url: 'http://localhost:5001/webhook'
- name: 'critical'
email_configs:
- to: 'oncall@example.com'
from: 'alert@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alert@example.com'
auth_password: 'password'
headers:
Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
webhook_configs:
- url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
send_resolved: true
- name: 'warning'
email_configs:
- to: 'team@example.com'
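上面 default 接收器指向的 http://localhost:5001/webhook 需要一个接收端服务。下面是一个最小的 Go Webhook 接收端示意:只解析 AlertManager 回调 JSON 中的常用字段,转发到钉钉/短信等渠道的逻辑需按需补充:
package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// Notification 对应 AlertManager webhook 回调 JSON 的部分字段
type Notification struct {
    Status string  `json:"status"` // firing / resolved
    Alerts []Alert `json:"alerts"`
}

type Alert struct {
    Status      string            `json:"status"`
    Labels      map[string]string `json:"labels"`
    Annotations map[string]string `json:"annotations"`
    StartsAt    string            `json:"startsAt"`
}

func webhookHandler(w http.ResponseWriter, r *http.Request) {
    var n Notification
    if err := json.NewDecoder(r.Body).Decode(&n); err != nil {
        http.Error(w, "bad payload", http.StatusBadRequest)
        return
    }
    for _, a := range n.Alerts {
        // 实际使用时可按 severity 标签路由到邮件 / 短信 / 钉钉
        log.Printf("[%s] %s instance=%s summary=%s",
            a.Status, a.Labels["alertname"], a.Labels["instance"], a.Annotations["summary"])
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/webhook", webhookHandler)
    log.Println("alert webhook receiver listening on :5001")
    log.Fatal(http.ListenAndServe(":5001", nil))
}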
V1 特点:
优点:
- 简单易用,社区活跃
- PromQL 查询灵活
- 与 Kubernetes 深度集成
缺点:
- 单机存储,扩展性有限
- 长期存储需要外部方案(如 Thanos、VictoriaMetrics)
3.3 V2: ELK 日志系统
架构:Elasticsearch + Logstash + Kibana(或 Filebeat)
// ==================== 结构化日志 ====================
package logging
import (
"fmt"
"os"
"time"
"go.uber.org/zap"
"go.uber.org/zap/zapcore"
)
// Logger 日志实例
var Logger *zap.Logger
func init() {
// 配置 Encoder
encoderConfig := zapcore.EncoderConfig{
TimeKey: "timestamp",
LevelKey: "level",
NameKey: "logger",
CallerKey: "caller",
MessageKey: "message",
StacktraceKey: "stacktrace",
LineEnding: zapcore.DefaultLineEnding,
EncodeLevel: zapcore.LowercaseLevelEncoder,
EncodeTime: zapcore.ISO8601TimeEncoder,
EncodeDuration: zapcore.SecondsDurationEncoder,
EncodeCaller: zapcore.ShortCallerEncoder,
}
// JSON 格式输出(便于 ELK 解析)
core := zapcore.NewCore(
zapcore.NewJSONEncoder(encoderConfig),
zapcore.AddSync(os.Stdout),
zap.InfoLevel,
)
Logger = zap.New(core, zap.AddCaller(), zap.AddStacktrace(zapcore.ErrorLevel))
}
// 使用示例
func ExampleUsage() {
Logger.Info("User login",
zap.String("user_id", "12345"),
zap.String("ip", "192.168.1.100"),
zap.Duration("duration", 150*time.Millisecond),
)
Logger.Error("Payment failed",
zap.String("order_id", "ORD123"),
zap.Float64("amount", 99.99),
zap.Error(fmt.Errorf("insufficient balance")),
)
}
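在上面 Logger 的基础上,通常还会加一个 HTTP 访问日志中间件,把每个请求以 JSON 形式打到 stdout,由 Filebeat 统一采集。下面是一个可独立运行的简化示例(为便于演示用 zap.NewExample() 代替上文初始化的 Logger,字段名仅为示意):
package main

import (
    "net/http"
    "time"

    "go.uber.org/zap"
)

var logger = zap.NewExample() // 实际项目中复用上文 logging 包里的 Logger

// AccessLogMiddleware 以结构化字段记录每个 HTTP 请求
func AccessLogMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next.ServeHTTP(w, r)

        logger.Info("http_access",
            zap.String("method", r.Method),
            zap.String("path", r.URL.Path),
            zap.String("ip", r.RemoteAddr),
            zap.Duration("duration", time.Since(start)),
        )
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("pong"))
    })
    http.ListenAndServe(":8080", AccessLogMiddleware(mux))
}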
Filebeat 配置:
# filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
fields:
app: myapp
env: production
multiline:
pattern: '^\d{4}-\d{2}-\d{2}'
negate: true
match: after
# 输出到 Elasticsearch
output.elasticsearch:
hosts: ["elasticsearch:9200"]
index: "app-logs-%{+yyyy.MM.dd}"
# 输出到 Logstash
# output.logstash:
# hosts: ["logstash:5044"]
# 处理器
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
Elasticsearch 索引模板:
{
"index_patterns": ["app-logs-*"],
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "5s"
},
"mappings": {
"properties": {
"timestamp": { "type": "date" },
"level": { "type": "keyword" },
"message": { "type": "text" },
"user_id": { "type": "keyword" },
"order_id": { "type": "keyword" },
"ip": { "type": "ip" },
"duration": { "type": "float" },
"error": { "type": "text" }
}
}
}
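日志进入 Elasticsearch 后,除了用 Kibana 查询,也可以直接调用 _search REST API 做程序化排查。下面用 Go 标准库发起一次查询作为示意(地址与索引名沿用上文配置,查询最近 15 分钟的 error 日志):
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // 查询最近 15 分钟 level=error 的日志,按时间倒序取 20 条
    query := []byte(`{
      "size": 20,
      "sort": [{"timestamp": "desc"}],
      "query": {
        "bool": {
          "filter": [
            {"term": {"level": "error"}},
            {"range": {"timestamp": {"gte": "now-15m"}}}
          ]
        }
      }
    }`)

    resp, err := http.Post(
        "http://elasticsearch:9200/app-logs-*/_search",
        "application/json",
        bytes.NewReader(query),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // 实际使用时应解析 hits.hits 做进一步处理
}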
3.4 V3: 链路追踪(Jaeger)
// ==================== 链路追踪 ====================
package tracing
import (
"context"
"io"
"net/http"
"time"
"github.com/opentracing/opentracing-go"
"github.com/opentracing/opentracing-go/ext"
"github.com/uber/jaeger-client-go"
jaegercfg "github.com/uber/jaeger-client-go/config"
)
// InitTracer 初始化 Tracer
func InitTracer(serviceName string) (opentracing.Tracer, io.Closer, error) {
cfg := jaegercfg.Configuration{
ServiceName: serviceName,
Sampler: &jaegercfg.SamplerConfig{
Type: jaeger.SamplerTypeConst,
Param: 1, // 采样率 100%
},
Reporter: &jaegercfg.ReporterConfig{
LogSpans: true,
BufferFlushInterval: 1 * time.Second,
LocalAgentHostPort: "localhost:6831",
},
}
tracer, closer, err := cfg.NewTracer()
if err != nil {
return nil, nil, err
}
opentracing.SetGlobalTracer(tracer)
return tracer, closer, nil
}
// TracingMiddleware HTTP 追踪中间件
func TracingMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
tracer := opentracing.GlobalTracer()
// 从 HTTP Header 提取 SpanContext
spanCtx, _ := tracer.Extract(
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(r.Header),
)
// 创建 Span
span := tracer.StartSpan(
r.URL.Path,
ext.RPCServerOption(spanCtx),
)
defer span.Finish()
// 设置标签
ext.HTTPMethod.Set(span, r.Method)
ext.HTTPUrl.Set(span, r.URL.String())
// 将 Span 注入到 Context
ctx := opentracing.ContextWithSpan(r.Context(), span)
// 继续处理请求
next.ServeHTTP(w, r.WithContext(ctx))
})
}
// 业务代码示例
func CreateOrder(ctx context.Context, orderID string) error {
// 创建子 Span
span, ctx := opentracing.StartSpanFromContext(ctx, "CreateOrder")
defer span.Finish()
span.SetTag("order_id", orderID)
// 调用库存服务
if err := checkInventory(ctx, orderID); err != nil {
ext.Error.Set(span, true)
span.LogKV("event", "error", "message", err.Error())
return err
}
// 调用支付服务
if err := processPayment(ctx, orderID); err != nil {
ext.Error.Set(span, true)
return err
}
return nil
}
func checkInventory(ctx context.Context, orderID string) error {
span, _ := opentracing.StartSpanFromContext(ctx, "CheckInventory")
defer span.Finish()
// 模拟 RPC 调用
time.Sleep(50 * time.Millisecond)
span.SetTag("inventory_available", true)
return nil
}
func processPayment(ctx context.Context, orderID string) error {
span, _ := opentracing.StartSpanFromContext(ctx, "ProcessPayment")
defer span.Finish()
// 模拟支付处理
time.Sleep(100 * time.Millisecond)
span.SetTag("payment_status", "success")
return nil
}
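上面的中间件负责在服务端提取(Extract)SpanContext;要让调用链跨服务串起来,客户端发起下游调用时还要把 SpanContext 注入(Inject)到出站请求的 Header 中。下面是一个客户端注入的示意(下游服务地址为假设值):
// CallInventoryService 携带追踪上下文调用下游库存服务
func CallInventoryService(ctx context.Context, orderID string) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, "CallInventoryService")
    defer span.Finish()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        "http://inventory-service/check?order_id="+orderID, nil)
    if err != nil {
        return err
    }

    // 标记为 RPC 客户端,并把 SpanContext 注入 HTTP Header
    ext.SpanKindRPCClient.Set(span)
    ext.HTTPMethod.Set(span, req.Method)
    ext.HTTPUrl.Set(span, req.URL.String())
    if err := opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    ); err != nil {
        return err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        ext.Error.Set(span, true)
        return err
    }
    defer resp.Body.Close()
    ext.HTTPStatusCode.Set(span, uint16(resp.StatusCode))
    return nil
}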
4. PromQL 查询示例
4.1 基础查询
# 查询当前值
http_requests_total
# 过滤标签
http_requests_total{method="GET", status="200"}
# 正则匹配
http_requests_total{endpoint=~"/api/.*"}
# 范围查询(过去 5 分钟)
http_requests_total[5m]
# 速率计算(每秒请求数)
rate(http_requests_total[5m])
# 总和
sum(http_requests_total) by (endpoint)
# 平均延迟(Histogram 用 _sum / _count 计算)
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint) /
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)
# 分位数
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
4.2 常用监控查询
# QPS
sum(rate(http_requests_total[1m]))
# 错误率
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# P99 延迟
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU 使用率
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# 内存使用率
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100
# 磁盘使用率
(node_filesystem_size_bytes - node_filesystem_free_bytes) /
node_filesystem_size_bytes * 100
# 网络流量
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
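这些 PromQL 除了配合 Grafana 使用,也可以通过 Prometheus 的 HTTP API 程序化查询,例如在发布流程中做自动化指标校验。下面用官方 client_golang 的 API 包做一次即时查询的示意(Prometheus 地址为示例值):
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
    if err != nil {
        panic(err)
    }
    promAPI := v1.NewAPI(client)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // 即时查询:当前总 QPS
    result, warnings, err := promAPI.Query(ctx, "sum(rate(http_requests_total[1m]))", time.Now())
    if err != nil {
        panic(err)
    }
    if len(warnings) > 0 {
        fmt.Println("warnings:", warnings)
    }
    fmt.Println("QPS:", result) // result 为 model.Value,可断言为 model.Vector 进一步处理
}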
5. Grafana Dashboard 配置
{
"dashboard": {
"title": "Application Monitoring",
"panels": [
{
"title": "QPS",
"targets": [
{
"expr": "sum(rate(http_requests_total[1m])) by (endpoint)",
"legendFormat": "{{endpoint}}"
}
],
"type": "graph"
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
"legendFormat": "Error Rate"
}
],
"type": "graph",
"yaxes": [
{
"format": "percentunit"
}
]
},
{
"title": "P99 Latency",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
],
"type": "graph"
}
]
}
}
6. 面试问答(10个高频题)
Prometheus 的数据模型是什么?
答:
核心概念:
- Metric Name:指标名称(如 http_requests_total)
- Labels:标签键值对(如 {method="GET", status="200"})
- Timestamp:时间戳
- Value:浮点数值
指标类型:
- Counter:单调递增(如请求总数)
- Gauge:可增可减(如内存使用量)
- Histogram:分布统计(如延迟分布)
- Summary:分位数统计(如 P99 延迟)
示例:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234 @1699876200
Pull 和 Push 模型有什么区别?
答:
| 对比项 | Pull(Prometheus) | Push(传统监控) |
|---|---|---|
| 主动性 | 服务端拉取 | 客户端推送 |
| 网络 | 服务端控制,稳定 | 客户端控制,易失败 |
| 服务发现 | 容易(抓取目标) | 困难(需要注册) |
| 调试 | 容易(/metrics 端点) | 困难 |
| 短任务 | 不适合(需要 Push Gateway) | 适合 |
选择建议:
- 长期运行的服务:Pull(Prometheus)
- 短期任务、批处理:Push(Push Gateway)
如何设计告警规则?
答:
原则:
- 可操作性:只对需要人工介入的问题告警
- 准确性:避免误报
- 及时性:尽早发现问题
- 优先级:区分 Critical / Warning
常见告警:
可用性:
# 服务宕机
up == 0
性能:
# 高延迟
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
# 高错误率
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
资源:
# 高 CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
# 高内存
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
业务:
# 订单量下降
rate(business_order_total[5m]) < 10
如何防止告警风暴?
答:
方法:
1. 告警聚合(Group By)
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
2. 告警抑制(Inhibit)
inhibit_rules:
# 如果实例宕机,抑制该实例的其他告警
- source_match:
alertname: 'InstanceDown'
target_match_re:
instance: '.*'
equal: ['instance']
3. 告警静默(Silence)
# 维护期间静默告警
amtool silence add alertname=HighCPU --duration=2h --comment="maintenance"
4. 告警收敛(Repeat Interval)
route:
repeat_interval: 12h # 12小时内相同告警只发一次
监控系统如何保证高可用?
答:
Prometheus 高可用:
方案1:双活部署
Prometheus-1 ──┐
├──> Target (App)
Prometheus-2 ──┘
两个 Prometheus 同时抓取相同目标
优点:简单 缺点:数据冗余,查询需要去重
方案2:联邦集群
Global Prometheus
↓
┌───┴───┐
↓ ↓
Prom-1 Prom-2
↓ ↓
App-1 App-2
配置:
scrape_configs:
- job_name: 'federate'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="app"}'
static_configs:
- targets:
- 'prometheus-1:9090'
- 'prometheus-2:9090'
方案3:远程存储
Prometheus ──> Remote Write ──> Thanos/VictoriaMetrics
如何监控 Kubernetes?
答:
核心指标:
1. 节点指标:
# 节点 CPU
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
# 节点内存
sum(container_memory_working_set_bytes{container!=""}) by (node)
2. Pod 指标:
# Pod CPU
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)
# Pod 内存
sum(container_memory_working_set_bytes{pod=~"app-.*"}) by (pod)
3. 集群指标:
# Pod 数量
count(kube_pod_info)
# 不健康 Pod
sum(kube_pod_status_phase{phase=~"Pending|Failed|Unknown"})
服务发现:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
日志和指标有什么区别?
答:
| 对比项 | 日志(Logs) | 指标(Metrics) |
|---|---|---|
| 数据类型 | 离散事件 | 聚合数据 |
| 存储 | 大(TB级) | 小(GB级) |
| 查询 | 全文搜索 | 时序查询 |
| 实时性 | 一般 | 高 |
| 成本 | 高 | 低 |
| 用途 | 问题排查 | 趋势分析、告警 |
使用场景:
- 指标:监控系统健康、性能趋势、触发告警
- 日志:问题排查、审计、详细上下文
- 追踪:分布式调用链分析
结合使用:
告警触发(Metrics)→ 查看日志(Logs)→ 分析调用链(Traces)
如何降低监控成本?
答:
1. 数据采样:
# 降低采集频率
scrape_interval: 60s # 从 15s 改为 60s
2. 数据保留:
# 缩短保留时间
--storage.tsdb.retention.time=7d # 从 15d 改为 7d
3. 降采样:
# 原始数据 → 聚合数据
sum(rate(http_requests_total[5m])) by (endpoint)
4. 远程存储:
- 热数据(7天):Prometheus 本地
- 冷数据(90天):对象存储(S3)
5. 标签优化:
// 避免高基数标签
// 不好:user_id 作为标签(百万级)
http_requests_total{user_id="12345"}
// 好:放在日志中
log.Info("request", zap.String("user_id", "12345"))
链路追踪如何实现?
答:
核心概念:
- Trace:一次完整请求
- Span:一个操作单元
- SpanContext:追踪上下文(TraceID、SpanID)
实现原理:
1. 生成 TraceID:
TraceID: 全局唯一 ID(通常为 64 或 128 位)
SpanID: 64位唯一ID
ParentSpanID: 父 Span ID
2. 传递 Context:
// HTTP Header
X-Trace-ID: abc123
X-Span-ID: def456
X-Parent-Span-ID: 789
// gRPC Metadata
metadata.New(map[string]string{
"trace-id": "abc123",
"span-id": "def456",
})
3. 记录 Span:
span := tracer.StartSpan("CreateOrder")
span.SetTag("order_id", "ORD123")
span.SetTag("amount", 99.99)
span.Finish()
4. 上报到后端:
Agent ──> Collector ──> Storage ──> UI
如何设计监控系统的架构?
答:
分层架构:
┌─────────────────────────────────────────┐
│ 采集层 │
│ Exporters / Agents / SDKs │
└──────────────┬──────────────────────────┘
┌─────────────────────────────────────────┐
│ 传输层 │
│ Kafka / Fluentd / Vector │
└──────────────┬──────────────────────────┘
┌─────────────────────────────────────────┐
│ 存储层 │
│ Prometheus / ES / ClickHouse │
└──────────────┬──────────────────────────┘
┌─────────────────────────────────────────┐
│ 查询层 │
│ Grafana / Kibana / Jaeger UI │
└─────────────────────────────────────────┘
技术选型:
| 类型 | 开源方案 | 商业方案 |
|---|---|---|
| 指标 | Prometheus + Grafana | Datadog, New Relic |
| 日志 | ELK / Loki | Splunk, Sumo Logic |
| 追踪 | Jaeger / Zipkin | Datadog APM |
| 统一 | - | Datadog, Dynatrace |
推荐方案:
- 小团队:Prometheus + Loki + Tempo(Grafana 全家桶)
- 大团队:自建 + 商业监控(混合)
7. 总结
核心要点
监控类型
- Metrics:指标、趋势、告警
- Logs:事件、排查
- Traces:调用链、性能
Prometheus
- Pull 模型
- PromQL 查询
- 告警规则
- 服务发现
告警设计
- 可操作性
- 避免误报
- 防止告警风暴
- 分级处理
高可用
- 双活部署
- 联邦集群
- 远程存储
最佳实践
- 指标命名:<namespace>_<subsystem>_<metric>_<unit>
- 标签设计:避免高基数
- 告警规则:先宽后严,逐步优化
- 数据保留:7天热数据 + 90天冷数据
- 成本优化:降采样、远程存储、标签优化
本章完,祝面试顺利!