15 - Monitoring System Design

> Interview frequency:

Requirement type     | Target
Collection interval  | 10-60 seconds
Data retention       | Raw data 7 days, aggregated data 90 days
Query latency        | < 1 second
Alert latency        | < 1 minute
Availability         | 99.9%
Scalability          | 100,000+ metrics

1.3 Follow-up Questions the Interviewer May Ask

Q1: What is the difference between a monitoring system and a logging system?

A1:

  • Monitoring: focuses on metrics, trends, and alerting (Metrics)
  • Logging: focuses on events, context, and troubleshooting (Logs)
  • Tracing: focuses on call chains and performance analysis (Traces)

Q2: What is the difference between Prometheus and Zabbix?

A2:

  • Prometheus: pull model, time-series database, cloud-native
  • Zabbix: agent-based collection, relational database backend, traditional infrastructure operations
  • Choosing: pick Prometheus for cloud-native environments, Zabbix for traditional infrastructure

2. Capacity Estimation

2.1 Scenario Assumptions

Assume we monitor 1,000 servers and 500 applications:

  • Servers: 1,000
  • Metrics per server: 100 (CPU, memory, disk, etc.)
  • Application instances: 5,000
  • Metrics per instance: 50
  • Scrape interval: 15 seconds

2.2 Data Volume

Server metrics:
1,000 servers × 100 metrics × 4 samples/minute = 400,000 data points/minute

Application metrics:
5,000 instances × 50 metrics × 4 samples/minute = 1,000,000 data points/minute

Total: 1.4 million data points/minute ≈ 23,333 data points/second

2.3 Storage

One data point = timestamp (8 bytes) + value (8 bytes) = 16 bytes

Raw data (7 days):
1.4 million/minute × 60 × 24 × 7 ≈ 14.1 billion points × 16 bytes ≈ 226 GB

Prometheus compresses samples to roughly 1-2 bytes each, about an 8× reduction:
226 GB / 8 ≈ 28 GB

2.4 Query QPS

Number of dashboards = 200
Average refresh interval = 30 seconds
Queries per dashboard = 10

Query QPS = 200 × 10 / 30 ≈ 67 QPS
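
As a quick sanity check, the arithmetic above can be reproduced in a few lines of Go (a throwaway sketch; the constants simply restate the assumptions from section 2.1):

package main

import "fmt"

func main() {
    const (
        servers          = 1000
        metricsPerServer = 100
        instances        = 5000
        metricsPerInst   = 50
        samplesPerMin    = 4  // one sample every 15 seconds
        rawBytesPerPoint = 16 // 8-byte timestamp + 8-byte value
        tsdbBytesPerPt   = 2  // rough Prometheus compression estimate
    )

    pointsPerMin := (servers*metricsPerServer + instances*metricsPerInst) * samplesPerMin
    points7d := float64(pointsPerMin) * 60 * 24 * 7

    fmt.Printf("points/min: %d, points/sec: %.0f\n", pointsPerMin, float64(pointsPerMin)/60)
    fmt.Printf("raw 7d: %.0f GB, compressed 7d: %.0f GB\n",
        points7d*rawBytesPerPoint/1e9, points7d*tsdbBytesPerPt/1e9)
    fmt.Printf("query QPS: %.0f\n", 200.0*10/30) // dashboards × queries / refresh interval
}

Running it prints roughly 1.4 million points/minute (≈ 23,333/s), about 226 GB raw and 28 GB compressed for 7 days, and about 67 query QPS.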

3. Architecture Design

3.1 Overall Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Data sources                          │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Servers  │  │   Apps   │  │ Databases│  │Middleware│     │
│  │ (Node)   │  │  (App)   │  │ (MySQL)  │  │ (Redis)  │     │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘     │
└───────┼─────────────┼─────────────┼─────────────┼───────────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────┐
│                      Collection layer                        │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Node     │  │ App      │  │ MySQL    │  │ Filebeat │     │
│  │ Exporter │  │ Exporter │  │ Exporter │  │ (logs)   │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
└─────────────┬────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│                       Storage layer                          │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Prometheus   │  │ Elasticsearch│  │    Jaeger    │       │
│  │  (metrics)   │  │    (logs)    │  │   (traces)   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────┬────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Query & alerting layer                     │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐                          │
│  │ AlertManager │  │   Grafana    │                          │
│  │  (alerting)  │  │ (dashboards) │                          │
│  └──────────────┘  └──────────────┘                          │
└─────────────┬────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Notification layer                       │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                    │
│  │  Email   │  │   SMS    │  │ DingTalk │                    │
│  └──────────┘  └──────────┘  └──────────┘                    │
└─────────────────────────────────────────────────────────────┘

3.2 V1: Prometheus + Grafana

Best suited for: small to mid-size companies, microservice architectures

package main

import (
    "fmt"
    "net/http"
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "github.com/prometheus/client_golang/prometheus/push"
)

// ==================== Custom Exporter ====================

// Metric definitions
var (
    // Counter: monotonically increasing
    requestTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    
    // Gauge: can go up or down
    activeConnections = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
    
    // Histogram: bucketed distribution
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latencies",
            Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1, 2, 5},
        },
        []string{"method", "endpoint"},
    )
    
    // Summary: client-side quantiles
    requestSize = prometheus.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_request_size_bytes",
            Help:       "HTTP request sizes",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Register the metrics
    prometheus.MustRegister(requestTotal)
    prometheus.MustRegister(activeConnections)
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(requestSize)
}

// MetricsMiddleware records metrics for every HTTP request
func MetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        
        // Track the number of active connections
        activeConnections.Inc()
        defer activeConnections.Dec()
        
        // Wrap the ResponseWriter to capture the status code
        rw := &responseWriter{ResponseWriter: w, statusCode: 200}
        
        // Handle the request
        next.ServeHTTP(rw, r)
        
        // Record metrics
        duration := time.Since(start).Seconds()
        
        requestTotal.WithLabelValues(
            r.Method,
            r.URL.Path,
            fmt.Sprintf("%d", rw.statusCode),
        ).Inc()
        
        requestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(duration)
        
        requestSize.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(float64(r.ContentLength))
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// ==================== Business Metrics ====================

// BusinessMetrics holds application-level business metrics
type BusinessMetrics struct {
    orderTotal   prometheus.Counter
    orderAmount  prometheus.Gauge
    userRegister prometheus.Counter
}

func NewBusinessMetrics() *BusinessMetrics {
    return &BusinessMetrics{
        orderTotal: prometheus.NewCounter(
            prometheus.CounterOpts{
                Name: "business_order_total",
                Help: "Total number of orders",
            },
        ),
        orderAmount: prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "business_order_amount_yuan",
                Help: "Total order amount in yuan",
            },
        ),
        userRegister: prometheus.NewCounter(
            prometheus.CounterOpts{
                Name: "business_user_register_total",
                Help: "Total number of user registrations",
            },
        ),
    }
}

func (m *BusinessMetrics) RecordOrder(amount float64) {
    m.orderTotal.Inc()
    m.orderAmount.Add(amount)
}

func (m *BusinessMetrics) RecordUserRegister() {
    m.userRegister.Inc()
}

// ==================== Custom Collector ====================

// CustomCollector gathers metrics on demand at scrape time
type CustomCollector struct {
    queueSizeDesc *prometheus.Desc
    queueLatency  *prometheus.Desc
}

func NewCustomCollector() *CustomCollector {
    return &CustomCollector{
        queueSizeDesc: prometheus.NewDesc(
            "queue_size",
            "Current queue size",
            []string{"queue_name"},
            nil,
        ),
        queueLatency: prometheus.NewDesc(
            "queue_latency_seconds",
            "Queue processing latency",
            []string{"queue_name"},
            nil,
        ),
    }
}

// Describe implements the prometheus.Collector interface
func (c *CustomCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.queueSizeDesc
    ch <- c.queueLatency
}

// Collect implements the prometheus.Collector interface
func (c *CustomCollector) Collect(ch chan<- prometheus.Metric) {
    // Simulate fetching metrics from a queue
    queueSize := c.getQueueSize("order_queue")
    queueLatency := c.getQueueLatency("order_queue")
    
    ch <- prometheus.MustNewConstMetric(
        c.queueSizeDesc,
        prometheus.GaugeValue,
        float64(queueSize),
        "order_queue",
    )
    
    ch <- prometheus.MustNewConstMetric(
        c.queueLatency,
        prometheus.GaugeValue,
        queueLatency,
        "order_queue",
    )
}

func (c *CustomCollector) getQueueSize(queueName string) int {
    // Real implementation: read the queue length from Redis, RabbitMQ, etc.
    return 100
}

func (c *CustomCollector) getQueueLatency(queueName string) float64 {
    // Real implementation: compute the queue processing latency
    return 0.5
}

// ==================== Push Gateway ====================
// (github.com/prometheus/client_golang/prometheus/push is imported at the top of the file)

// PushMetrics pushes metrics from a short-lived job to the Push Gateway
func PushMetrics() {
    // Create a throwaway metric for this job run
    jobDuration := prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "job_duration_seconds",
            Help: "Job duration",
        },
    )
    
    // Measure the job duration
    start := time.Now()
    doJob()
    duration := time.Since(start).Seconds()
    
    jobDuration.Set(duration)
    
    // Push to the Push Gateway
    if err := push.New("http://localhost:9091", "batch_job").
        Collector(jobDuration).
        Grouping("instance", "batch-server-1").
        Push(); err != nil {
        fmt.Printf("Could not push to Pushgateway: %v\n", err)
    }
}

func doJob() {
    time.Sleep(2 * time.Second)
}

// ==================== Main ====================

func main() {
    // Register the custom Collector
    customCollector := NewCustomCollector()
    prometheus.MustRegister(customCollector)
    
    // Business metrics
    businessMetrics := NewBusinessMetrics()
    prometheus.MustRegister(
        businessMetrics.orderTotal,
        businessMetrics.orderAmount,
        businessMetrics.userRegister,
    )
    
    // HTTP server
    mux := http.NewServeMux()
    
    mux.HandleFunc("/api/order", func(w http.ResponseWriter, r *http.Request) {
        // Simulate placing an order
        businessMetrics.RecordOrder(99.99)
        w.Write([]byte("Order created"))
    })
    
    // Prometheus metrics endpoint
    mux.Handle("/metrics", promhttp.Handler())
    
    // Apply the metrics middleware
    handler := MetricsMiddleware(mux)
    
    fmt.Println("Server started at :8080")
    fmt.Println("Metrics available at http://localhost:8080/metrics")
    http.ListenAndServe(":8080", handler)
}

Prometheus configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alerting rule files
rule_files:
  - "alert_rules.yml"

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Application services
  - job_name: 'app'
    static_configs:
      - targets: 
          - 'app-server-1:8080'
          - 'app-server-2:8080'
        labels:
          env: 'production'
          region: 'us-west'
  
  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-1:9100'
          - 'node-2:9100'
  
  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
  
  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  
  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Alerting rules:

# alert_rules.yml
groups:
  - name: app_alerts
    interval: 30s
    rules:
      # API error-rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint) /
          sum(rate(http_requests_total[5m])) by (endpoint) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.endpoint }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
      
      # API latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.endpoint }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
      
      # CPU usage alert
      - alert: HighCPU
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"
      
      # Memory usage alert
      - alert: HighMemory
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 
          node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% (threshold: 85%)"
      
      # Disk usage alert
      - alert: HighDiskUsage
        expr: |
          (node_filesystem_size_bytes - node_filesystem_free_bytes) / 
          node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on {{ $labels.mountpoint }}"
      
      # Instance-down alert
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 1 minute"
      
      # Business-metric alert
      - alert: OrderDropRate
        expr: |
          rate(business_order_total[5m]) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Order rate is too low"
          description: "Order rate is {{ $value }} orders/sec (threshold: 10)"

AlertManager configuration:

# alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  
  routes:
    # Send critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 0s
      repeat_interval: 5m
    
    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning'
      repeat_interval: 1h

# Inhibition rules (avoid alert storms)
inhibit_rules:
  # If an instance is down, suppress that instance's other alerts
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      instance: '.*'
    equal: ['instance']

# Receiver configuration
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
  
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alert@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alert@example.com'
        auth_password: 'password'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
        send_resolved: true
  
  - name: 'warning'
    email_configs:
      - to: 'team@example.com'

V1 characteristics:

  • Simple to operate, active community
  • Flexible PromQL queries
  • Deep Kubernetes integration
  • Single-node storage, limited scalability
  • Long-term storage requires an external solution
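
For a local proof of concept of this V1 stack, a minimal Docker Compose file along the following lines is enough. This is a sketch using the stock images; the mounted file paths are assumptions and must match where you keep the configuration files shown above:

# docker-compose.yml (sketch)
version: "3"
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
  alertmanager:
    image: prom/alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
  node-exporter:
    image: prom/node-exporter
    ports: ["9100:9100"]

Grafana then reads from Prometheus at http://prometheus:9090, and the application from section 3.2 exposes /metrics on port 8080 for Prometheus to scrape.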

3.3 V2: ELK Logging Stack

Architecture: Elasticsearch + Logstash + Kibana (or Filebeat)

// ==================== Structured Logging ====================

package logging

import (
    "fmt"
    "os"
    "time"

    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

// Logger is the shared structured logger
var Logger *zap.Logger

func init() {
    // Configure the encoder
    encoderConfig := zapcore.EncoderConfig{
        TimeKey:        "timestamp",
        LevelKey:       "level",
        NameKey:        "logger",
        CallerKey:      "caller",
        MessageKey:     "message",
        StacktraceKey:  "stacktrace",
        LineEnding:     zapcore.DefaultLineEnding,
        EncodeLevel:    zapcore.LowercaseLevelEncoder,
        EncodeTime:     zapcore.ISO8601TimeEncoder,
        EncodeDuration: zapcore.SecondsDurationEncoder,
        EncodeCaller:   zapcore.ShortCallerEncoder,
    }
    
    // JSON output (easy for the ELK stack to parse)
    core := zapcore.NewCore(
        zapcore.NewJSONEncoder(encoderConfig),
        zapcore.AddSync(os.Stdout),
        zap.InfoLevel,
    )
    
    Logger = zap.New(core, zap.AddCaller(), zap.AddStacktrace(zapcore.ErrorLevel))
}

// Usage example
func ExampleUsage() {
    Logger.Info("User login",
        zap.String("user_id", "12345"),
        zap.String("ip", "192.168.1.100"),
        zap.Duration("duration", 150*time.Millisecond),
    )
    
    Logger.Error("Payment failed",
        zap.String("order_id", "ORD123"),
        zap.Float64("amount", 99.99),
        zap.Error(fmt.Errorf("insufficient balance")),
    )
}

Filebeat configuration:

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      app: myapp
      env: production
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

# Output to Elasticsearch
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

# Alternative: output to Logstash
# output.logstash:
#   hosts: ["logstash:5044"]

# Processors
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true
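
If logs are shipped through Logstash instead of directly to Elasticsearch (the commented-out output above), a minimal pipeline could look like this sketch; the index name mirrors the Filebeat config, and the filter assumes the JSON lines produced by the zap logger:

# logstash.conf (sketch)
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse the JSON log line emitted by the application
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}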

Elasticsearch index template:

{
  "index_patterns": ["app-logs-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text" },
      "user_id": { "type": "keyword" },
      "order_id": { "type": "keyword" },
      "ip": { "type": "ip" },
      "duration": { "type": "float" },
      "error": { "type": "text" }
    }
  }
}
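
With this mapping in place, a typical troubleshooting query is straightforward. For example, a sketch of fetching the last hour of error-level logs for one order (the order ID is a placeholder; field names follow the template above):

GET app-logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "level": "error" } },
        { "term":  { "order_id": "ORD123" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "timestamp": "desc" }]
}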

3.4 V3: Distributed Tracing (Jaeger)

// ==================== Distributed Tracing ====================

package tracing

import (
    "context"
    "io"
    "net/http"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    "github.com/uber/jaeger-client-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
)

// InitTracer initializes the Jaeger tracer
func InitTracer(serviceName string) (opentracing.Tracer, io.Closer, error) {
    cfg := jaegercfg.Configuration{
        ServiceName: serviceName,
        Sampler: &jaegercfg.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1, // 100% sampling rate
        },
        Reporter: &jaegercfg.ReporterConfig{
            LogSpans:            true,
            BufferFlushInterval: 1 * time.Second,
            LocalAgentHostPort:  "localhost:6831",
        },
    }
    
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        return nil, nil, err
    }
    
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer, nil
}

// TracingMiddleware starts a server-side span for each HTTP request
func TracingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tracer := opentracing.GlobalTracer()
        
        // Extract the SpanContext from the incoming HTTP headers
        spanCtx, _ := tracer.Extract(
            opentracing.HTTPHeaders,
            opentracing.HTTPHeadersCarrier(r.Header),
        )
        
        // Start a span
        span := tracer.StartSpan(
            r.URL.Path,
            ext.RPCServerOption(spanCtx),
        )
        defer span.Finish()
        
        // Set tags
        ext.HTTPMethod.Set(span, r.Method)
        ext.HTTPUrl.Set(span, r.URL.String())
        
        // Attach the span to the request context
        ctx := opentracing.ContextWithSpan(r.Context(), span)
        
        // Continue handling the request
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// Business-code example
func CreateOrder(ctx context.Context, orderID string) error {
    // Start a child span
    span, ctx := opentracing.StartSpanFromContext(ctx, "CreateOrder")
    defer span.Finish()
    
    span.SetTag("order_id", orderID)
    
    // Call the inventory service
    if err := checkInventory(ctx, orderID); err != nil {
        ext.Error.Set(span, true)
        span.LogKV("event", "error", "message", err.Error())
        return err
    }
    
    // Call the payment service
    if err := processPayment(ctx, orderID); err != nil {
        ext.Error.Set(span, true)
        return err
    }
    
    return nil
}

func checkInventory(ctx context.Context, orderID string) error {
    span, _ := opentracing.StartSpanFromContext(ctx, "CheckInventory")
    defer span.Finish()
    
    // Simulate an RPC call
    time.Sleep(50 * time.Millisecond)
    
    span.SetTag("inventory_available", true)
    return nil
}

func processPayment(ctx context.Context, orderID string) error {
    span, _ := opentracing.StartSpanFromContext(ctx, "ProcessPayment")
    defer span.Finish()
    
    // Simulate payment processing
    time.Sleep(100 * time.Millisecond)
    
    span.SetTag("payment_status", "success")
    return nil
}
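
// TracingMiddleware above extracts the incoming span context. When this service in
// turn calls a downstream HTTP API, the context also has to be injected into the
// outgoing request so the callee can continue the trace. A minimal sketch using the
// same OpenTracing API (the inventory-service URL is a placeholder):

// callInventory shows client-side context propagation for an outbound HTTP call.
func callInventory(ctx context.Context, orderID string) error {
    span, _ := opentracing.StartSpanFromContext(ctx, "CallInventoryHTTP")
    defer span.Finish()

    req, err := http.NewRequest("GET", "http://inventory-service/api/check?order="+orderID, nil)
    if err != nil {
        return err
    }

    // Tag the span as an RPC client and inject the trace context into the headers.
    ext.SpanKindRPCClient.Set(span)
    ext.HTTPMethod.Set(span, req.Method)
    ext.HTTPUrl.Set(span, req.URL.String())
    if err := opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    ); err != nil {
        return err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        ext.Error.Set(span, true)
        return err
    }
    defer resp.Body.Close()
    return nil
}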

4. PromQL Query Examples

4.1 Basic Queries

# Query the current value
http_requests_total

# Filter by labels
http_requests_total{method="GET", status="200"}

# Regex matching
http_requests_total{endpoint=~"/api/.*"}

# Range query (last 5 minutes)
http_requests_total[5m]

# Per-second rate
rate(http_requests_total[5m])

# Sum
sum(http_requests_total) by (endpoint)

# Average latency (from the histogram's sum and count)
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint) /
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)

# Quantile
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

4.2 Common Monitoring Queries

# QPS
sum(rate(http_requests_total[1m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 
node_memory_MemTotal_bytes * 100

# Disk usage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / 
node_filesystem_size_bytes * 100

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

5. Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "Application Monitoring",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "type": "graph",
        "yaxes": [
          {
            "format": "percentunit"
          }
        ]
      },
      {
        "title": "P99 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

6. Interview Q&A (10 Frequently Asked Questions)

What is Prometheus's data model?

Answer:

Core concepts:

  • Metric name: the metric identifier (e.g., http_requests_total)
  • Labels: key-value pairs (e.g., {method="GET", status="200"})
  • Timestamp: the sample time
  • Value: a 64-bit float

Metric types:

  1. Counter: monotonically increasing (e.g., total request count)
  2. Gauge: can go up or down (e.g., memory usage)
  3. Histogram: bucketed distributions (e.g., latency distribution)
  4. Summary: client-side quantiles (e.g., P99 latency)

Example:

http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234 @1699876200

What is the difference between the pull and push models?

Answer:

Aspect            | Pull (Prometheus)              | Push (traditional monitoring)
Initiator         | Server scrapes targets         | Clients push data
Network           | Server-controlled, stable      | Client-controlled, failure-prone
Service discovery | Easy (scrape targets)          | Harder (requires registration)
Debugging         | Easy (/metrics endpoint)       | Harder
Short-lived jobs  | Not ideal (needs Push Gateway) | Well suited

Recommendation:

  • Long-running services: pull (Prometheus)
  • Short-lived tasks and batch jobs: push (Push Gateway)

How do you design alerting rules?

Answer:

Principles:

  1. Actionable: every alert should require human intervention
  2. Accurate: avoid false positives
  3. Timely: catch problems early
  4. Prioritized: distinguish Critical from Warning

Common alerts:

Availability:

# Service down
up == 0

Performance:

# High latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1

# High error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

Resources:

# High CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

# High memory
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85

Business:

# Order volume drop
rate(business_order_total[5m]) < 10

How do you prevent alert storms?

Answer:

Approaches:

1. Alert grouping (group_by)

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s

2. Alert inhibition (inhibit_rules)

inhibit_rules:
  # If an instance is down, suppress that instance's other alerts
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      instance: '.*'
    equal: ['instance']

3. Silences

# Silence an alert during a maintenance window
amtool silence add alertname=HighCPU --duration=2h

4. Repeat-interval throttling

route:
  repeat_interval: 12h  # the same alert is re-sent at most once every 12 hours

How do you make the monitoring system highly available?

Answer:

Prometheus high availability:

Option 1: Active-active deployment

Prometheus-1 ──┐
               ├──> Target (App)
Prometheus-2 ──┘

Two Prometheus servers scrape the same targets.

Pros: simple. Cons: duplicated data; queries need deduplication.

Option 2: Federation

Global Prometheus
      ↓
  ┌───┴───┐
  ↓       ↓
Prom-1  Prom-2
  ↓       ↓
App-1   App-2

Configuration:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="app"}'
    static_configs:
      - targets:
          - 'prometheus-1:9090'
          - 'prometheus-2:9090'

Option 3: Remote storage

Prometheus ──> Remote Write ──> Thanos/VictoriaMetrics
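
A hedged configuration sketch for option 3, streaming samples to a remote TSDB via remote_write (VictoriaMetrics is used here as an example backend; the endpoint and hostname are assumptions and differ for Thanos Receive or other systems):

# prometheus.yml (excerpt)
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
    queue_config:
      capacity: 20000
      max_samples_per_send: 5000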

How do you monitor Kubernetes?

Answer:

Core metrics:

1. Node metrics:

# Node CPU
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)

# Node memory
sum(container_memory_working_set_bytes{container!=""}) by (node)

2. Pod metrics:

# Pod CPU
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)

# Pod memory
sum(container_memory_working_set_bytes{pod=~"app-.*"}) by (pod)

3. Cluster metrics:

# Pod count
count(kube_pod_info)

# Unhealthy pods
count(kube_pod_status_phase{phase!="Running"})

Service discovery:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

What is the difference between logs and metrics?

Answer:

Aspect     | Logs                  | Metrics
Data type  | Discrete events       | Aggregated data
Storage    | Large (TB scale)      | Small (GB scale)
Query      | Full-text search      | Time-series queries
Freshness  | Moderate              | High
Cost       | High                  | Low
Purpose    | Troubleshooting       | Trend analysis, alerting

Usage:

  • Metrics: system health, performance trends, triggering alerts
  • Logs: troubleshooting, auditing, detailed context
  • Traces: distributed call-chain analysis

Used together:

Alert fires (Metrics) → inspect the logs (Logs) → analyze the call chain (Traces)

How do you reduce monitoring costs?

Answer:

1. Lower the sampling rate:

# Reduce the scrape frequency
scrape_interval: 60s  # e.g., from 15s to 60s

2. Shorter retention:

# Shorten the retention window
--storage.tsdb.retention.time=7d  # e.g., from 15d to 7d

3. Downsampling:

# Raw data → aggregated data
sum(rate(http_requests_total[5m])) by (endpoint)

4. Tiered storage:

  • Hot data (7 days): local Prometheus storage
  • Cold data (90 days): object storage (S3)

5. Label hygiene:

// Avoid high-cardinality labels
// Bad: user_id as a label (millions of series)
http_requests_total{user_id="12345"}

// Good: put it in the logs instead
log.Info("request", zap.String("user_id", "12345"))

How is distributed tracing implemented?

Answer:

Core concepts:

  • Trace: one end-to-end request
  • Span: a single unit of work
  • SpanContext: the trace context (TraceID, SpanID)

How it works:

1. Generate IDs:

TraceID: 64-bit unique ID
SpanID: 64-bit unique ID
ParentSpanID: the parent span's ID

2. Propagate the context:

// HTTP Header
X-Trace-ID: abc123
X-Span-ID: def456
X-Parent-Span-ID: 789

// gRPC Metadata
metadata.New(map[string]string{
    "trace-id": "abc123",
    "span-id": "def456",
})

3. Record spans:

span := tracer.StartSpan("CreateOrder")
span.SetTag("order_id", "ORD123")
span.SetTag("amount", 99.99)
span.Finish()

4. Report to the backend:

Agent ──> Collector ──> Storage ──> UI

How do you architect a monitoring system?

Answer:

Layered architecture:

┌─────────────────────────────────────────┐
│            Collection layer              │
│  Exporters / Agents / SDKs               │
└──────────────┬───────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│            Transport layer               │
│  Kafka / Fluentd / Vector                │
└──────────────┬───────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│             Storage layer                │
│  Prometheus / ES / ClickHouse            │
└──────────────┬───────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│              Query layer                 │
│  Grafana / Kibana / Jaeger UI            │
└─────────────────────────────────────────┘

Technology choices:

Category | Open source          | Commercial
Metrics  | Prometheus + Grafana | Datadog, New Relic
Logs     | ELK / Loki           | Splunk, Sumo Logic
Traces   | Jaeger / Zipkin      | Datadog APM
Unified  | -                    | Datadog, Dynatrace

Recommendations:

  • Small teams: Prometheus + Loki + Tempo (the Grafana stack)
  • Large teams: self-hosted core plus commercial monitoring (hybrid)

7. Summary

Key Takeaways

  1. Monitoring signal types

    • Metrics: indicators, trends, alerting
    • Logs: events, troubleshooting
    • Traces: call chains, performance
  2. Prometheus

    • Pull model
    • PromQL queries
    • Alerting rules
    • Service discovery
  3. Alert design

    • Actionable
    • Avoid false positives
    • Prevent alert storms
    • Severity tiers
  4. High availability

    • Active-active deployment
    • Federation
    • Remote storage

Best Practices

  1. Metric naming: <namespace>_<subsystem>_<metric>_<unit>
  2. Label design: avoid high cardinality
  3. Alerting rules: start loose, tighten iteratively
  4. Data retention: 7 days of hot data + 90 days of cold data
  5. Cost optimization: downsampling, remote storage, label hygiene
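
A few illustrative names for point 1 (hypothetical metrics, following the Prometheus convention of base units and a _total suffix for counters):

myapp_http_request_duration_seconds   # good: namespace + unit suffix
myapp_payment_failures_total          # good: counter ends in _total
requestTime                           # bad: no namespace, no unit, camelCase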

End of chapter. Good luck with your interview!
