15 - Monitoring System Design

> Interview frequency:

Requirement type     | Target
Collection interval  | 10-60 seconds
Data retention       | Raw data 7 days, aggregated data 90 days
Query latency        | < 1 second
Alert latency        | < 1 minute
Availability         | 99.9%
Scalability          | 100,000+ metrics

1.3 Follow-up Questions the Interviewer May Ask

Q1: What is the difference between a monitoring system and a logging system?

A1:

  • Monitoring: focuses on metrics, trends, and alerting (Metrics)
  • Logging: focuses on events, context, and troubleshooting (Logs)
  • Tracing: focuses on call chains and performance analysis (Traces)

Q2: What is the difference between Prometheus and Zabbix?

A2:

  • Prometheus: pull model, time-series database, cloud-native
  • Zabbix: agent-based collection, relational database backend, traditional infrastructure operations
  • Choosing: pick Prometheus for cloud-native environments, Zabbix for traditional infrastructure

2. Capacity Estimation

2.1 Scenario Assumptions

Assume we monitor 1,000 servers and 500 applications:

  • Servers: 1,000
  • Metrics per server: 100 (CPU, memory, disk, etc.)
  • Application instances: 5,000
  • Metrics per instance: 50
  • Scrape interval: 15 seconds

2.2 Data Volume

Server metrics:
1,000 servers × 100 metrics × 4 samples/minute = 400,000 data points/minute

Application metrics:
5,000 instances × 50 metrics × 4 samples/minute = 1,000,000 data points/minute

Total: 1.4 million data points/minute ≈ 23,333 data points/second

2.3 Storage

One data point = timestamp (8 bytes) + value (8 bytes) = 16 bytes

Raw data (7 days):
1.4 million/minute × 60 × 24 × 7 ≈ 14.1 billion points × 16 bytes ≈ 226 GB

Prometheus compresses samples to roughly 1-2 bytes each, about an 8× reduction:
226 GB / 8 ≈ 28 GB

2.4 Query QPS

Number of dashboards = 200
Average refresh interval = 30 seconds
Queries per dashboard = 10

Query QPS = 200 × 10 / 30 ≈ 67 QPS
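
As a quick sanity check, the arithmetic above can be reproduced in a few lines of Go (a throwaway sketch; the constants simply restate the assumptions from section 2.1):

package main

import "fmt"

func main() {
    const (
        servers          = 1000
        metricsPerServer = 100
        instances        = 5000
        metricsPerInst   = 50
        samplesPerMin    = 4  // one sample every 15 seconds
        rawBytesPerPoint = 16 // 8-byte timestamp + 8-byte value
        tsdbBytesPerPt   = 2  // rough Prometheus compression estimate
    )

    pointsPerMin := (servers*metricsPerServer + instances*metricsPerInst) * samplesPerMin
    points7d := float64(pointsPerMin) * 60 * 24 * 7

    fmt.Printf("points/min: %d, points/sec: %.0f\n", pointsPerMin, float64(pointsPerMin)/60)
    fmt.Printf("raw 7d: %.0f GB, compressed 7d: %.0f GB\n",
        points7d*rawBytesPerPoint/1e9, points7d*tsdbBytesPerPt/1e9)
    fmt.Printf("query QPS: %.0f\n", 200.0*10/30) // dashboards × queries / refresh interval
}

Running it prints roughly 1.4 million points/minute (≈ 23,333/s), about 226 GB raw and 28 GB compressed for 7 days, and about 67 query QPS.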

3. Architecture Design

3.1 Overall Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Data sources                          │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Servers  │  │   Apps   │  │ Databases│  │Middleware│     │
│  │ (Node)   │  │  (App)   │  │ (MySQL)  │  │ (Redis)  │     │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘     │
└───────┼─────────────┼─────────────┼─────────────┼───────────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────┐
│                      Collection layer                        │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Node     │  │ App      │  │ MySQL    │  │ Filebeat │     │
│  │ Exporter │  │ Exporter │  │ Exporter │  │ (logs)   │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
└─────────────┬────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│                       Storage layer                          │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Prometheus   │  │ Elasticsearch│  │    Jaeger    │       │
│  │  (metrics)   │  │    (logs)    │  │   (traces)   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────┬────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Query & alerting layer                     │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐                          │
│  │ AlertManager │  │   Grafana    │                          │
│  │  (alerting)  │  │ (dashboards) │                          │
│  └──────────────┘  └──────────────┘                          │
└─────────────┬────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Notification layer                       │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                    │
│  │  Email   │  │   SMS    │  │ DingTalk │                    │
│  └──────────┘  └──────────┘  └──────────┘                    │
└─────────────────────────────────────────────────────────────┘

3.2 V1: Prometheus + Grafana

Best suited for: small to mid-size companies, microservice architectures

package main

import (
    "fmt"
    "net/http"
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "github.com/prometheus/client_golang/prometheus/push"
)

// ==================== Custom Exporter ====================

// Metric definitions
var (
    // Counter: monotonically increasing
    requestTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    
    // Gauge: can go up or down
    activeConnections = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
    
    // Histogram: bucketed distribution
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latencies",
            Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1, 2, 5},
        },
        []string{"method", "endpoint"},
    )
    
    // Summary: client-side quantiles
    requestSize = prometheus.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_request_size_bytes",
            Help:       "HTTP request sizes",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Register the metrics
    prometheus.MustRegister(requestTotal)
    prometheus.MustRegister(activeConnections)
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(requestSize)
}

// MetricsMiddleware records metrics for every HTTP request
func MetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        
        // Track the number of active connections
        activeConnections.Inc()
        defer activeConnections.Dec()
        
        // Wrap the ResponseWriter to capture the status code
        rw := &responseWriter{ResponseWriter: w, statusCode: 200}
        
        // Handle the request
        next.ServeHTTP(rw, r)
        
        // Record metrics
        duration := time.Since(start).Seconds()
        
        requestTotal.WithLabelValues(
            r.Method,
            r.URL.Path,
            fmt.Sprintf("%d", rw.statusCode),
        ).Inc()
        
        requestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(duration)
        
        requestSize.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(float64(r.ContentLength))
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// ==================== Business Metrics ====================

// BusinessMetrics holds application-level business metrics
type BusinessMetrics struct {
    orderTotal   prometheus.Counter
    orderAmount  prometheus.Gauge
    userRegister prometheus.Counter
}

func NewBusinessMetrics() *BusinessMetrics {
    return &BusinessMetrics{
        orderTotal: prometheus.NewCounter(
            prometheus.CounterOpts{
                Name: "business_order_total",
                Help: "Total number of orders",
            },
        ),
        orderAmount: prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "business_order_amount_yuan",
                Help: "Total order amount in yuan",
            },
        ),
        userRegister: prometheus.NewCounter(
            prometheus.CounterOpts{
                Name: "business_user_register_total",
                Help: "Total number of user registrations",
            },
        ),
    }
}

func (m *BusinessMetrics) RecordOrder(amount float64) {
    m.orderTotal.Inc()
    m.orderAmount.Add(amount)
}

func (m *BusinessMetrics) RecordUserRegister() {
    m.userRegister.Inc()
}

// ==================== Custom Collector ====================

// CustomCollector gathers metrics on demand at scrape time
type CustomCollector struct {
    queueSizeDesc *prometheus.Desc
    queueLatency  *prometheus.Desc
}

func NewCustomCollector() *CustomCollector {
    return &CustomCollector{
        queueSizeDesc: prometheus.NewDesc(
            "queue_size",
            "Current queue size",
            []string{"queue_name"},
            nil,
        ),
        queueLatency: prometheus.NewDesc(
            "queue_latency_seconds",
            "Queue processing latency",
            []string{"queue_name"},
            nil,
        ),
    }
}

// Describe implements the prometheus.Collector interface
func (c *CustomCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.queueSizeDesc
    ch <- c.queueLatency
}

// Collect implements the prometheus.Collector interface
func (c *CustomCollector) Collect(ch chan<- prometheus.Metric) {
    // Simulate fetching metrics from a queue
    queueSize := c.getQueueSize("order_queue")
    queueLatency := c.getQueueLatency("order_queue")
    
    ch <- prometheus.MustNewConstMetric(
        c.queueSizeDesc,
        prometheus.GaugeValue,
        float64(queueSize),
        "order_queue",
    )
    
    ch <- prometheus.MustNewConstMetric(
        c.queueLatency,
        prometheus.GaugeValue,
        queueLatency,
        "order_queue",
    )
}

func (c *CustomCollector) getQueueSize(queueName string) int {
    // Real implementation: read the queue length from Redis, RabbitMQ, etc.
    return 100
}

func (c *CustomCollector) getQueueLatency(queueName string) float64 {
    // Real implementation: compute the queue processing latency
    return 0.5
}

// ==================== Push Gateway ====================
// (github.com/prometheus/client_golang/prometheus/push is imported at the top of the file)

// PushMetrics pushes metrics from a short-lived job to the Push Gateway
func PushMetrics() {
    // Create a throwaway metric for this job run
    jobDuration := prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "job_duration_seconds",
            Help: "Job duration",
        },
    )
    
    // Measure the job duration
    start := time.Now()
    doJob()
    duration := time.Since(start).Seconds()
    
    jobDuration.Set(duration)
    
    // Push to the Push Gateway
    if err := push.New("http://localhost:9091", "batch_job").
        Collector(jobDuration).
        Grouping("instance", "batch-server-1").
        Push(); err != nil {
        fmt.Printf("Could not push to Pushgateway: %v\n", err)
    }
}

func doJob() {
    time.Sleep(2 * time.Second)
}

// ==================== Main ====================

func main() {
    // Register the custom Collector
    customCollector := NewCustomCollector()
    prometheus.MustRegister(customCollector)
    
    // Business metrics
    businessMetrics := NewBusinessMetrics()
    prometheus.MustRegister(
        businessMetrics.orderTotal,
        businessMetrics.orderAmount,
        businessMetrics.userRegister,
    )
    
    // HTTP server
    mux := http.NewServeMux()
    
    mux.HandleFunc("/api/order", func(w http.ResponseWriter, r *http.Request) {
        // Simulate placing an order
        businessMetrics.RecordOrder(99.99)
        w.Write([]byte("Order created"))
    })
    
    // Prometheus metrics endpoint
    mux.Handle("/metrics", promhttp.Handler())
    
    // Apply the metrics middleware
    handler := MetricsMiddleware(mux)
    
    fmt.Println("Server started at :8080")
    fmt.Println("Metrics available at http://localhost:8080/metrics")
    http.ListenAndServe(":8080", handler)
}

Prometheus configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alerting rule files
rule_files:
  - "alert_rules.yml"

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Application services
  - job_name: 'app'
    static_configs:
      - targets: 
          - 'app-server-1:8080'
          - 'app-server-2:8080'
        labels:
          env: 'production'
          region: 'us-west'
  
  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-1:9100'
          - 'node-2:9100'
  
  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
  
  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  
  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Alerting rules:

# alert_rules.yml
groups:
  - name: app_alerts
    interval: 30s
    rules:
      # API error-rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint) /
          sum(rate(http_requests_total[5m])) by (endpoint) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.endpoint }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
      
      # API latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.endpoint }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
      
      # CPU usage alert
      - alert: HighCPU
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"
      
      # Memory usage alert
      - alert: HighMemory
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 
          node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% (threshold: 85%)"
      
      # Disk usage alert
      - alert: HighDiskUsage
        expr: |
          (node_filesystem_size_bytes - node_filesystem_free_bytes) / 
          node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on {{ $labels.mountpoint }}"
      
      # Instance-down alert
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 1 minute"
      
      # Business-metric alert
      - alert: OrderDropRate
        expr: |
          rate(business_order_total[5m]) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Order rate is too low"
          description: "Order rate is {{ $value }} orders/sec (threshold: 10)"

AlertManager configuration:

# alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  
  routes:
    # Send critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 0s
      repeat_interval: 5m
    
    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning'
      repeat_interval: 1h

# Inhibition rules (avoid alert storms)
inhibit_rules:
  # If an instance is down, suppress that instance's other alerts
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      instance: '.*'
    equal: ['instance']

# Receiver configuration
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
  
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alert@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alert@example.com'
        auth_password: 'password'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
        send_resolved: true
  
  - name: 'warning'
    email_configs:
      - to: 'team@example.com'

V1 characteristics:

  • Simple to operate, active community
  • Flexible PromQL queries
  • Deep Kubernetes integration
  • Single-node storage, limited scalability
  • Long-term storage requires an external solution
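
For a local proof of concept of this V1 stack, a minimal Docker Compose file along the following lines is enough. This is a sketch using the stock images; the mounted file paths are assumptions and must match where you keep the configuration files shown above:

# docker-compose.yml (sketch)
version: "3"
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
  alertmanager:
    image: prom/alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
  node-exporter:
    image: prom/node-exporter
    ports: ["9100:9100"]

Grafana then reads from Prometheus at http://prometheus:9090, and the application from section 3.2 exposes /metrics on port 8080 for Prometheus to scrape.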

3.3 V2: ELK Logging Stack

Architecture: Elasticsearch + Logstash + Kibana (or Filebeat)

// ==================== Structured Logging ====================

package logging

import (
    "fmt"
    "os"
    "time"

    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

// Logger is the shared structured logger
var Logger *zap.Logger

func init() {
    // Configure the encoder
    encoderConfig := zapcore.EncoderConfig{
        TimeKey:        "timestamp",
        LevelKey:       "level",
        NameKey:        "logger",
        CallerKey:      "caller",
        MessageKey:     "message",
        StacktraceKey:  "stacktrace",
        LineEnding:     zapcore.DefaultLineEnding,
        EncodeLevel:    zapcore.LowercaseLevelEncoder,
        EncodeTime:     zapcore.ISO8601TimeEncoder,
        EncodeDuration: zapcore.SecondsDurationEncoder,
        EncodeCaller:   zapcore.ShortCallerEncoder,
    }
    
    // JSON output (easy for the ELK stack to parse)
    core := zapcore.NewCore(
        zapcore.NewJSONEncoder(encoderConfig),
        zapcore.AddSync(os.Stdout),
        zap.InfoLevel,
    )
    
    Logger = zap.New(core, zap.AddCaller(), zap.AddStacktrace(zapcore.ErrorLevel))
}

// Usage example
func ExampleUsage() {
    Logger.Info("User login",
        zap.String("user_id", "12345"),
        zap.String("ip", "192.168.1.100"),
        zap.Duration("duration", 150*time.Millisecond),
    )
    
    Logger.Error("Payment failed",
        zap.String("order_id", "ORD123"),
        zap.Float64("amount", 99.99),
        zap.Error(fmt.Errorf("insufficient balance")),
    )
}

Filebeat configuration:

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      app: myapp
      env: production
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

# Output to Elasticsearch
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

# Alternative: output to Logstash
# output.logstash:
#   hosts: ["logstash:5044"]

# Processors
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true
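
If logs are shipped through Logstash instead of directly to Elasticsearch (the commented-out output above), a minimal pipeline could look like this sketch; the index name mirrors the Filebeat config, and the filter assumes the JSON lines produced by the zap logger:

# logstash.conf (sketch)
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse the JSON log line emitted by the application
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}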

Elasticsearch index template:

{
  "index_patterns": ["app-logs-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text" },
      "user_id": { "type": "keyword" },
      "order_id": { "type": "keyword" },
      "ip": { "type": "ip" },
      "duration": { "type": "float" },
      "error": { "type": "text" }
    }
  }
}
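
With this mapping in place, a typical troubleshooting query is straightforward. For example, a sketch of fetching the last hour of error-level logs for one order (the order ID is a placeholder; field names follow the template above):

GET app-logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "level": "error" } },
        { "term":  { "order_id": "ORD123" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "timestamp": "desc" }]
}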

3.4 V3: Distributed Tracing (Jaeger)

// ==================== Distributed Tracing ====================

package tracing

import (
    "context"
    "io"
    "net/http"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    "github.com/uber/jaeger-client-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
)

// InitTracer initializes the Jaeger tracer
func InitTracer(serviceName string) (opentracing.Tracer, io.Closer, error) {
    cfg := jaegercfg.Configuration{
        ServiceName: serviceName,
        Sampler: &jaegercfg.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1, // 100% sampling rate
        },
        Reporter: &jaegercfg.ReporterConfig{
            LogSpans:            true,
            BufferFlushInterval: 1 * time.Second,
            LocalAgentHostPort:  "localhost:6831",
        },
    }
    
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        return nil, nil, err
    }
    
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer, nil
}

// TracingMiddleware starts a server-side span for each HTTP request
func TracingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tracer := opentracing.GlobalTracer()
        
        // Extract the SpanContext from the incoming HTTP headers
        spanCtx, _ := tracer.Extract(
            opentracing.HTTPHeaders,
            opentracing.HTTPHeadersCarrier(r.Header),
        )
        
        // Start a span
        span := tracer.StartSpan(
            r.URL.Path,
            ext.RPCServerOption(spanCtx),
        )
        defer span.Finish()
        
        // Set tags
        ext.HTTPMethod.Set(span, r.Method)
        ext.HTTPUrl.Set(span, r.URL.String())
        
        // Attach the span to the request context
        ctx := opentracing.ContextWithSpan(r.Context(), span)
        
        // Continue handling the request
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// Business-code example
func CreateOrder(ctx context.Context, orderID string) error {
    // Start a child span
    span, ctx := opentracing.StartSpanFromContext(ctx, "CreateOrder")
    defer span.Finish()
    
    span.SetTag("order_id", orderID)
    
    // Call the inventory service
    if err := checkInventory(ctx, orderID); err != nil {
        ext.Error.Set(span, true)
        span.LogKV("event", "error", "message", err.Error())
        return err
    }
    
    // Call the payment service
    if err := processPayment(ctx, orderID); err != nil {
        ext.Error.Set(span, true)
        return err
    }
    
    return nil
}

func checkInventory(ctx context.Context, orderID string) error {
    span, _ := opentracing.StartSpanFromContext(ctx, "CheckInventory")
    defer span.Finish()
    
    // Simulate an RPC call
    time.Sleep(50 * time.Millisecond)
    
    span.SetTag("inventory_available", true)
    return nil
}

func processPayment(ctx context.Context, orderID string) error {
    span, _ := opentracing.StartSpanFromContext(ctx, "ProcessPayment")
    defer span.Finish()
    
    // Simulate payment processing
    time.Sleep(100 * time.Millisecond)
    
    span.SetTag("payment_status", "success")
    return nil
}
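
// TracingMiddleware above extracts the incoming span context. When this service in
// turn calls a downstream HTTP API, the context also has to be injected into the
// outgoing request so the callee can continue the trace. A minimal sketch using the
// same OpenTracing API (the inventory-service URL is a placeholder):

// callInventory shows client-side context propagation for an outbound HTTP call.
func callInventory(ctx context.Context, orderID string) error {
    span, _ := opentracing.StartSpanFromContext(ctx, "CallInventoryHTTP")
    defer span.Finish()

    req, err := http.NewRequest("GET", "http://inventory-service/api/check?order="+orderID, nil)
    if err != nil {
        return err
    }

    // Tag the span as an RPC client and inject the trace context into the headers.
    ext.SpanKindRPCClient.Set(span)
    ext.HTTPMethod.Set(span, req.Method)
    ext.HTTPUrl.Set(span, req.URL.String())
    if err := opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    ); err != nil {
        return err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        ext.Error.Set(span, true)
        return err
    }
    defer resp.Body.Close()
    return nil
}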

4. PromQL Query Examples

4.1 Basic Queries

# Query the current value
http_requests_total

# Filter by labels
http_requests_total{method="GET", status="200"}

# Regex matching
http_requests_total{endpoint=~"/api/.*"}

# Range query (last 5 minutes)
http_requests_total[5m]

# Per-second rate
rate(http_requests_total[5m])

# Sum
sum(http_requests_total) by (endpoint)

# Average latency (from the histogram's sum and count)
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint) /
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)

# Quantile
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

4.2 Common Monitoring Queries

# QPS
sum(rate(http_requests_total[1m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 
node_memory_MemTotal_bytes * 100

# Disk usage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / 
node_filesystem_size_bytes * 100

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

5. Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "Application Monitoring",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "type": "graph",
        "yaxes": [
          {
            "format": "percentunit"
          }
        ]
      },
      {
        "title": "P99 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

6. Interview Q&A (10 Frequently Asked Questions)

What is Prometheus's data model?

Answer:

Core concepts:

  • Metric name: the metric identifier (e.g., http_requests_total)
  • Labels: key-value pairs (e.g., {method="GET", status="200"})
  • Timestamp: the sample time
  • Value: a 64-bit float

Metric types:

  1. Counter: monotonically increasing (e.g., total request count)
  2. Gauge: can go up or down (e.g., memory usage)
  3. Histogram: bucketed distributions (e.g., latency distribution)
  4. Summary: client-side quantiles (e.g., P99 latency)

Example:

http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234 @1699876200

What is the difference between the pull and push models?

Answer:

Aspect            | Pull (Prometheus)              | Push (traditional monitoring)
Initiator         | Server scrapes targets         | Clients push data
Network           | Server-controlled, stable      | Client-controlled, failure-prone
Service discovery | Easy (scrape targets)          | Harder (requires registration)
Debugging         | Easy (/metrics endpoint)       | Harder
Short-lived jobs  | Not ideal (needs Push Gateway) | Well suited

Recommendation:

  • Long-running services: pull (Prometheus)
  • Short-lived tasks and batch jobs: push (Push Gateway)

How do you design alerting rules?

Answer:

Principles:

  1. Actionable: every alert should require human intervention
  2. Accurate: avoid false positives
  3. Timely: catch problems early
  4. Prioritized: distinguish Critical from Warning

Common alerts:

Availability:

# Service down
up == 0

Performance:

# High latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1

# High error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

Resources:

# High CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

# High memory
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85

Business:

# Order volume drop
rate(business_order_total[5m]) < 10

How do you prevent alert storms?

Answer:

Approaches:

1. Alert grouping (group_by)

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s

2. Alert inhibition (inhibit_rules)

inhibit_rules:
  # If an instance is down, suppress that instance's other alerts
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      instance: '.*'
    equal: ['instance']

3. Silences

# Silence an alert during a maintenance window
amtool silence add alertname=HighCPU --duration=2h

4. Repeat-interval throttling

route:
  repeat_interval: 12h  # the same alert is re-sent at most once every 12 hours

How do you make the monitoring system highly available?

Answer:

Prometheus high availability:

Option 1: Active-active deployment

Prometheus-1 ──┐
               ├──> Target (App)
Prometheus-2 ──┘

Two Prometheus servers scrape the same targets.

Pros: simple. Cons: duplicated data; queries need deduplication.

Option 2: Federation

Global Prometheus
      ↓
  ┌───┴───┐
  ↓       ↓
Prom-1  Prom-2
  ↓       ↓
App-1   App-2

Configuration:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="app"}'
    static_configs:
      - targets:
          - 'prometheus-1:9090'
          - 'prometheus-2:9090'

Option 3: Remote storage

Prometheus ──> Remote Write ──> Thanos/VictoriaMetrics
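
A hedged configuration sketch for option 3, streaming samples to a remote TSDB via remote_write (VictoriaMetrics is used here as an example backend; the endpoint and hostname are assumptions and differ for Thanos Receive or other systems):

# prometheus.yml (excerpt)
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
    queue_config:
      capacity: 20000
      max_samples_per_send: 5000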

How do you monitor Kubernetes?

Answer:

Core metrics:

1. Node metrics:

# Node CPU
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)

# Node memory
sum(container_memory_working_set_bytes{container!=""}) by (node)

2. Pod metrics:

# Pod CPU
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)

# Pod memory
sum(container_memory_working_set_bytes{pod=~"app-.*"}) by (pod)

3. Cluster metrics:

# Pod count
count(kube_pod_info)

# Unhealthy pods
count(kube_pod_status_phase{phase!="Running"})

Service discovery:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

What is the difference between logs and metrics?

Answer:

Aspect     | Logs                  | Metrics
Data type  | Discrete events       | Aggregated data
Storage    | Large (TB scale)      | Small (GB scale)
Query      | Full-text search      | Time-series queries
Freshness  | Moderate              | High
Cost       | High                  | Low
Purpose    | Troubleshooting       | Trend analysis, alerting

Usage:

  • Metrics: system health, performance trends, triggering alerts
  • Logs: troubleshooting, auditing, detailed context
  • Traces: distributed call-chain analysis

Used together:

Alert fires (Metrics) → inspect the logs (Logs) → analyze the call chain (Traces)

How do you reduce monitoring costs?

Answer:

1. Lower the sampling rate:

# Reduce the scrape frequency
scrape_interval: 60s  # e.g., from 15s to 60s

2. Shorter retention:

# Shorten the retention window
--storage.tsdb.retention.time=7d  # e.g., from 15d to 7d

3. Downsampling:

# Raw data → aggregated data
sum(rate(http_requests_total[5m])) by (endpoint)

4. Tiered storage:

  • Hot data (7 days): local Prometheus storage
  • Cold data (90 days): object storage (S3)

5. Label hygiene:

// Avoid high-cardinality labels
// Bad: user_id as a label (millions of series)
http_requests_total{user_id="12345"}

// Good: put it in the logs instead
log.Info("request", zap.String("user_id", "12345"))

How is distributed tracing implemented?

Answer:

Core concepts:

  • Trace: one end-to-end request
  • Span: a single unit of work
  • SpanContext: the trace context (TraceID, SpanID)

How it works:

1. Generate IDs:

TraceID: 64-bit unique ID
SpanID: 64-bit unique ID
ParentSpanID: the parent span's ID

2. Propagate the context:

// HTTP Header
X-Trace-ID: abc123
X-Span-ID: def456
X-Parent-Span-ID: 789

// gRPC Metadata
metadata.New(map[string]string{
    "trace-id": "abc123",
    "span-id": "def456",
})

3. Record spans:

span := tracer.StartSpan("CreateOrder")
span.SetTag("order_id", "ORD123")
span.SetTag("amount", 99.99)
span.Finish()

4. Report to the backend:

Agent ──> Collector ──> Storage ──> UI

How do you architect a monitoring system?

Answer:

Layered architecture:

┌─────────────────────────────────────────┐
│            Collection layer              │
│  Exporters / Agents / SDKs               │
└──────────────┬───────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│            Transport layer               │
│  Kafka / Fluentd / Vector                │
└──────────────┬───────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│             Storage layer                │
│  Prometheus / ES / ClickHouse            │
└──────────────┬───────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│              Query layer                 │
│  Grafana / Kibana / Jaeger UI            │
└─────────────────────────────────────────┘

Technology choices:

Category | Open source          | Commercial
Metrics  | Prometheus + Grafana | Datadog, New Relic
Logs     | ELK / Loki           | Splunk, Sumo Logic
Traces   | Jaeger / Zipkin      | Datadog APM
Unified  | -                    | Datadog, Dynatrace

Recommendations:

  • Small teams: Prometheus + Loki + Tempo (the Grafana stack)
  • Large teams: self-hosted core plus commercial monitoring (hybrid)

7. Summary

Key Takeaways

  1. Monitoring signal types

    • Metrics: indicators, trends, alerting
    • Logs: events, troubleshooting
    • Traces: call chains, performance
  2. Prometheus

    • Pull model
    • PromQL queries
    • Alerting rules
    • Service discovery
  3. Alert design

    • Actionable
    • Avoid false positives
    • Prevent alert storms
    • Severity tiers
  4. High availability

    • Active-active deployment
    • Federation
    • Remote storage

Best Practices

  1. Metric naming: <namespace>_<subsystem>_<metric>_<unit>
  2. Label design: avoid high cardinality
  3. Alerting rules: start loose, tighten iteratively
  4. Data retention: 7 days of hot data + 90 days of cold data
  5. Cost optimization: downsampling, remote storage, label hygiene
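
A few illustrative names for point 1 (hypothetical metrics, following the Prometheus convention of base units and a _total suffix for counters):

myapp_http_request_duration_seconds   # good: namespace + unit suffix
myapp_payment_failures_total          # good: counter ends in _total
requestTime                           # bad: no namespace, no unit, camelCase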

End of chapter. Good luck with your interview!
