集群架构
一、节点角色
1.1 节点角色类型
Elasticsearch 自 7.9 起引入 node.roles 配置,取代旧的 node.master、node.data 等布尔开关,提供更细粒度的节点角色划分。
核心角色:
1. Master节点(集群管理):
# elasticsearch.yml
node.roles: [ master ]
职责:
- 管理集群状态(cluster state)
- 创建/删除索引
- 分配分片到节点
- 维护节点加入/离开
特点:
- 不处理数据读写
- 轻量级,内存需求小
- 建议:3个master节点(奇数)
2. Data节点(数据存储):
node.roles: [ data ]
职责:
- 存储数据
- 执行CRUD操作
- 执行搜索和聚合
特点:
- IO和内存密集
- 可细分为data_hot、data_warm、data_cold
3. Coordinating节点(协调节点):
node.roles: [ ] # 空角色,专职协调
职责:
- 接收客户端请求
- 将请求路由到数据节点
- 汇总各节点结果
- 返回给客户端
特点:
- 不存储数据
- 适合做负载均衡
4. Ingest节点(数据预处理):
node.roles: [ ingest ]
职责:
- 数据写入前的预处理
- 执行Pipeline(解析、转换、enrichment)
示例:
- 解析日志格式
- IP地址转地理位置
- 时间格式转换
5. Machine Learning节点:
node.roles: [ ml ]
职责:
- 异常检测
- 机器学习任务
6. Transform节点:
node.roles: [ transform ]
职责:
- 数据转换
- 聚合计算
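配置完成后可以核对各节点实际承担的角色。下面是一个最小的 Python 草图(假设集群地址为 http://localhost:9200 且未开启安全认证,地址与认证方式按实际环境调整),通过 _cat/nodes 查看角色与当前主节点:
import requests

# _cat/nodes 的 node.role 列用缩写表示角色(如 m=master, d=data, i=ingest, "-"=仅协调节点),
# master 列中 "*" 标记当前选出的主节点
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,node.role,master"},
    timeout=10,
)
resp.raise_for_status()
print(resp.text)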
1.2 节点角色组合
小型集群(< 10节点):
# 所有节点都是master + data + ingest
node.roles: [ master, data, ingest ]
优点: 简单,资源利用率高
缺点: 角色混用导致资源竞争,高负载时性能不稳定
中型集群(10-50节点):
# 专用master节点(3个)
node.roles: [ master ]
# 数据节点
node.roles: [ data, ingest ]
优点: master独立,集群稳定
缺点: 需要更多硬件
大型集群(> 50节点):
# 专用master节点(3个)
node.roles: [ master ]
# 专用coordinating节点(2-3个)
node.roles: [ ]
# Hot数据节点
node.roles: [ data_hot, ingest ]
# Warm数据节点
node.roles: [ data_warm ]
# Cold数据节点
node.roles: [ data_cold ]
优点: 角色分离,性能最优
缺点: 架构复杂,成本高
1.3 节点角色最佳实践
Master节点配置:
# elasticsearch.yml
node.roles: [ master ]
node.name: master-1
# 发现配置(3个master)
discovery.seed_hosts: ["master-1", "master-2", "master-3"]
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]
# 硬件建议
CPU: 2-4核
内存: 4-8GB
磁盘: 普通SATA即可
Data Hot节点配置:
node.roles: [ data_hot, ingest ]
node.attr.box_type: hot
# 硬件建议
CPU: 8-16核
内存: 64GB+
磁盘: NVMe SSD
Data Warm节点配置:
node.roles: [ data_warm ]
node.attr.box_type: warm
# 硬件建议
CPU: 4-8核
内存: 32GB
磁盘: SATA SSD
Coordinating节点配置:
node.roles: [ ]
# 硬件建议
CPU: 8核
内存: 16-32GB
磁盘: 不存数据,小容量即可
二、分片分配
2.1 分片分配原理
路由计算:
# 写入时
shard_id = hash(_routing) % number_of_primary_shards
# 默认 _routing = _id,假设索引有3个主分片
shard_id = hash(document_id) % 3
# 示例
document_id = "product-123"
hash("product-123") = 987654321
987654321 % 3 = 0
→ 文档分配到 shard 0
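为直观理解上面的取模路由,这里给出一个纯演示性的 Python 草图:真实的 ES 使用 Murmur3 对 _routing 求哈希(7.x 之后还会经过 routing_num_shards 换算以支持 split/shrink),这里仅用 CRC32 代替,演示"同一 routing 永远落到同一分片"的性质:
import zlib

def toy_shard_id(routing: str, number_of_primary_shards: int) -> int:
    # 仅演示"hash取模"的路由思想:这里用CRC32代替ES真实使用的Murmur3哈希
    # ES 7.x 的真实实现还会经过 routing_num_shards / routing_factor 换算以支持 split/shrink
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# 同一个 _routing(默认为 _id)总是落在同一个分片,这也是主分片数不能修改的原因
for doc_id in ["product-123", "product-456", "product-789"]:
    print(doc_id, "->", "shard", toy_shard_id(doc_id, 3))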
副本分配原则:
- 主分片和副本不在同一节点
- 同一分片的多个副本分散到不同节点
- 考虑机架/可用区感知
分片分配示例:
集群: 3节点,索引products(3主分片,1副本)
节点1: P0, R1, R2
节点2: P1, R0, R2
节点3: P2, R0, R1
P = Primary Shard
R = Replica Shard
2.2 分片分配策略
Shard Allocation Filtering:
// 场景1:热温冷架构
PUT /logs-hot/_settings
{
"index.routing.allocation.require.box_type": "hot"
}
PUT /logs-warm/_settings
{
"index.routing.allocation.require.box_type": "warm"
}
// 场景2:排除特定节点
PUT /products/_settings
{
"index.routing.allocation.exclude._name": "node-1"
}
// 场景3:机架感知
PUT /products/_settings
{
"index.routing.allocation.include.rack": "rack1"
}
Cluster-level Shard Allocation:
// 禁用所有分片分配(维护时)
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}
// 只允许主分片分配
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
// 恢复正常
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
Rebalancing(分片再平衡):
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.rebalance.enable": "all", // all/primaries/replicas/none
"cluster.routing.allocation.allow_rebalance": "indices_all_active",
"cluster.routing.allocation.cluster_concurrent_rebalance": 2 // 并发数
}
}
2.3 分片分配问题排查
查看未分配分片:
# 查看所有分片状态
GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason
# 查看未分配原因
GET /_cluster/allocation/explain
{
"index": "products",
"shard": 0,
"primary": true
}
常见未分配原因:
{
"allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions": [
{
"node_name": "node-1",
"node_decision": "no",
"deciders": [
{
"decider": "same_shard",
"decision": "NO",
"explanation": "a copy of this shard is already allocated to this node"
}
]
}
]
}
解决方案:
// 1. 增加节点
// 2. 减少副本数
PUT /products/_settings
{
"number_of_replicas": 0
}
// 3. 强制分配(危险,可能丢数据)
POST /_cluster/reroute
{
"commands": [
{
"allocate_empty_primary": {
"index": "products",
"shard": 0,
"node": "node-1",
"accept_data_loss": true
}
}
]
}
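排查时往往需要对每个未分配分片逐一执行 allocation explain。下面是把上述两步串起来的 Python 草图(假设集群地址为 http://localhost:9200 且未开启认证):
import requests

ES = "http://localhost:9200"   # 假设无认证,按需添加auth

# 1. 找出所有未分配的分片
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,unassigned.reason"},
    timeout=10,
).json()
unassigned = [s for s in shards if s["state"] == "UNASSIGNED"]

# 2. 对每个未分配分片调用 allocation explain,打印无法分配的解释
for s in unassigned:
    explain = requests.post(
        f"{ES}/_cluster/allocation/explain",
        json={
            "index": s["index"],
            "shard": int(s["shard"]),
            "primary": s["prirep"] == "p",
        },
        timeout=10,
    ).json()
    print(s["index"], s["shard"], s["prirep"], "->",
          explain.get("allocate_explanation") or explain.get("unassigned_info"))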
2.4 Awareness属性
机架感知(Rack Awareness):
# node-1.yml (rack1)
node.attr.rack_id: rack1
# node-2.yml (rack1)
node.attr.rack_id: rack1
# node-3.yml (rack2)
node.attr.rack_id: rack2
// 配置感知
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "rack_id"
}
}
// 结果:主副本不在同一机架
Rack1: P0, P1, R2
Rack2: P2, R0, R1
可用区感知(Availability Zone):
# node-1.yml
node.attr.zone: us-east-1a
# node-2.yml
node.attr.zone: us-east-1b
# node-3.yml
node.attr.zone: us-east-1c
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone",
"cluster.routing.allocation.awareness.force.zone.values": [
"us-east-1a",
"us-east-1b",
"us-east-1c"
]
}
}
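可以写一个小脚本核对感知是否生效:把 _cat/nodeattrs 中的 zone 属性和 _cat/shards 的分片分布做对照(以下为示意脚本,假设集群地址为 http://localhost:9200、属性名沿用上文的 zone):
import requests
from collections import defaultdict

ES = "http://localhost:9200"   # 假设无认证

# 节点 -> 可用区
attrs = requests.get(f"{ES}/_cat/nodeattrs",
                     params={"format": "json", "h": "node,attr,value"}, timeout=10).json()
node_zone = {a["node"]: a["value"] for a in attrs if a["attr"] == "zone"}

# 检查同一分片的多个副本是否落在不同可用区
shards = requests.get(f"{ES}/_cat/shards",
                      params={"format": "json", "h": "index,shard,prirep,state,node"},
                      timeout=10).json()
copies = defaultdict(list)
for s in shards:
    if s["state"] == "STARTED":
        copies[(s["index"], s["shard"])].append(node_zone.get(s["node"], "unknown"))

for (index, shard), zones in copies.items():
    if len(zones) > 1 and len(set(zones)) == 1:
        print(f"WARNING: {index}[{shard}] 的所有副本都在同一可用区 {zones[0]}")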
三、脑裂问题
3.1 什么是脑裂
脑裂(Split-Brain):网络分区导致集群分裂成多个子集群,每个都认为自己是主集群。
场景:
原始集群: Master-1(当前主节点), Master-2, Master-3, Data-1, Data-2
网络分区后:
子集群A: Master-1, Data-1 (1个master候选节点)
子集群B: Master-2, Master-3, Data-2 (2个master候选节点)
在没有quorum保护的情况下:
- 子集群A: Master-1继续以主节点身份工作
- 子集群B: 探测不到Master-1,选举Master-2为新的主节点
- 两个子集群各自接受写入,独立运行
- 数据不一致
危害:
- 数据不一致
- 写入数据丢失
- 索引元数据冲突
3.2 脑裂预防
核心机制:quorum(法定人数)
minimum_master_nodes(ES 7.x前):
PUT /_cluster/settings
{
"persistent": {
"discovery.zen.minimum_master_nodes": 2 // (master_eligible_nodes / 2) + 1
}
}
// 3个master节点: (3/2) + 1 = 2
// 5个master节点: (5/2) + 1 = 3
ES 7.x后:自动quorum:
# elasticsearch.yml
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]
# 首次启动后自动计算quorum,无需手动配置
最佳实践:
Master节点数量:
- 1个: 不支持HA,不推荐生产环境
- 2个: 无法满足quorum,不推荐
- 3个: 推荐,允许1个节点故障
- 5个: 大集群,允许2个节点故障
- 7个: 超大集群,允许3个节点故障
原则: 奇数个,至少3个
3.3 脑裂检测
监控指标:
# 查看master节点
GET /_cat/master?v
# 查看集群健康
GET /_cluster/health
{
"cluster_name": "my-cluster",
"status": "green",
"number_of_nodes": 5,
"number_of_data_nodes": 3,
"active_primary_shards": 15,
"active_shards": 30,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0 // > 0表示有问题
}
# 查看节点信息
GET /_cat/nodes?v
# 分别请求各个节点的 GET /_cat/master,若不同节点报告的主节点不一致,说明可能发生了脑裂
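脑裂检测可以脚本化:分别访问每个节点,比较它们各自报告的主节点是否一致(以下节点地址仅为示例,按实际环境替换):
import requests

NODES = ["http://10.0.0.1:9200", "http://10.0.0.2:9200", "http://10.0.0.3:9200"]

masters = {}
for node in NODES:
    try:
        info = requests.get(f"{node}/_cat/master", params={"format": "json"}, timeout=5).json()
        masters[node] = info[0]["node"] if info else None
    except requests.RequestException as exc:
        masters[node] = f"unreachable: {exc}"

print(masters)
# 正常情况下所有可达节点报告同一个主节点;
# 若不同节点报告的主节点不一致,则可能发生了脑裂或选举尚未收敛
reported = {m for m in masters.values() if m and not str(m).startswith("unreachable")}
if len(reported) > 1:
    print("WARNING: nodes disagree on the elected master (possible split-brain)")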
脑裂恢复:
# 1. 停止所有节点
systemctl stop elasticsearch
# 2. 修复网络问题
# 3. 按顺序启动master节点
systemctl start elasticsearch # master-1
systemctl start elasticsearch # master-2
systemctl start elasticsearch # master-3
# 4. 启动数据节点
systemctl start elasticsearch # data节点
3.4 网络分区容错
配置超时:
# elasticsearch.yml
# ES 7.x 之前(Zen Discovery)的节点间故障检测超时
discovery.zen.fd.ping_timeout: 30s # 默认30s
discovery.zen.fd.ping_retries: 3 # 默认3次
discovery.zen.fd.ping_interval: 1s # ping间隔
# ES 7.x 及之后改用以下故障检测设置
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.follower_check.timeout: 10s
避免假死误判:
# 适当增大故障检测超时,避免长时间GC停顿导致节点被误判下线
# 7.x之前: discovery.zen.fd.ping_timeout: 60s
# 7.x及之后: 调大上面的 leader_check / follower_check 超时
四、集群监控
4.1 集群健康监控
健康状态:
GET /_cluster/health
{
"status": "green", // green/yellow/red
"number_of_nodes": 5,
"number_of_data_nodes": 3,
"active_primary_shards": 15,
"active_shards": 30,
"unassigned_shards": 0
}
状态含义:
| 状态 | 含义 | 影响 | 处理 |
|---|---|---|---|
| green | 所有主分片和副本都已分配 | 无 | 正常 |
| yellow | 所有主分片已分配,部分副本未分配 | 可用性降低 | 检查节点数量 |
| red | 部分主分片未分配 | 对应数据不可用,可能丢失 | 紧急处理 |
索引级健康:
GET /_cluster/health?level=indices
GET /_cluster/health/products?level=shards
4.2 节点监控
节点统计:
GET /_nodes/stats
{
"nodes": {
"node-1": {
"indices": {
"docs": { "count": 1000000 },
"store": { "size_in_bytes": 5368709120 },
"indexing": {
"index_total": 1000000,
"index_time_in_millis": 300000
},
"search": {
"query_total": 500000,
"query_time_in_millis": 120000
}
},
"jvm": {
"mem": {
"heap_used_percent": 65,
"heap_max_in_bytes": 2147483648
},
"gc": {
"collectors": {
"young": { "collection_count": 1000 },
"old": { "collection_count": 10 }
}
}
},
"os": {
"cpu": { "percent": 45 },
"mem": { "used_percent": 80 }
},
"fs": {
"total": {
"total_in_bytes": 107374182400,
"free_in_bytes": 53687091200,
"available_in_bytes": 53687091200
}
}
}
}
}
关键指标:
1. JVM堆内存使用率:
- < 75%: 正常
- 75%-85%: 警告
- > 85%: 危险,需扩容
2. GC频率:
- Old GC频繁: 内存不足
- Young GC停顿久: 堆太大
3. CPU使用率:
- < 70%: 正常
- > 90%: 需优化或扩容
4. 磁盘使用率:
- < 85%: 正常
- > 85%: 开始触发watermark
- > 95%: 只读模式
4.3 性能监控
慢查询日志:
PUT /_cluster/settings
{
"persistent": {
"logger.index.search.slowlog": "DEBUG",
"logger.index.indexing.slowlog": "DEBUG"
}
}
PUT /products/_settings
{
"index.search.slowlog.threshold.query.warn": "2s",
"index.search.slowlog.threshold.query.info": "1s",
"index.search.slowlog.threshold.query.debug": "500ms",
"index.indexing.slowlog.threshold.index.warn": "2s",
"index.indexing.slowlog.threshold.index.info": "1s"
}
Hot Threads(CPU占用高的线程):
GET /_nodes/hot_threads
GET /_nodes/node-1/hot_threads
Task Management:
# 查看正在运行的任务
GET /_tasks
# 取消长时间运行的任务
POST /_tasks/{task_id}/_cancel
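下面是一个用 Python 列出运行时间最长的搜索任务、并在人工确认后取消的草图(假设集群地址为 http://localhost:9200 且未开启认证):
import requests

ES = "http://localhost:9200"   # 假设无认证

# 列出正在执行的搜索类任务,按已运行时间降序
tasks = requests.get(
    f"{ES}/_tasks",
    params={"detailed": "true", "actions": "*search*"},
    timeout=10,
).json()

running = []
for node_id, node in tasks.get("nodes", {}).items():
    for task_id, task in node.get("tasks", {}).items():
        running.append((task["running_time_in_nanos"], task_id, task.get("description", "")))

for nanos, task_id, desc in sorted(running, reverse=True)[:10]:
    print(f"{nanos / 1e9:8.1f}s  {task_id}  {desc[:80]}")
    # 确认确实需要终止后,再调用取消接口:
    # requests.post(f"{ES}/_tasks/{task_id}/_cancel", timeout=10)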
4.4 集群监控工具
1. Elasticsearch自带监控:
# 开启监控(ES 7.x+)
PUT /_cluster/settings
{
"persistent": {
"xpack.monitoring.collection.enabled": true
}
}
2. Kibana监控:
- 可视化集群健康
- 节点性能指标
- 索引统计
3. Prometheus + Grafana:
# elasticsearch_exporter
docker run -d \
--name elasticsearch_exporter \
-p 9114:9114 \
prometheuscommunity/elasticsearch-exporter:latest \
--es.uri=http://elasticsearch:9200
4. 自定义监控脚本:
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

def alert(message):
    # 实际环境中可替换为邮件、钉钉、PagerDuty等通知渠道
    print(f"[ALERT] {message}")

# 监控集群健康
health = es.cluster.health()
if health['status'] == 'red':
    alert("Cluster is RED!")

# 监控JVM堆内存
stats = es.nodes.stats()
for node_id, node in stats['nodes'].items():
    heap_percent = node['jvm']['mem']['heap_used_percent']
    if heap_percent > 85:
        alert(f"Node {node['name']} heap is {heap_percent}%")

# 监控磁盘使用(按已用空间占比计算)
for node_id, node in stats['nodes'].items():
    fs = node['fs']['total']
    disk_percent = (fs['total_in_bytes'] - fs['available_in_bytes']) / fs['total_in_bytes'] * 100
    if disk_percent > 85:
        alert(f"Node {node['name']} disk is {disk_percent:.1f}%")
4.5 告警规则
关键告警:
# Prometheus告警规则
groups:
- name: elasticsearch
rules:
- alert: ClusterRed
expr: elasticsearch_cluster_health_status{color="red"} == 1
for: 5m
annotations:
summary: "Cluster status is RED"
- alert: ClusterYellow
expr: elasticsearch_cluster_health_status{color="yellow"} == 1
for: 15m
annotations:
summary: "Cluster status is YELLOW for 15 minutes"
- alert: HighJVMMemory
expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
for: 10m
annotations:
summary: "JVM heap usage > 85%"
- alert: HighCPU
expr: elasticsearch_os_cpu_percent > 90
for: 10m
annotations:
summary: "CPU usage > 90%"
- alert: LowDiskSpace
expr: (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_available_bytes) / elasticsearch_filesystem_data_size_bytes > 0.85
for: 5m
annotations:
summary: "Disk usage > 85%"
- alert: TooManyPendingTasks
expr: elasticsearch_cluster_health_number_of_pending_tasks > 100
for: 10m
annotations:
summary: "Too many pending tasks"
五、高频面试题
为什么Master节点要奇数个?
答案: 基于quorum机制(法定人数)防止脑裂。
quorum公式:
quorum = (master_eligible_nodes / 2) + 1 (整数除法,向下取整)
3个master: quorum = 2,允许1个故障
4个master: quorum = 3,允许1个故障 (浪费资源)
5个master: quorum = 3,允许2个故障
奇数vs偶数:
- 3个和4个容错能力一样,但4个浪费资源
- 5个和6个容错能力一样,6个浪费资源
结论:奇数个更经济,推荐3或5个。
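用几行 Python 就能把这组对比算出来:
def quorum(master_eligible_nodes: int) -> int:
    # 法定人数: 过半数,即 floor(n/2) + 1
    return master_eligible_nodes // 2 + 1

for n in range(1, 8):
    q = quorum(n)
    print(f"{n}个master候选节点: quorum={q}, 最多容忍{n - q}个故障")
# 可以看到3和4、5和6的容错能力相同,偶数个只是多花硬件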
如何避免脑裂?
答案:
配置quorum:
- ES 7.x前:discovery.zen.minimum_master_nodes
- ES 7.x后:自动计算
至少3个Master节点:
- 2个无法满足quorum
- 3个允许1个故障
网络优化:
- 专用网络
- 低延迟
- 避免跨机房
合理超时:
- 增加ping_timeout
- 避免GC导致误判
集群状态为Yellow怎么办?
原因:副本分片未分配。
排查步骤:
# 1. 查看未分配分片
GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason
# 2. 查看具体原因
GET /_cluster/allocation/explain
# 3. 常见原因
- 节点数不足(副本数>节点数-1)
- 磁盘水位超标
- 分片分配策略限制
解决方案:
// 方案1:增加节点
// 方案2:减少副本数
PUT /products/_settings
{
"number_of_replicas": 0
}
// 方案3:调整磁盘水位
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.high": "95%"
}
}
主分片数能修改吗?为什么?
答案:不能。
原因:路由算法依赖主分片数。
shard_id = hash(document_id) % number_of_primary_shards
# 如果修改主分片数,路由结果变化,文档无法定位
原来: hash("doc-1") % 3 = 1 → shard 1
修改后: hash("doc-1") % 5 = 3 → shard 3 (找不到文档!)
解决方案:Reindex到新索引。
POST /_reindex
{
"source": {
"index": "old_index"
},
"dest": {
"index": "new_index"
}
}
// 使用别名切换
POST /_aliases
{
"actions": [
{ "remove": { "index": "old_index", "alias": "products" } },
{ "add": { "index": "new_index", "alias": "products" } }
]
}
如何实现集群的高可用?
答案:
1. 节点层面:
- 至少3个Master节点(奇数)
- 每个索引至少1个副本
- 跨机架/可用区部署
2. 网络层面:
- 专用内网
- 冗余网络链路
- 负载均衡器(多Coordinating节点)
3. 数据层面:
- 定期备份(snapshot)
- 多副本(replicas)
- 跨集群复制(CCR)
4. 监控告警:
- 集群健康监控
- 节点资源监控
- 自动告警
架构示例:
负载均衡器
↓
Coordinating节点(2个)
↓
Master节点(3个,奇数)
↓
Data Hot节点(3个,AZ1/AZ2/AZ3)
Data Warm节点(3个,AZ1/AZ2/AZ3)
配置:
- number_of_replicas: 1
- cluster.routing.allocation.awareness.attributes: zone
六、实战技巧
6.1 滚动重启集群
场景:升级ES版本或修改配置,需重启集群。
步骤:
# 1. 禁用分片分配
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
# 2. 停止一个节点
systemctl stop elasticsearch
# 3. 升级或修改配置
yum update elasticsearch
vi /etc/elasticsearch/elasticsearch.yml
# 4. 启动节点
systemctl start elasticsearch
# 5. 等待节点加入集群
GET /_cat/nodes?v
# 6. 重复2-5,逐个重启其他节点
# 7. 恢复分片分配
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
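实际操作中,第5步"等待节点加入"和第7步之后"等待集群恢复green"适合脚本化。下面是一个辅助函数草图(假设通过 http://localhost:9200 访问集群且未开启认证,节点的停止/启动仍由 systemctl 完成):
import time
import requests

ES = "http://localhost:9200"   # 假设的协调节点地址

def wait_for_nodes(expected: int, timeout_s: int = 600):
    # 对应第5步: 等待重启的节点重新加入集群
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
        if health["number_of_nodes"] >= expected:
            return health
        time.sleep(5)
    raise TimeoutError("node did not rejoin in time")

def wait_for_green(timeout_s: int = 1800):
    # 对应第7步之后: 恢复分片分配,等待集群回到green再处理下一个节点
    requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_status": "green", "timeout": f"{timeout_s}s"},
        timeout=timeout_s + 30,
    )

# 每个节点的循环: 禁用分配 -> systemctl重启 -> wait_for_nodes(5) -> 恢复分配 -> wait_for_green()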
6.2 磁盘水位管理
水位阈值:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.threshold_enabled": true,
"cluster.routing.allocation.disk.watermark.low": "85%", // 低水位
"cluster.routing.allocation.disk.watermark.high": "90%", // 高水位
"cluster.routing.allocation.disk.watermark.flood_stage": "95%" // 洪水位
}
}
水位触发行为:
- 低水位(85%):不分配新分片到该节点
- 高水位(90%):尝试将分片迁移到其他节点
- 洪水位(95%):索引变为只读
只读恢复:
PUT /products/_settings
{
"index.blocks.read_only_allow_delete": null
}
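日常巡检时可以用下面的 Python 草图对照各数据节点的磁盘使用率与三档水位(阈值按上文默认值 85%/90%/95%,假设集群地址为 http://localhost:9200 且未开启认证):
import requests

ES = "http://localhost:9200"   # 假设无认证

# 查看各数据节点磁盘使用率,对照低/高/洪水位
rows = requests.get(
    f"{ES}/_cat/allocation",
    params={"format": "json", "h": "node,disk.percent,disk.used,disk.avail,shards"},
    timeout=10,
).json()

for r in rows:
    pct = r.get("disk.percent")
    if pct is None:        # 没有磁盘信息的行(如 UNASSIGNED)跳过
        continue
    pct = float(pct)
    level = "ok"
    if pct >= 95:
        level = "flood_stage(索引将变只读)"
    elif pct >= 90:
        level = "high(开始迁移分片)"
    elif pct >= 85:
        level = "low(不再分配新分片)"
    print(f"{r['node']:<20} disk={pct:5.1f}%  shards={r['shards']}  -> {level}")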
6.3 备份恢复
配置Repository:
PUT /_snapshot/my_backup
{
"type": "fs",
"settings": {
"location": "/mount/backups/my_backup",
"compress": true
}
}
创建快照:
PUT /_snapshot/my_backup/snapshot_1
{
"indices": "products,orders",
"ignore_unavailable": true,
"include_global_state": false
}
恢复快照:
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "products",
"rename_pattern": "(.+)",
"rename_replacement": "restored_$1"
}
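快照也可以脚本化执行并等待完成。下面是一个 Python 草图(仓库名沿用上文 my_backup,快照名 snapshot_2 为示例,假设未开启认证):
import requests

ES = "http://localhost:9200"
REPO, SNAPSHOT = "my_backup", "snapshot_2"

# 创建快照并同步等待完成(小索引适用;大索引建议异步创建后再轮询状态)
resp = requests.put(
    f"{ES}/_snapshot/{REPO}/{SNAPSHOT}",
    params={"wait_for_completion": "true"},
    json={"indices": "products,orders", "include_global_state": False},
    timeout=3600,
)
resp.raise_for_status()
snap = resp.json()["snapshot"]
print(snap["state"], snap["shards"])    # 期望 state 为 SUCCESS,failed 分片数为 0

# 随时查看仓库内全部快照
print(requests.get(f"{ES}/_snapshot/{REPO}/_all", timeout=30).json())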
6.4 跨集群搜索
// 配置远程集群
PUT /_cluster/settings
{
"persistent": {
"cluster.remote.cluster_two.seeds": [
"10.0.1.1:9300",
"10.0.1.2:9300"
]
}
}
// 跨集群搜索
GET /local_index,cluster_two:remote_index/_search
{
"query": {
"match": { "title": "elasticsearch" }
}
}