集群架构
一、节点角色
1.1 节点角色类型
Elasticsearch 自 7.9 起引入 node.roles 配置,取代旧的 node.master、node.data 等布尔开关,提供更细粒度的节点角色划分。
核心角色:
1. Master节点(集群管理):
# elasticsearch.yml
node.roles: [ master ]
职责:
- 管理集群状态(cluster state)
- 创建/删除索引
- 分配分片到节点
- 维护节点加入/离开
特点:
- 不处理数据读写
- 轻量级,内存需求小
- 建议:3个master节点(奇数)
2. Data节点(数据存储):
node.roles: [ data ]
职责:
- 存储数据
- 执行CRUD操作
- 执行搜索和聚合
特点:
- IO和内存密集
- 可细分为data_hot、data_warm、data_cold
3. Coordinating节点(协调节点):
node.roles: [ ] # 空角色,专职协调
职责:
- 接收客户端请求
- 将请求路由到数据节点
- 汇总各节点结果
- 返回给客户端
特点:
- 不存储数据
- 适合做负载均衡
4. Ingest节点(数据预处理):
node.roles: [ ingest ]
职责:
- 数据写入前的预处理
- 执行Pipeline(解析、转换、enrichment)
示例:
- 解析日志格式
- IP地址转地理位置
- 时间格式转换
5. Machine Learning节点:
node.roles: [ ml ]
职责:
- 异常检测
- 机器学习任务
6. Transform节点:
node.roles: [ transform ]
职责:
- 数据转换
- 聚合计算
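配置完成后可以核对各节点实际承担的角色。下面是一个最小的 Python 草图(假设集群地址为 http://localhost:9200 且未开启安全认证,地址与认证方式按实际环境调整),通过 _cat/nodes 查看角色与当前主节点:
import requests

# _cat/nodes 的 node.role 列用缩写表示角色(如 m=master, d=data, i=ingest, "-"=仅协调节点),
# master 列中 "*" 标记当前选出的主节点
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,node.role,master"},
    timeout=10,
)
resp.raise_for_status()
print(resp.text)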
1.2 节点角色组合
小型集群(< 10节点):
# 所有节点都是master + data + ingest
node.roles: [ master, data, ingest ]
优点: 简单,资源利用率高
缺点: 角色混用导致资源竞争,高负载时性能不稳定
中型集群(10-50节点):
# 专用master节点(3个)
node.roles: [ master ]
# 数据节点
node.roles: [ data, ingest ]
优点: master独立,集群稳定
缺点: 需要更多硬件
大型集群(> 50节点):
# 专用master节点(3个)
node.roles: [ master ]
# 专用coordinating节点(2-3个)
node.roles: [ ]
# Hot数据节点
node.roles: [ data_hot, ingest ]
# Warm数据节点
node.roles: [ data_warm ]
# Cold数据节点
node.roles: [ data_cold ]
优点: 角色分离,性能最优
缺点: 架构复杂,成本高
1.3 节点角色最佳实践
Master节点配置:
# elasticsearch.yml
node.roles: [ master ]
node.name: master-1
# 发现配置(3个master)
discovery.seed_hosts: ["master-1", "master-2", "master-3"]
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]
# 硬件建议
CPU: 2-4核
内存: 4-8GB
磁盘: 普通SATA即可
Data Hot节点配置:
node.roles: [ data_hot, ingest ]
node.attr.box_type: hot
# 硬件建议
CPU: 8-16核
内存: 64GB+
磁盘: NVMe SSD
Data Warm节点配置:
node.roles: [ data_warm ]
node.attr.box_type: warm
# 硬件建议
CPU: 4-8核
内存: 32GB
磁盘: SATA SSD
Coordinating节点配置:
node.roles: [ ]
# 硬件建议
CPU: 8核
内存: 16-32GB
磁盘: 不存数据,小容量即可
二、分片分配
2.1 分片分配原理
路由计算:
# 写入时
shard_id = hash(_routing) % number_of_primary_shards
# 默认 _routing = _id,假设索引有3个主分片
shard_id = hash(document_id) % 3
# 示例
document_id = "product-123"
hash("product-123") = 987654321
987654321 % 3 = 0
→ 文档分配到 shard 0
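为直观理解上面的取模路由,这里给出一个纯演示性的 Python 草图:真实的 ES 使用 Murmur3 对 _routing 求哈希(7.x 之后还会经过 routing_num_shards 换算以支持 split/shrink),这里仅用 CRC32 代替,演示"同一 routing 永远落到同一分片"的性质:
import zlib

def toy_shard_id(routing: str, number_of_primary_shards: int) -> int:
    # 仅演示"hash取模"的路由思想:这里用CRC32代替ES真实使用的Murmur3哈希
    # ES 7.x 的真实实现还会经过 routing_num_shards / routing_factor 换算以支持 split/shrink
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# 同一个 _routing(默认为 _id)总是落在同一个分片,这也是主分片数不能修改的原因
for doc_id in ["product-123", "product-456", "product-789"]:
    print(doc_id, "->", "shard", toy_shard_id(doc_id, 3))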
副本分配原则:
- 主分片和副本不在同一节点
- 同一分片的多个副本分散到不同节点
- 考虑机架/可用区感知
分片分配示例:
集群: 3节点,索引products(3主分片,1副本)
节点1: P0, R1, R2
节点2: P1, R0, R2
节点3: P2, R0, R1
P = Primary Shard
R = Replica Shard
2.2 分片分配策略
Shard Allocation Filtering:
// 场景1:热温冷架构
PUT /logs-hot/_settings
{
"index.routing.allocation.require.box_type": "hot"
}
PUT /logs-warm/_settings
{
"index.routing.allocation.require.box_type": "warm"
}
// 场景2:排除特定节点
PUT /products/_settings
{
"index.routing.allocation.exclude._name": "node-1"
}
// 场景3:机架感知
PUT /products/_settings
{
"index.routing.allocation.include.rack": "rack1"
}
Cluster-level Shard Allocation:
// 禁用所有分片分配(维护时)
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}
// 只允许主分片分配
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
// 恢复正常
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
Rebalancing(分片再平衡):
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.rebalance.enable": "all", // all/primaries/replicas/none
"cluster.routing.allocation.allow_rebalance": "indices_all_active",
"cluster.routing.allocation.cluster_concurrent_rebalance": 2 // 并发数
}
}
2.3 分片分配问题排查
查看未分配分片:
# 查看所有分片状态
GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason
# 查看未分配原因
GET /_cluster/allocation/explain
{
"index": "products",
"shard": 0,
"primary": true
}
常见未分配原因:
{
"allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions": [
{
"node_name": "node-1",
"node_decision": "no",
"deciders": [
{
"decider": "same_shard",
"decision": "NO",
"explanation": "a copy of this shard is already allocated to this node"
}
]
}
]
}
解决方案:
// 1. 增加节点
// 2. 减少副本数
PUT /products/_settings
{
"number_of_replicas": 0
}
// 3. 强制分配(危险,可能丢数据)
POST /_cluster/reroute
{
"commands": [
{
"allocate_empty_primary": {
"index": "products",
"shard": 0,
"node": "node-1",
"accept_data_loss": true
}
}
]
}
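排查时往往需要对每个未分配分片逐一执行 allocation explain。下面是把上述两步串起来的 Python 草图(假设集群地址为 http://localhost:9200 且未开启认证):
import requests

ES = "http://localhost:9200"   # 假设无认证,按需添加auth

# 1. 找出所有未分配的分片
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,unassigned.reason"},
    timeout=10,
).json()
unassigned = [s for s in shards if s["state"] == "UNASSIGNED"]

# 2. 对每个未分配分片调用 allocation explain,打印无法分配的解释
for s in unassigned:
    explain = requests.post(
        f"{ES}/_cluster/allocation/explain",
        json={
            "index": s["index"],
            "shard": int(s["shard"]),
            "primary": s["prirep"] == "p",
        },
        timeout=10,
    ).json()
    print(s["index"], s["shard"], s["prirep"], "->",
          explain.get("allocate_explanation") or explain.get("unassigned_info"))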
2.4 Awareness属性
机架感知(Rack Awareness):
# node-1.yml (rack1)
node.attr.rack_id: rack1
# node-2.yml (rack1)
node.attr.rack_id: rack1
# node-3.yml (rack2)
node.attr.rack_id: rack2
// 配置感知
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "rack_id"
}
}
// 结果:主副本不在同一机架
Rack1: P0, P1, R2
Rack2: P2, R0, R1
可用区感知(Availability Zone):
# node-1.yml
node.attr.zone: us-east-1a
# node-2.yml
node.attr.zone: us-east-1b
# node-3.yml
node.attr.zone: us-east-1c
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone",
"cluster.routing.allocation.awareness.force.zone.values": [
"us-east-1a",
"us-east-1b",
"us-east-1c"
]
}
}
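可以写一个小脚本核对感知是否生效:把 _cat/nodeattrs 中的 zone 属性和 _cat/shards 的分片分布做对照(以下为示意脚本,假设集群地址为 http://localhost:9200、属性名沿用上文的 zone):
import requests
from collections import defaultdict

ES = "http://localhost:9200"   # 假设无认证

# 节点 -> 可用区
attrs = requests.get(f"{ES}/_cat/nodeattrs",
                     params={"format": "json", "h": "node,attr,value"}, timeout=10).json()
node_zone = {a["node"]: a["value"] for a in attrs if a["attr"] == "zone"}

# 检查同一分片的多个副本是否落在不同可用区
shards = requests.get(f"{ES}/_cat/shards",
                      params={"format": "json", "h": "index,shard,prirep,state,node"},
                      timeout=10).json()
copies = defaultdict(list)
for s in shards:
    if s["state"] == "STARTED":
        copies[(s["index"], s["shard"])].append(node_zone.get(s["node"], "unknown"))

for (index, shard), zones in copies.items():
    if len(zones) > 1 and len(set(zones)) == 1:
        print(f"WARNING: {index}[{shard}] 的所有副本都在同一可用区 {zones[0]}")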
三、脑裂问题
3.1 什么是脑裂
脑裂(Split-Brain):网络分区导致集群分裂成多个子集群,每个都认为自己是主集群。
场景:
原始集群: Master-1(当前主节点), Master-2, Master-3, Data-1, Data-2
网络分区后:
子集群A: Master-1, Data-1 (1个master候选节点)
子集群B: Master-2, Master-3, Data-2 (2个master候选节点)
在没有quorum保护的情况下:
- 子集群A: Master-1继续以主节点身份工作
- 子集群B: 探测不到Master-1,选举Master-2为新的主节点
- 两个子集群各自接受写入,独立运行
- 数据不一致
危害:
- 数据不一致
- 写入数据丢失
- 索引元数据冲突
3.2 脑裂预防
核心机制:quorum(法定人数)
minimum_master_nodes(ES 7.x前):
PUT /_cluster/settings
{
"persistent": {
"discovery.zen.minimum_master_nodes": 2 // (master_eligible_nodes / 2) + 1
}
}
// 3个master节点: (3/2) + 1 = 2
// 5个master节点: (5/2) + 1 = 3
ES 7.x后:自动quorum:
# elasticsearch.yml
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]
# 首次启动后自动计算quorum,无需手动配置
最佳实践:
Master节点数量:
- 1个: 不支持HA,不推荐生产环境
- 2个: 无法满足quorum,不推荐
- 3个: 推荐,允许1个节点故障
- 5个: 大集群,允许2个节点故障
- 7个: 超大集群,允许3个节点故障
原则: 奇数个,至少3个
3.3 脑裂检测
监控指标:
# 查看master节点
GET /_cat/master?v
# 查看集群健康
GET /_cluster/health
{
"cluster_name": "my-cluster",
"status": "green",
"number_of_nodes": 5,
"number_of_data_nodes": 3,
"active_primary_shards": 15,
"active_shards": 30,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0 // > 0表示有问题
}
# 查看节点信息
GET /_cat/nodes?v
# 分别请求各个节点的 GET /_cat/master,若不同节点报告的主节点不一致,说明可能发生了脑裂
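脑裂检测可以脚本化:分别访问每个节点,比较它们各自报告的主节点是否一致(以下节点地址仅为示例,按实际环境替换):
import requests

NODES = ["http://10.0.0.1:9200", "http://10.0.0.2:9200", "http://10.0.0.3:9200"]

masters = {}
for node in NODES:
    try:
        info = requests.get(f"{node}/_cat/master", params={"format": "json"}, timeout=5).json()
        masters[node] = info[0]["node"] if info else None
    except requests.RequestException as exc:
        masters[node] = f"unreachable: {exc}"

print(masters)
# 正常情况下所有可达节点报告同一个主节点;
# 若不同节点报告的主节点不一致,则可能发生了脑裂或选举尚未收敛
reported = {m for m in masters.values() if m and not str(m).startswith("unreachable")}
if len(reported) > 1:
    print("WARNING: nodes disagree on the elected master (possible split-brain)")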
脑裂恢复:
# 1. 停止所有节点
systemctl stop elasticsearch
# 2. 修复网络问题
# 3. 按顺序启动master节点
systemctl start elasticsearch # master-1
systemctl start elasticsearch # master-2
systemctl start elasticsearch # master-3
# 4. 启动数据节点
systemctl start elasticsearch # data节点
3.4 网络分区容错
配置超时:
# elasticsearch.yml
# ES 7.x 之前(Zen Discovery)的节点间故障检测超时
discovery.zen.fd.ping_timeout: 30s # 默认30s
discovery.zen.fd.ping_retries: 3 # 默认3次
discovery.zen.fd.ping_interval: 1s # ping间隔
# ES 7.x 及之后改用以下故障检测设置
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.follower_check.timeout: 10s
避免假死误判:
# 适当增大故障检测超时,避免长时间GC停顿导致节点被误判下线
# 7.x之前: discovery.zen.fd.ping_timeout: 60s
# 7.x及之后: 调大上面的 leader_check / follower_check 超时
四、集群监控
4.1 集群健康监控
健康状态:
GET /_cluster/health
{
"status": "green", // green/yellow/red
"number_of_nodes": 5,
"number_of_data_nodes": 3,
"active_primary_shards": 15,
"active_shards": 30,
"unassigned_shards": 0
}
状态含义:
| 状态 | 含义 | 影响 | 处理 |
|---|---|---|---|
| green | 所有主分片和副本都已分配 | 无 | 正常 |
| yellow | 所有主分片已分配,部分副本未分配 | 可用性降低 | 检查节点数量 |
| red | 部分主分片未分配 | 对应数据不可用,可能丢失 | 紧急处理 |
索引级健康:
GET /_cluster/health?level=indices
GET /_cluster/health/products?level=shards
4.2 节点监控
节点统计:
GET /_nodes/stats
{
"nodes": {
"node-1": {
"indices": {
"docs": { "count": 1000000 },
"store": { "size_in_bytes": 5368709120 },
"indexing": {
"index_total": 1000000,
"index_time_in_millis": 300000
},
"search": {
"query_total": 500000,
"query_time_in_millis": 120000
}
},
"jvm": {
"mem": {
"heap_used_percent": 65,
"heap_max_in_bytes": 2147483648
},
"gc": {
"collectors": {
"young": { "collection_count": 1000 },
"old": { "collection_count": 10 }
}
}
},
"os": {
"cpu": { "percent": 45 },
"mem": { "used_percent": 80 }
},
"fs": {
"total": {
"total_in_bytes": 107374182400,
"free_in_bytes": 53687091200,
"available_in_bytes": 53687091200
}
}
}
}
}
关键指标:
1. JVM堆内存使用率:
- < 75%: 正常
- 75%-85%: 警告
- > 85%: 危险,需扩容
2. GC频率:
- Old GC频繁: 内存不足
- Young GC停顿久: 堆太大
3. CPU使用率:
- < 70%: 正常
- > 90%: 需优化或扩容
4. 磁盘使用率:
- < 85%: 正常
- > 85%: 开始触发watermark
- > 95%: 只读模式
4.3 性能监控
慢查询日志:
PUT /_cluster/settings
{
"persistent": {
"logger.index.search.slowlog": "DEBUG",
"logger.index.indexing.slowlog": "DEBUG"
}
}
PUT /products/_settings
{
"index.search.slowlog.threshold.query.warn": "2s",
"index.search.slowlog.threshold.query.info": "1s",
"index.search.slowlog.threshold.query.debug": "500ms",
"index.indexing.slowlog.threshold.index.warn": "2s",
"index.indexing.slowlog.threshold.index.info": "1s"
}
Hot Threads(CPU占用高的线程):
GET /_nodes/hot_threads
GET /_nodes/node-1/hot_threads
Task Management:
# 查看正在运行的任务
GET /_tasks
# 取消长时间运行的任务
POST /_tasks/{task_id}/_cancel
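下面是一个用 Python 列出运行时间最长的搜索任务、并在人工确认后取消的草图(假设集群地址为 http://localhost:9200 且未开启认证):
import requests

ES = "http://localhost:9200"   # 假设无认证

# 列出正在执行的搜索类任务,按已运行时间降序
tasks = requests.get(
    f"{ES}/_tasks",
    params={"detailed": "true", "actions": "*search*"},
    timeout=10,
).json()

running = []
for node_id, node in tasks.get("nodes", {}).items():
    for task_id, task in node.get("tasks", {}).items():
        running.append((task["running_time_in_nanos"], task_id, task.get("description", "")))

for nanos, task_id, desc in sorted(running, reverse=True)[:10]:
    print(f"{nanos / 1e9:8.1f}s  {task_id}  {desc[:80]}")
    # 确认确实需要终止后,再调用取消接口:
    # requests.post(f"{ES}/_tasks/{task_id}/_cancel", timeout=10)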
4.4 集群监控工具
1. Elasticsearch自带监控:
# 开启监控(ES 7.x+)
PUT /_cluster/settings
{
"persistent": {
"xpack.monitoring.collection.enabled": true
}
}
2. Kibana监控:
- 可视化集群健康
- 节点性能指标
- 索引统计
3. Prometheus + Grafana:
# elasticsearch_exporter
docker run -d \
--name elasticsearch_exporter \
-p 9114:9114 \
prometheuscommunity/elasticsearch-exporter:latest \
--es.uri=http://elasticsearch:9200
4. 自定义监控脚本:
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

def alert(message):
    # 实际环境中可替换为邮件、钉钉、PagerDuty等通知渠道
    print(f"[ALERT] {message}")

# 监控集群健康
health = es.cluster.health()
if health['status'] == 'red':
    alert("Cluster is RED!")

# 监控JVM堆内存
stats = es.nodes.stats()
for node_id, node in stats['nodes'].items():
    heap_percent = node['jvm']['mem']['heap_used_percent']
    if heap_percent > 85:
        alert(f"Node {node['name']} heap is {heap_percent}%")

# 监控磁盘使用(按已用空间占比计算)
for node_id, node in stats['nodes'].items():
    fs = node['fs']['total']
    disk_percent = (fs['total_in_bytes'] - fs['available_in_bytes']) / fs['total_in_bytes'] * 100
    if disk_percent > 85:
        alert(f"Node {node['name']} disk is {disk_percent:.1f}%")
4.5 告警规则
关键告警:
# Prometheus告警规则
groups:
- name: elasticsearch
rules:
- alert: ClusterRed
expr: elasticsearch_cluster_health_status{color="red"} == 1
for: 5m
annotations:
summary: "Cluster status is RED"
- alert: ClusterYellow
expr: elasticsearch_cluster_health_status{color="yellow"} == 1
for: 15m
annotations:
summary: "Cluster status is YELLOW for 15 minutes"
- alert: HighJVMMemory
expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
for: 10m
annotations:
summary: "JVM heap usage > 85%"
- alert: HighCPU
expr: elasticsearch_os_cpu_percent > 90
for: 10m
annotations:
summary: "CPU usage > 90%"
- alert: LowDiskSpace
expr: (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_available_bytes) / elasticsearch_filesystem_data_size_bytes > 0.85
for: 5m
annotations:
summary: "Disk usage > 85%"
- alert: TooManyPendingTasks
expr: elasticsearch_cluster_health_number_of_pending_tasks > 100
for: 10m
annotations:
summary: "Too many pending tasks"
五、高频面试题
为什么Master节点要奇数个?
答案: 基于quorum机制(法定人数)防止脑裂。
quorum公式:
quorum = (master_eligible_nodes / 2) + 1 (整数除法,向下取整)
3个master: quorum = 2,允许1个故障
4个master: quorum = 3,允许1个故障 (浪费资源)
5个master: quorum = 3,允许2个故障
奇数vs偶数:
- 3个和4个容错能力一样,但4个浪费资源
- 5个和6个容错能力一样,6个浪费资源
结论:奇数个更经济,推荐3或5个。
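用几行 Python 就能把这组对比算出来:
def quorum(master_eligible_nodes: int) -> int:
    # 法定人数: 过半数,即 floor(n/2) + 1
    return master_eligible_nodes // 2 + 1

for n in range(1, 8):
    q = quorum(n)
    print(f"{n}个master候选节点: quorum={q}, 最多容忍{n - q}个故障")
# 可以看到3和4、5和6的容错能力相同,偶数个只是多花硬件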
如何避免脑裂?
答案:
配置quorum:
- ES 7.x前:discovery.zen.minimum_master_nodes
- ES 7.x后:自动计算
至少3个Master节点:
- 2个无法满足quorum
- 3个允许1个故障
网络优化:
- 专用网络
- 低延迟
- 避免跨机房
合理超时:
- 增加ping_timeout
- 避免GC导致误判
集群状态为Yellow怎么办?
原因:副本分片未分配。
排查步骤:
# 1. 查看未分配分片
GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason
# 2. 查看具体原因
GET /_cluster/allocation/explain
# 3. 常见原因
- 节点数不足(副本数>节点数-1)
- 磁盘水位超标
- 分片分配策略限制
解决方案:
// 方案1:增加节点
// 方案2:减少副本数
PUT /products/_settings
{
"number_of_replicas": 0
}
// 方案3:调整磁盘水位
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.high": "95%"
}
}
主分片数能修改吗?为什么?
答案:不能。
原因:路由算法依赖主分片数。
shard_id = hash(document_id) % number_of_primary_shards
# 如果修改主分片数,路由结果变化,文档无法定位
原来: hash("doc-1") % 3 = 1 → shard 1
修改后: hash("doc-1") % 5 = 3 → shard 3 (找不到文档!)
解决方案:Reindex到新索引。
POST /_reindex
{
"source": {
"index": "old_index"
},
"dest": {
"index": "new_index"
}
}
// 使用别名切换
POST /_aliases
{
"actions": [
{ "remove": { "index": "old_index", "alias": "products" } },
{ "add": { "index": "new_index", "alias": "products" } }
]
}
如何实现集群的高可用?
答案:
1. 节点层面:
- 至少3个Master节点(奇数)
- 每个索引至少1个副本
- 跨机架/可用区部署
2. 网络层面:
- 专用内网
- 冗余网络链路
- 负载均衡器(多Coordinating节点)
3. 数据层面:
- 定期备份(snapshot)
- 多副本(replicas)
- 跨集群复制(CCR)
4. 监控告警:
- 集群健康监控
- 节点资源监控
- 自动告警
架构示例:
负载均衡器
↓
Coordinating节点(2个)
↓
Master节点(3个,奇数)
↓
Data Hot节点(3个,AZ1/AZ2/AZ3)
Data Warm节点(3个,AZ1/AZ2/AZ3)
配置:
- number_of_replicas: 1
- cluster.routing.allocation.awareness.attributes: zone
六、实战技巧
6.1 滚动重启集群
场景:升级ES版本或修改配置,需重启集群。
步骤:
# 1. 禁用分片分配
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
# 2. 停止一个节点
systemctl stop elasticsearch
# 3. 升级或修改配置
yum update elasticsearch
vi /etc/elasticsearch/elasticsearch.yml
# 4. 启动节点
systemctl start elasticsearch
# 5. 等待节点加入集群
GET /_cat/nodes?v
# 6. 重复2-5,逐个重启其他节点
# 7. 恢复分片分配
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
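实际操作中,第5步"等待节点加入"和第7步之后"等待集群恢复green"适合脚本化。下面是一个辅助函数草图(假设通过 http://localhost:9200 访问集群且未开启认证,节点的停止/启动仍由 systemctl 完成):
import time
import requests

ES = "http://localhost:9200"   # 假设的协调节点地址

def wait_for_nodes(expected: int, timeout_s: int = 600):
    # 对应第5步: 等待重启的节点重新加入集群
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
        if health["number_of_nodes"] >= expected:
            return health
        time.sleep(5)
    raise TimeoutError("node did not rejoin in time")

def wait_for_green(timeout_s: int = 1800):
    # 对应第7步之后: 恢复分片分配,等待集群回到green再处理下一个节点
    requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_status": "green", "timeout": f"{timeout_s}s"},
        timeout=timeout_s + 30,
    )

# 每个节点的循环: 禁用分配 -> systemctl重启 -> wait_for_nodes(5) -> 恢复分配 -> wait_for_green()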
6.2 磁盘水位管理
水位阈值:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.threshold_enabled": true,
"cluster.routing.allocation.disk.watermark.low": "85%", // 低水位
"cluster.routing.allocation.disk.watermark.high": "90%", // 高水位
"cluster.routing.allocation.disk.watermark.flood_stage": "95%" // 洪水位
}
}
水位触发行为:
- 低水位(85%):不分配新分片到该节点
- 高水位(90%):尝试将分片迁移到其他节点
- 洪水位(95%):索引变为只读
只读恢复:
PUT /products/_settings
{
"index.blocks.read_only_allow_delete": null
}
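日常巡检时可以用下面的 Python 草图对照各数据节点的磁盘使用率与三档水位(阈值按上文默认值 85%/90%/95%,假设集群地址为 http://localhost:9200 且未开启认证):
import requests

ES = "http://localhost:9200"   # 假设无认证

# 查看各数据节点磁盘使用率,对照低/高/洪水位
rows = requests.get(
    f"{ES}/_cat/allocation",
    params={"format": "json", "h": "node,disk.percent,disk.used,disk.avail,shards"},
    timeout=10,
).json()

for r in rows:
    pct = r.get("disk.percent")
    if pct is None:        # 没有磁盘信息的行(如 UNASSIGNED)跳过
        continue
    pct = float(pct)
    level = "ok"
    if pct >= 95:
        level = "flood_stage(索引将变只读)"
    elif pct >= 90:
        level = "high(开始迁移分片)"
    elif pct >= 85:
        level = "low(不再分配新分片)"
    print(f"{r['node']:<20} disk={pct:5.1f}%  shards={r['shards']}  -> {level}")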
6.3 备份恢复
配置Repository:
PUT /_snapshot/my_backup
{
"type": "fs",
"settings": {
"location": "/mount/backups/my_backup",
"compress": true
}
}
创建快照:
PUT /_snapshot/my_backup/snapshot_1
{
"indices": "products,orders",
"ignore_unavailable": true,
"include_global_state": false
}
恢复快照:
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "products",
"rename_pattern": "(.+)",
"rename_replacement": "restored_$1"
}
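快照也可以脚本化执行并等待完成。下面是一个 Python 草图(仓库名沿用上文 my_backup,快照名 snapshot_2 为示例,假设未开启认证):
import requests

ES = "http://localhost:9200"
REPO, SNAPSHOT = "my_backup", "snapshot_2"

# 创建快照并同步等待完成(小索引适用;大索引建议异步创建后再轮询状态)
resp = requests.put(
    f"{ES}/_snapshot/{REPO}/{SNAPSHOT}",
    params={"wait_for_completion": "true"},
    json={"indices": "products,orders", "include_global_state": False},
    timeout=3600,
)
resp.raise_for_status()
snap = resp.json()["snapshot"]
print(snap["state"], snap["shards"])    # 期望 state 为 SUCCESS,failed 分片数为 0

# 随时查看仓库内全部快照
print(requests.get(f"{ES}/_snapshot/{REPO}/_all", timeout=30).json())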
6.4 跨集群搜索
// 配置远程集群
PUT /_cluster/settings
{
"persistent": {
"cluster.remote.cluster_two.seeds": [
"10.0.1.1:9300",
"10.0.1.2:9300"
]
}
}
// 跨集群搜索
GET /local_index,cluster_two:remote_index/_search
{
"query": {
"match": { "title": "elasticsearch" }
}
}