相关性算分
一、TF-IDF算法
1.1 什么是TF-IDF
TF-IDF(Term Frequency-Inverse Document Frequency)是传统的文本相关性算法,Elasticsearch 5.x之前的默认算分算法。
核心思想:
- TF(词频):词在文档中出现次数越多,相关性越高
- IDF(逆文档频率):词在所有文档中出现次数越少,区分度越高
1.2 TF-IDF计算公式
完整公式:
score(q,d) = coord(q,d) × queryNorm(q) × ∑(tf(t in d) × idf(t)² × t.getBoost() × norm(t,d))
其中:
- q: 查询语句
- d: 文档
- t: 词项(term)
各部分详解:
1. TF(词频):
tf(t in d) = √frequency
frequency: 词t在文档d中出现的次数
示例:
"java" 在文档中出现4次
tf = √4 = 2
2. IDF(逆文档频率):
idf(t) = 1 + log(numDocs / (docFreq + 1))
numDocs: 索引中总文档数
docFreq: 包含词t的文档数
示例:
索引有10000个文档,其中100个包含"elasticsearch"
idf = 1 + log(10000 / 101) ≈ 5.6
3. Norm(字段长度归一化):
norm(d) = 1 / √numTerms
numTerms: 文档字段的词项数
示例:
文档有16个词
norm = 1 / √16 = 0.25
1.3 TF-IDF实战示例
场景:搜索"Elasticsearch Java"
// 创建索引和文档
PUT /articles
{
"settings": {
"similarity": {
"my_tfidf": {
"type": "classic" // 使用TF-IDF算法
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"similarity": "my_tfidf"
}
}
}
}
// 插入测试文档
PUT /articles/_doc/1
{
"title": "Elasticsearch is a distributed search engine"
}
PUT /articles/_doc/2
{
"title": "Java is great for building Elasticsearch clients"
}
PUT /articles/_doc/3
{
"title": "Elasticsearch Java API tutorial"
}
// 查询并查看算分
GET /articles/_search
{
"query": {
"match": {
"title": "Elasticsearch Java"
}
},
"explain": true // 查看详细算分过程
}
算分结果分析:
{
"hits": [
{
"_id": "3",
"_score": 1.89,
"_explanation": {
"value": 1.89,
"description": "sum of:",
"details": [
{
"value": 0.92,
"description": "weight(title:elasticsearch in 0)",
"details": [
{
"value": 0.46,
"description": "tf(freq=1.0) = √1 = 1.0"
},
{
"value": 2.0,
"description": "idf = 1 + log(3/2) = 2.0"
}
]
},
{
"value": 0.97,
"description": "weight(title:java in 0)"
}
]
}
}
]
}
1.4 TF-IDF的局限性
问题1:词频饱和
- 一个词出现5次和50次,相关性差异不大
- TF使用√频率,增长过慢
问题2:文档长度惩罚不足
- 长文档更容易包含查询词
- 简单的norm归一化不够精确
问题3:不考虑词的位置
- "Java Elasticsearch"和"Elasticsearch Java"得分相同
- 无法区分词的重要性位置
二、BM25算法
2.1 BM25概述
BM25(Best Matching 25)是Elasticsearch 5.x之后的默认算分算法,改进了TF-IDF的不足。
核心改进:
- 非线性TF增长:词频饱和控制
- 文档长度归一化:更精确的长度惩罚
- 可调参数:k1和b控制算分特性
2.2 BM25计算公式
完整公式:
score(D,Q) = ∑ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D| / avgdl))
参数说明:
- D: 文档
- Q: 查询
- qi: 查询中的第i个词
- f(qi,D): qi在文档D中的词频
- |D|: 文档D的长度(词数)
- avgdl: 所有文档的平均长度
- k1: 词频饱和参数(默认1.2)
- b: 长度归一化参数(默认0.75)
IDF计算:
IDF(qi) = log(1 + (N - n(qi) + 0.5) / (n(qi) + 0.5))
N: 总文档数
n(qi): 包含qi的文档数
2.3 BM25参数调优
k1参数(控制词频饱和度):
k1范围: [1.2 - 2.0]
k1 = 0: 忽略词频,只考虑IDF
k1 → ∞: 无饱和限制,类似TF-IDF
k1 = 1.2(默认): 适度饱和
效果:
- 小k1: 词频增长快速饱和,适合短文档
- 大k1: 词频影响更大,适合长文档
b参数(控制长度归一化):
b范围: [0 - 1]
b = 0: 不考虑文档长度
b = 1: 完全归一化
b = 0.75(默认): 部分归一化
效果:
- b接近0: 长文档不被惩罚,适合长度差异小的场景
- b接近1: 长文档被严格惩罚,适合长度差异大的场景
2.4 BM25实战配置
// 场景1:电商商品搜索(标题长度相似)
PUT /products
{
"settings": {
"index": {
"similarity": {
"my_bm25": {
"type": "BM25",
"k1": 1.2, // 标准词频饱和
"b": 0.5 // 降低长度惩罚(标题长度相似)
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"similarity": "my_bm25"
}
}
}
}
// 场景2:文章搜索(长度差异大)
PUT /articles
{
"settings": {
"index": {
"similarity": {
"article_bm25": {
"type": "BM25",
"k1": 2.0, // 增加词频影响
"b": 0.9 // 严格长度归一化
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"similarity": "article_bm25"
}
}
}
}
// 场景3:日志搜索(精确匹配)
PUT /logs
{
"settings": {
"index": {
"similarity": {
"log_bm25": {
"type": "BM25",
"k1": 0, // 不考虑词频
"b": 0 // 不考虑长度
}
}
}
},
"mappings": {
"properties": {
"message": {
"type": "text",
"similarity": "log_bm25"
}
}
}
}
2.5 BM25 vs TF-IDF对比
// 对比测试
PUT /compare_test/_doc/1
{
"text": "java" // 短文档,词出现1次
}
PUT /compare_test/_doc/2
{
"text": "java java java java java java java java java java" // 词出现10次
}
PUT /compare_test/_doc/3
{
"text": "java " + "filler " * 100 // 长文档,词出现1次
}
// TF-IDF结果:
// Doc1: 1.0
// Doc2: 3.16 (√10)
// Doc3: 0.1 (长度惩罚)
// BM25结果:
// Doc1: 1.0
// Doc2: 1.8 (饱和效果)
// Doc3: 0.6 (更温和的长度惩罚)
三、Function Score查询
3.1 Function Score概述
Function Score允许自定义算分逻辑,结合业务需求调整相关性得分。
使用场景:
- 根据销量、评分、价格等业务字段调整排序
- 时间衰减(新文档排前面)
- 地理位置距离衰减
- 随机打分(推荐系统)
3.2 Function Score语法
基本结构:
GET /products/_search
{
"query": {
"function_score": {
"query": { "match": { "title": "手机" } }, // 原始查询
"functions": [
// 自定义算分函数
],
"score_mode": "sum", // 多个函数的组合方式
"boost_mode": "multiply", // 与原始分数的组合方式
"max_boost": 10, // 最大boost值
"min_score": 1 // 最小分数阈值
}
}
}
score_mode(函数分数组合):
multiply: 相乘(默认)sum: 相加avg: 平均max: 取最大min: 取最小first: 第一个匹配的函数
boost_mode(与原始分数组合):
multiply: new_score = query_score × function_scoresum: new_score = query_score + function_scorereplace: new_score = function_scoreavg: new_score = (query_score + function_score) / 2
3.3 常用算分函数
1. weight(固定权重):
{
"function_score": {
"query": { "match": { "title": "手机" } },
"functions": [
{
"filter": { "term": { "brand": "Apple" } },
"weight": 2 // Apple品牌文档得分×2
},
{
"filter": { "range": { "price": { "gte": 5000 } } },
"weight": 1.5 // 高端机得分×1.5
}
],
"score_mode": "multiply",
"boost_mode": "multiply"
}
}
2. field_value_factor(字段值影响分数):
{
"function_score": {
"query": { "match": { "title": "手机" } },
"functions": [
{
"field_value_factor": {
"field": "sales", // 销量字段
"factor": 0.1, // 缩放因子
"modifier": "log1p", // 函数:log(1 + sales)
"missing": 1 // 缺失值默认为1
}
}
],
"boost_mode": "sum"
}
}
modifier选项:
none: sales × factorlog: log(sales) × factorlog1p: log(1 + sales) × factor(推荐)log2p: log(2 + sales) × factorln: ln(sales) × factorln1p: ln(1 + sales) × factorln2p: ln(2 + sales) × factorsquare: sales² × factorsqrt: √sales × factorreciprocal: 1/sales × factor
3. decay函数(衰减函数):
时间衰减:
{
"function_score": {
"query": { "match": { "title": "新闻" } },
"functions": [
{
"gauss": {
"publish_date": {
"origin": "2024-01-01", // 最优时间点
"scale": "30d", // 衰减到0.5的距离
"offset": "7d", // 不衰减的范围
"decay": 0.5 // scale距离处的衰减系数
}
}
}
]
}
}
地理位置衰减:
{
"function_score": {
"query": { "match": { "name": "餐厅" } },
"functions": [
{
"exp": {
"location": {
"origin": "31.23,121.47", // 用户位置
"scale": "2km", // 衰减到0.5的距离
"offset": "500m", // 500米内不衰减
"decay": 0.33 // 2km处衰减到0.33
}
}
}
]
}
}
价格衰减:
{
"function_score": {
"query": { "match": { "title": "笔记本" } },
"functions": [
{
"linear": {
"price": {
"origin": "5000", // 理想价格
"scale": "2000", // 衰减到0.5的距离
"decay": 0.5
}
}
}
]
}
}
衰减函数对比:
linear: 线性衰减,简单直观exp: 指数衰减,衰减快gauss: 高斯衰减,平滑自然(推荐)
4. random_score(随机打分):
{
"function_score": {
"query": { "match_all": {} },
"functions": [
{
"random_score": {
"seed": 12345, // 随机种子(相同seed结果一致)
"field": "_seq_no" // 基于字段生成随机数
}
}
]
}
}
5. script_score(脚本打分):
{
"function_score": {
"query": { "match": { "title": "手机" } },
"functions": [
{
"script_score": {
"script": {
"source": "Math.log(2 + doc['sales'].value) * params.factor",
"params": {
"factor": 1.2
}
}
}
}
]
}
}
3.4 综合案例:电商搜索排序
需求:
- 文本相关性匹配
- 销量高的排前面
- 评分高的排前面
- 新品优先
- 品牌加权
GET /products/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "苹果手机",
"fields": ["title^3", "description"]
}
},
"functions": [
{
"filter": { "term": { "brand": "Apple" } },
"weight": 1.5
},
{
"field_value_factor": {
"field": "sales",
"factor": 0.01,
"modifier": "log1p"
}
},
{
"field_value_factor": {
"field": "rating",
"factor": 1.0,
"modifier": "sqrt"
}
},
{
"gauss": {
"created_at": {
"origin": "now",
"scale": "90d",
"offset": "7d",
"decay": 0.5
}
}
}
],
"score_mode": "sum",
"boost_mode": "multiply",
"min_score": 5
}
},
"size": 20
}
四、Rescore重打分
4.1 Rescore原理
Rescore是一种二阶段检索策略:
- 第一阶段:使用简单查询快速召回TopN文档
- 第二阶段:对TopN文档使用复杂算法精确打分
优势:
- 平衡性能和精度
- 复杂算分只作用于少量文档
- 适合深度学习排序模型
4.2 Rescore语法
GET /products/_search
{
"query": {
"match": { "title": "手机" } // 第一阶段:快速召回
},
"rescore": {
"window_size": 50, // 重打分窗口(前50个文档)
"query": {
"rescore_query": {
"function_score": { // 第二阶段:精确打分
"query": {
"match_phrase": {
"title": {
"query": "苹果手机",
"slop": 2
}
}
},
"functions": [
{
"field_value_factor": {
"field": "sales",
"modifier": "log1p"
}
}
]
}
},
"query_weight": 0.7, // 原始查询权重
"rescore_query_weight": 1.3 // 重打分查询权重
}
}
}
分数计算:
final_score = query_score × query_weight + rescore_score × rescore_query_weight
4.3 多级Rescore
{
"query": { "match": { "title": "手机" } },
"rescore": [
{
"window_size": 100,
"query": {
"rescore_query": {
"match_phrase": { "title": "苹果手机" }
},
"query_weight": 0.7,
"rescore_query_weight": 1.2
}
},
{
"window_size": 50, // 第二级窗口必须≤第一级
"query": {
"rescore_query": {
"function_score": {
"field_value_factor": {
"field": "sales",
"modifier": "log1p"
}
}
},
"query_weight": 0.8,
"rescore_query_weight": 1.5
}
}
]
}
4.4 Rescore最佳实践
window_size选择:
场景1:PC端搜索(每页20条)
└─ window_size: 100-200
场景2:移动端搜索(每页10条)
└─ window_size: 50-100
场景3:推荐系统(取前10)
└─ window_size: 30-50
原则:
- window_size = page_size × (2-5)
- 太小:精度不够
- 太大:性能下降
适用场景:
- 短语查询(match_phrase)
- 复杂Function Score
- 机器学习排序模型
- 个性化推荐
五、高频面试题
BM25相比TF-IDF的优势是什么?
答案:
词频饱和控制:
- TF-IDF:词出现10次和100次差异很大
- BM25:通过k1参数控制饱和,词频增长有上限
更精确的长度归一化:
- TF-IDF:简单的1/√length
- BM25:通过b参数可调节归一化程度
可调参数:
- k1控制词频影响
- b控制长度惩罚
- 可针对不同场景优化
代码对比:
// TF-IDF:词频线性增长
score ∝ √frequency
// BM25:词频饱和增长
score ∝ frequency / (frequency + k1)
如何实现"销量高的商品排前面"?
方案1:field_value_factor:
{
"function_score": {
"query": { "match": { "title": "手机" } },
"field_value_factor": {
"field": "sales",
"factor": 0.1,
"modifier": "log1p"
},
"boost_mode": "sum"
}
}
方案2:script_score:
{
"function_score": {
"query": { "match": { "title": "手机" } },
"script_score": {
"script": "_score * Math.log(2 + doc['sales'].value)"
}
}
}
方案3:sort排序:
{
"query": { "match": { "title": "手机" } },
"sort": [
{ "_score": "desc" },
{ "sales": "desc" }
]
}
Function Score的score_mode和boost_mode区别?
score_mode:多个function之间如何组合
functions: [
{ weight: 2 }, // func1
{ weight: 3 } // func2
]
multiply: 2 × 3 = 6
sum: 2 + 3 = 5
avg: (2 + 3) / 2 = 2.5
max: max(2, 3) = 3
boost_mode:function结果与原始查询分数如何组合
query_score = 10
function_score = 5
multiply: 10 × 5 = 50
sum: 10 + 5 = 15
replace: 5
avg: (10 + 5) / 2 = 7.5
什么时候用Rescore?
适用场景:
复杂算分逻辑:
- 短语查询(match_phrase)
- 多字段Function Score
- 机器学习模型
深分页问题:
- 只对TopN重打分
- 避免全量复杂计算
个性化推荐:
- 第一阶段:通用召回
- 第二阶段:个性化排序
示例:
// 第一阶段:快速召回1000个
// 第二阶段:对前100个精确打分
{
"query": { "match": { "title": "手机" } },
"size": 20,
"rescore": {
"window_size": 100,
"query": {
"rescore_query": {
"function_score": {
// 复杂的个性化算分逻辑
}
}
}
}
}
如何调试算分结果?
方法1:explain参数:
GET /products/_search
{
"query": { "match": { "title": "手机" } },
"explain": true
}
// 返回详细算分过程
{
"_explanation": {
"value": 3.14,
"description": "sum of:",
"details": [
{
"value": 2.5,
"description": "weight(title:手机)",
"details": [
{ "value": 1.0, "description": "tf" },
{ "value": 2.5, "description": "idf" }
]
}
]
}
}
方法2:explain API:
GET /products/_explain/1
{
"query": { "match": { "title": "手机" } }
}
方法3:profile API(性能分析):
GET /products/_search
{
"profile": true,
"query": { "match": { "title": "手机" } }
}
六、实战技巧
6.1 自定义相似度算法
PUT /articles
{
"settings": {
"similarity": {
"my_similarity": {
"type": "scripted",
"script": {
"source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"similarity": "my_similarity"
}
}
}
}
6.2 负向权重(降低某些文档排序)
{
"function_score": {
"query": { "match": { "title": "手机" } },
"functions": [
{
"filter": { "term": { "is_sold_out": true } },
"weight": 0.1 // 缺货商品降权
}
],
"boost_mode": "multiply"
}
}
6.3 组合多种衰减函数
{
"function_score": {
"query": { "match": { "title": "餐厅" } },
"functions": [
{
"gauss": {
"location": {
"origin": "31.23,121.47",
"scale": "2km"
}
}
},
{
"linear": {
"price": {
"origin": "100",
"scale": "50"
}
}
},
{
"exp": {
"created_at": {
"origin": "now",
"scale": "30d"
}
}
}
],
"score_mode": "multiply"
}
}