multi match query
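The multi_match query builds on the match query. Its main feature is that it lets you run a single query against several fields at once.
Simple example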
POST _bulk
{ "index" : { "_index" : "book"} }
{"title":"programming language","content":"java go php python"}
{ "index" : { "_index" : "book"} }
{"title":"java","content":"Java is a popular programming language"}
{ "index" : { "_index" : "book"} }
{"title":"go","content":"Go is a young programming language"}
The request above builds the test data. Next, we search for documents whose title or content contains java:
GET book/_search
{
"query": {
"multi_match": {
"query": "java",
"fields": ["title","content"]
}
}
}
The result is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0925692,
"hits" : [
{
"_index" : "book",
"_type" : "_doc",
"_id" : "kuCAzIQBRNx5Pfd58Qd_",
"_score" : 1.0925692,
"_source" : {
"title" : "java",
"content" : "Java is a popular programming language"
}
},
{
"_index" : "book",
"_type" : "_doc",
"_id" : "keCAzIQBRNx5Pfd58Qd_",
"_score" : 0.52354836,
"_source" : {
"title" : "programming language",
"content" : "java go php python"
}
}
]
}
}
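As you can see, any document that contains the keyword java in either title or content is returned.
When there are many fields, fields also supports the * wildcard. For example, *_name includes every field whose name ends in _name.
Fields also support the ^ notation, which boosts the weight of a field. For example, title^2 doubles the weight of the title field.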
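As a rough illustration of the two notations, the request body can be assembled in Python and printed for pasting into the Kibana console. The helper name `multi_match_body` is hypothetical (it is not part of any Elasticsearch client), and the sketch only builds the JSON; it does not send it.

```python
import json


def multi_match_body(query, fields, **options):
    """Build a multi_match request body (hypothetical helper, not an Elasticsearch API)."""
    multi_match = {"query": query, "fields": list(fields)}
    multi_match.update(options)  # e.g. type="most_fields"
    return {"query": {"multi_match": multi_match}}


# title^2 doubles the weight of title relative to content.
boosted = multi_match_body("java", ["title^2", "content"])

# *_name stands for every field whose name ends in _name.
wildcard = multi_match_body("zhang san", ["*_name"])

print(json.dumps(boosted, indent=2))
print(json.dumps(wildcard, indent=2))
```

The printed JSON is the body that would follow a line such as GET book/_search.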
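Types
How the multi match query is executed internally depends on the type parameter.
best_fields
This is the default. It generates a match query for each field listed in fields and wraps them in a dis_max query. Put simply, the final relevance score is the highest score among the matching fields.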
GET book/_search
{
"query": {
"multi_match": {
"query": "java",
"fields": ["title","content"]
}
}
}
The result is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0925692,
"hits" : [
{
"_index" : "book",
"_type" : "_doc",
"_id" : "kuCAzIQBRNx5Pfd58Qd_",
"_score" : 1.0925692,
"_source" : {
"title" : "java",
"content" : "Java is a popular programming language"
}
},
{
"_index" : "book",
"_type" : "_doc",
"_id" : "keCAzIQBRNx5Pfd58Qd_",
"_score" : 0.52354836,
"_source" : {
"title" : "programming language",
"content" : "java go php python"
}
}
]
}
}
Now replace the query above with the following two match queries, one per field:
GET book/_search
{
"query": {
"match": {
"title": "java"
}
}
}
GET book/_search
{
"query": {
"match": {
"content": "java"
}
}
}
The results of the two queries are as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0925692,
"hits" : [
{
"_index" : "book",
"_type" : "_doc",
"_id" : "kuCAzIQBRNx5Pfd58Qd_",
"_score" : 1.0925692,
"_source" : {
"title" : "java",
"content" : "Java is a popular programming language"
}
}
]
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.52354836,
"hits" : [
{
"_index" : "book",
"_type" : "_doc",
"_id" : "keCAzIQBRNx5Pfd58Qd_",
"_score" : 0.52354836,
"_source" : {
"title" : "programming language",
"content" : "java go php python"
}
},
{
"_index" : "book",
"_type" : "_doc",
"_id" : "kuCAzIQBRNx5Pfd58Qd_",
"_score" : 0.4471386,
"_source" : {
"title" : "java",
"content" : "Java is a popular programming language"
}
}
]
}
}
Comparing these with the earlier multi_match query shows that the _score in the multi_match result is the highest of the per-field scores.
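The comparison can be checked with a few lines of Python. This is only a sketch of how best_fields combines scores, not Elasticsearch's implementation; the per-field scores are copied from the two match results above, and the name `best_fields_score` is made up for the illustration.

```python
def best_fields_score(field_scores):
    """Return the highest per-field score, as best_fields does without tie_breaker."""
    return max(field_scores.values())


# Per-field _score values copied from the two match queries above.
# A field that did not match is simply left out of the dict.
doc_java = {"title": 1.0925692, "content": 0.4471386}
doc_programming = {"content": 0.52354836}

print(best_fields_score(doc_java))         # the title score wins
print(best_fields_score(doc_programming))  # only content matched
```

Both values agree with the _score reported for each document by the multi_match query.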
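most_fields
When indexing user information, identifying a user usually takes several fields, for example first_name and last_name. Handling this case with best_fields does return results, but they do not really match what we need.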
POST _bulk
{ "index" : { "_index" : "user"} }
{"first_name":"zhang","last_name":"san"}
{ "index" : { "_index" : "user"} }
{"first_name":"zhang","last_name":"fei"}
{ "index" : { "_index" : "user"} }
{"first_name":"li","last_name":"san"}
With the test data above, if I want to find zhang san, the multi_match query looks like this:
GET user/_search
{
"query": {
"multi_match": {
"query": "zhang san",
"fields": ["*_name"]
}
}
}
The query result is as follows:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.47000363,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4ODEzIQBRNx5Pfd5XwmZ",
"_score" : 0.47000363,
"_source" : {
"first_name" : "zhang",
"last_name" : "san"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4eDEzIQBRNx5Pfd5XwmZ",
"_score" : 0.47000363,
"_source" : {
"first_name" : "zhang",
"last_name" : "fei"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4uDEzIQBRNx5Pfd5XwmZ",
"_score" : 0.47000363,
"_source" : {
"first_name" : "li",
"last_name" : "san"
}
}
]
}
}
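The results do include zhang san, but there is a problem: every hit has the same relevance score, even though document 4ODEzIQBRNx5Pfd5XwmZ is clearly the best match. This happens because, by default, _score is the highest per-field score. So we change type to most_fields and modify the query as follows: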
GET user/_search
{
"query": {
"multi_match": {
"query": "zhang san",
"fields": ["*_name"],
"type": "most_fields"
}
}
}
The query result is as follows:
{
"took" : 17,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.94000727,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4ODEzIQBRNx5Pfd5XwmZ",
"_score" : 0.94000727,
"_source" : {
"first_name" : "zhang",
"last_name" : "san"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4eDEzIQBRNx5Pfd5XwmZ",
"_score" : 0.47000363,
"_source" : {
"first_name" : "zhang",
"last_name" : "fei"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4uDEzIQBRNx5Pfd5XwmZ",
"_score" : 0.47000363,
"_source" : {
"first_name" : "li",
"last_name" : "san"
}
}
]
}
}
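This time document 4ODEzIQBRNx5Pfd5XwmZ scores clearly higher than the other two. most_fields does not take the highest score; it combines the scores from all matching fields, which suits this scenario much better.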
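A small sketch of the difference between the two types on this data. It assumes that each matching field of 4ODEzIQBRNx5Pfd5XwmZ contributes 0.47000363, which is the score the other two documents receive from their single matching field; the function names are hypothetical and this is not Elasticsearch's code.

```python
def best_fields_score(field_scores):
    """Keep only the highest per-field score."""
    return max(field_scores)


def most_fields_score(field_scores):
    """Add the scores of every matching field."""
    return sum(field_scores)


zhang_san = [0.47000363, 0.47000363]  # first_name:zhang, last_name:san (assumed)
zhang_fei = [0.47000363]              # only first_name matches

# best_fields cannot tell the two documents apart.
print(best_fields_score(zhang_san), best_fields_score(zhang_fei))

# most_fields rewards the document that matches in both fields.
print(most_fields_score(zhang_san), most_fields_score(zhang_fei))
```

The summed value is close to the 0.94000727 reported above; the small difference in the last digit comes from floating-point rounding.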
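cross_fields
Treats fields that use the same analyzer as though they were one big field. In the most_fields section we searched for zhang san in both first_name and last_name and added the scores together to compute _score. That sounds perfect, but it still has problems. Parameters such as operator and minimum_should_match are applied to each field separately, which is probably not what you want.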
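Another problem is that a search term's frequency differs from field to field, which can skew the scores. For example, when searching for zhang san, the document zhang san may well score lower than san mao. Suppose that in the first_name field zhang scores 0.2 and san scores 0.6, while in last_name san scores 0.1. In that case zhang san (0.2 + 0.1 = 0.3) scores lower than san mao (0.6).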
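Plugging the illustrative numbers into a per-field sum makes the problem concrete. The nested dict layout and the function `summed_score` are assumptions made for this sketch; real scores depend on the index statistics.

```python
# Illustrative per-term scores from the text, keyed by field, then by term.
term_scores = {
    "first_name": {"zhang": 0.2, "san": 0.6},
    "last_name": {"san": 0.1},
}


def summed_score(doc, query_terms):
    """Add up the score of every query term that a field's value matches."""
    total = 0.0
    for field, value in doc.items():
        if value in query_terms:
            total += term_scores.get(field, {}).get(value, 0.0)
    return total


query_terms = ["zhang", "san"]
zhang_san = {"first_name": "zhang", "last_name": "san"}
san_mao = {"first_name": "san", "last_name": "mao"}

print(summed_score(zhang_san, query_terms))  # about 0.3
print(summed_score(san_mao, query_terms))    # 0.6
```

The rare term san in first_name outweighs both matches of the document that actually is zhang san.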
To avoid this, we could create a full_name field that combines first_name and last_name, and then search full_name. That approach requires changing the index, however. A simpler option is to set type to cross_fields in multi_match.
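As a sketch, switching the type only changes one key in the request body. The body is built in Python here so that it can be printed and pasted into the console. Setting operator to and is an optional illustration that does not appear in the original queries; with cross_fields it requires every term to appear in at least one of the fields.

```python
import json

body = {
    "query": {
        "multi_match": {
            "query": "zhang san",
            "fields": ["first_name", "last_name"],
            "type": "cross_fields",
            # Optional: every term must appear in at least one of the fields.
            "operator": "and",
        }
    }
}

print(json.dumps(body, indent=2))
```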
POST _bulk
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"san"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"fei"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"si"}
{ "index" : { "_index" : "user1"} }
{"first_name":"san","last_name":"si"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"si"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"wu"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"wu"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"wu"}
{ "index" : { "_index" : "user1"} }
{"first_name":"zhang","last_name":"wu"}
{ "index" : { "_index" : "user1"} }
{"first_name":"li","last_name":"san"}
{ "index" : { "_index" : "user1"} }
{"first_name":"li","last_name":"san"}
The test data above is built to reproduce this situation. First, query it with most_fields:
GET user1/_search
{
"query": {
"multi_match": {
"query": "zhang san",
"fields": ["*_name"],
"type": "most_fields"
}
}
}
The response is as follows:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 11,
"relation" : "eq"
},
"max_score" : 2.0794415,
"hits" : [
{
"_index" : "user1",
"_type" : "_doc",
"_id" : "PUnDzYQBeeHniJKOJzMs",
"_score" : 2.0794415,
"_source" : {
"first_name" : "san",
"last_name" : "si"
}
},
{
"_index" : "user1",
"_type" : "_doc",
"_id" : "OknDzYQBeeHniJKOJzMs",
"_score" : 1.5769842,
"_source" : {
"first_name" : "zhang",
"last_name" : "san"
}
},
{
"_index" : "user1",
"_type" : "_doc",
"_id" : "Q0nDzYQBeeHniJKOJzMs",
"_score" : 1.2321436,
"_source" : {
"first_name" : "li",
"last_name" : "san"
}
}
]
}
}
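The query actually returns more hits; to save space, only a representative subset is shown here. The results show san si scoring higher than zhang san. zhang appears so often in first_name that it earns a low score, whereas san appears only once in first_name and therefore scores very high. san also appears fairly often in last_name, so in the end the total score of zhang san is lower than that of san si.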
After changing type to cross_fields and running the query again, the result is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 11,
"relation" : "eq"
},
"max_score" : 1.5769842,
"hits" : [
{
"_index" : "user1",
"_type" : "_doc",
"_id" : "OknDzYQBeeHniJKOJzMs",
"_score" : 1.5769842,
"_source" : {
"first_name" : "zhang",
"last_name" : "san"
}
},
{
"_index" : "user1",
"_type" : "_doc",
"_id" : "Q0nDzYQBeeHniJKOJzMs",
"_score" : 1.2321436,
"_source" : {
"first_name" : "li",
"last_name" : "san"
}
},
{
"_index" : "user1",
"_type" : "_doc",
"_id" : "REnDzYQBeeHniJKOJzMs",
"_score" : 1.2321436,
"_source" : {
"first_name" : "li",
"last_name" : "san"
}
}
]
}
}
This time zhang san has the highest relevance score.
phrase and phrase_prefix
These two types are handled the same way as best_fields; the difference is the query that runs on each field. best_fields uses match for each field, whereas phrase uses match_phrase and phrase_prefix uses match_phrase_prefix.
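The relationship can be sketched by writing out the dis_max query that each type roughly corresponds to. The helper `as_dis_max` and the lookup table are hypothetical names for this illustration, field boosts written with ^ are not handled, and the field names come from the book index used earlier.

```python
import json

# Which per-field query each multi_match type uses.
PER_FIELD_QUERY = {
    "best_fields": "match",
    "phrase": "match_phrase",
    "phrase_prefix": "match_phrase_prefix",
}


def as_dis_max(query, fields, match_type="best_fields"):
    """Expand a multi_match of the given type into a comparable dis_max body."""
    kind = PER_FIELD_QUERY[match_type]
    return {
        "query": {
            "dis_max": {
                "queries": [{kind: {field: query}} for field in fields]
            }
        }
    }


expanded = as_dis_max("programming language", ["title", "content"], "phrase")
print(json.dumps(expanded, indent=2))
```

Only the per-field query name changes between the three types; the dis_max wrapper, and therefore the way the scores are combined, stays the same.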
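tie_breaker
By default, that is, when type is best_fields, _score is the highest per-field score and the scores of the other fields are not counted. This parameter accepts a number from 0 to 1 so that the other fields also contribute to the final score. For example, with a value of 0.5 the query takes the highest field score and adds the sum of the other fields' scores multiplied by 0.5; the result is the document's score.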
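The formula described above can be written as a short Python sketch. The function name `dis_max_score` is hypothetical, and the per-field scores are those of the document titled java from the best_fields section; this mirrors the description, not Elasticsearch's source code.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Highest field score plus tie_breaker times the sum of the other scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


# title and content scores of the document titled "java".
scores = [1.0925692, 0.4471386]

print(dis_max_score(scores))                   # default: only the best field counts
print(dis_max_score(scores, tie_breaker=0.5))  # the other field adds half its score
print(dis_max_score(scores, tie_breaker=1.0))  # the same as adding all fields
```

In a request, the parameter sits next to query and fields inside multi_match, for example "tie_breaker": 0.5.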
Originally published on the WeChat official account 一只菜鸟程序员: ES全文检索-multi match query
Compiled by 极客之音. Article link: https://www.bmabk.com/index.php/post/72820.html