Elasticsearch Aggregations

Elasticsearch can infer field types automatically, so this step would normally be unnecessary. However, while testing on 5.3 I found that without fielddata=true, the later aggregations fail with:

Fielddata is disabled on text fields by default. Set fielddata=true

So we set fielddata=true on the two text fields color and make. This must happen before inserting data, much like creating a table schema up front:

PUT cars
{
"mappings": {
"transactions": {
"properties": {
"color": {
"type": "text",
"fielddata": true
},
"make": {
"type": "text",
"fielddata": true
}
}
}
}
}

Bulk-insert the data:

Note:
The last document must be followed by a newline character, otherwise it will not be inserted.

POST cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
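The terms aggregation used below is essentially a group-by-value count. As a mental model (not how Elasticsearch executes it internally), the bucketing over the eight documents above can be sketched in plain Python:

```python
from collections import Counter

# The eight car documents bulk-indexed above (only color matters here).
cars = [
    {"price": 10000, "color": "red",   "make": "honda"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue",  "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 80000, "color": "red",   "make": "bmw"},
    {"price": 25000, "color": "blue",  "make": "ford"},
]

# One bucket per distinct color value, sorted by doc_count descending
# (the default bucket order of a terms aggregation).
buckets = Counter(doc["color"] for doc in cars).most_common()
print(buckets)  # [('red', 4), ('green', 2), ('blue', 2)]
```

For tied counts this sketch keeps first-seen order, whereas Elasticsearch breaks ties by term, which is why its response lists blue before green.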

scope. By default, aggregations operate in the same scope as the query. Put another way, aggregations are calculated on the set of documents that match your query.

Before aggregating, we need a data source: the query-filtered result at the current level. That query in turn draws its input from the output of the enclosing level.
If there is no query at the current level, the enclosing level's output is passed through untouched.
If there is a query, the query always runs first and the aggs runs on its result, no matter whether the aggs clause is written before or after the query.

Bucket the cars by color and look at sales per color

GET cars/_search
{
"size": 0, //there is no query clause here, so all cars documents are in scope; size 0 means we do not want the query hits in the output
"aggs": {
"popular_colors": {
"terms": {
"field": "color" //each distinct value of the color field gets its own bucket
}
}
}
}

Response:

{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"popular_colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "red",
"doc_count": 4
},
{
"key": "blue",
"doc_count": 2
},
{
"key": "green",
"doc_count": 2
}
]
}
}
}

Bucket the cars by color and look at the average price per color

GET cars/_search
{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "color"
},
//the documents that popular_colors puts in each color bucket are the input of this aggs; whether terms.field=color is written before or after the aggs makes no difference
"aggs": {
"popular_colors_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}

Response:

{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"popular_colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "red",
"doc_count": 4,
"popular_colors_avg_price": {
"value": 32500
}
},
{
"key": "blue",
"doc_count": 2,
"popular_colors_avg_price": {
"value": 20000
}
},
{
"key": "green",
"doc_count": 2,
"popular_colors_avg_price": {
"value": 21000
}
}
]
}
}
}
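The avg sub-aggregation simply averages price within each color bucket. The arithmetic behind the response above can be checked in a few lines of plain Python over the eight indexed documents:

```python
from collections import defaultdict

# (price, color) pairs from the eight documents indexed earlier
cars = [
    (10000, "red"), (20000, "red"), (30000, "green"), (15000, "blue"),
    (12000, "green"), (20000, "red"), (80000, "red"), (25000, "blue"),
]

prices_by_color = defaultdict(list)
for price, color in cars:
    prices_by_color[color].append(price)

# one avg metric per terms bucket, as in popular_colors_avg_price
avg_price = {c: sum(p) / len(p) for c, p in prices_by_color.items()}
print(avg_price)  # {'red': 32500.0, 'green': 21000.0, 'blue': 20000.0}
```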

Nesting aggregations

Bucket the cars by color, look at each color's average price, and also the manufacturer distribution within each color

GET cars/_search
{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "color"
},
"aggs": {
"popular_colors_avg_price": {
"avg": {
"field": "price"
}
},
// the data source here is the parent bucket, i.e. the documents grouped by popular_colors; so this splits each color bucket by manufacturer
"who_maked": {
"terms": {
"field": "make"
}
}
}
}
}
}

Response:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"popular_colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "red",
"doc_count": 4,
"who_maked": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "honda",
"doc_count": 3
},
{
"key": "bmw",
"doc_count": 1
}
]
},
"popular_colors_avg_price": {
"value": 32500
}
},
{
"key": "blue",
"doc_count": 2,
"who_maked": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ford",
"doc_count": 1
},
{
"key": "toyota",
"doc_count": 1
}
]
},
"popular_colors_avg_price": {
"value": 20000
}
},
{
"key": "green",
"doc_count": 2,
"who_maked": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ford",
"doc_count": 1
},
{
"key": "toyota",
"doc_count": 1
}
]
},
"popular_colors_avg_price": {
"value": 21000
}
}
]
}
}
}

Multiple aggregations over the same data source

Bucket the cars by color and look at the average price per color; at the same time, over all cars, look at the manufacturer distribution

Unlike the nested aggregation above, here we do no nesting; we simply place two aggregation clauses side by side in one search.
If who_maked is placed at the same level as popular_colors, both take their input from the parent aggs level (the request below has no query, so the input defaults to all documents):

{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "color"
},
"aggs": {
"popular_colors_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"who_maked": {
"terms": {
"field": "make"
}
}
}
}

Response: who_maked sits at the same level as popular_colors, and shows the manufacturer distribution across all cars:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"popular_colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "red",
"doc_count": 4,
"popular_colors_avg_price": {
"value": 32500
}
},
{
"key": "blue",
"doc_count": 2,
"popular_colors_avg_price": {
"value": 20000
}
},
{
"key": "green",
"doc_count": 2,
"popular_colors_avg_price": {
"value": 21000
}
}
]
},
"who_maked": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "honda",
"doc_count": 3
},
{
"key": "ford",
"doc_count": 2
},
{
"key": "toyota",
"doc_count": 2
},
{
"key": "bmw",
"doc_count": 1
}
]
}
}
}

With the above in mind, suppose we want the highest and lowest price within a color. We can put:

"min_price" : { "min": { "field": "price"} },
"max_price" : { "max": { "field": "price"} }

at the same level as popular_colors_avg_price.

If instead we want the highest and lowest price for a given manufacturer within a color, we should put

"aggs" : {
"min_price" : { "min": { "field": "price"} },
"max_price" : { "max": { "field": "price"} }
}

inside who_maked (who_maked has no aggs yet, so the two metrics must be wrapped in an aggs block). Whether it goes before or after terms, the effect is the same.

Histogram aggregation

The histogram bucket splits a field into segments of interval width.

Count car sales per 20,000-wide price range

GET cars/transactions/_search
{
"size": 0,
"aggs": {
"price": {
"histogram": {
"field": "price",
"interval": 20000
}
}
}
}

Response:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"price": {
"buckets": [
{
"key": 0,
"doc_count": 3
},
{
"key": 20000,
"doc_count": 4
},
{
"key": 40000,
"doc_count": 0
},
{
"key": 60000,
"doc_count": 0
},
{
"key": 80000,
"doc_count": 1
}
]
}
}
}
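The histogram key of a document is its price rounded down to a multiple of interval, and empty intermediate buckets are still emitted. A plain-Python sketch of that bucketing over the same eight prices:

```python
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]
interval = 20000

counts = {}
for p in prices:
    key = (p // interval) * interval  # round down to a multiple of interval
    counts[key] = counts.get(key, 0) + 1

# Elasticsearch also emits the empty buckets between the lowest and highest key.
lo, hi = min(counts), max(counts)
buckets = [{"key": k, "doc_count": counts.get(k, 0)}
           for k in range(lo, hi + interval, interval)]
print(buckets)
# [{'key': 0, 'doc_count': 3}, {'key': 20000, 'doc_count': 4},
#  {'key': 40000, 'doc_count': 0}, {'key': 60000, 'doc_count': 0},
#  {'key': 80000, 'doc_count': 1}]
```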

Count car sales per 20,000-wide price range, and also show the detailed price distribution inside each range

Compared with the previous aggregation, showing the price details simply means adding a sub-aggregation after the histogram clause above:

"aggs": {
"price_items": {
"terms": {
"field": "price"
}
}
}

Each returned bucket then carries child buckets (the price_items entry):

{
"key": 0,
"doc_count": 3,
"price_items": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10000,
"doc_count": 1
},
{
"key": 12000,
"doc_count": 1
},
{
"key": 15000,
"doc_count": 1
}
]
}
}

Working with Kibana

Suppose we count sales per manufacturer within 20,000-wide price ranges; the query would be:

GET cars/transactions/_search
{
"size": 0,
"aggs": {
"price": {
"histogram": {
"field": "price",
"interval": 20000
},
"aggs": {
"make_detail": {
"terms": {
"field": "make"
}
}
}
}
}
}

Response:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"price": {
"buckets": [
{
"key": 0,
"doc_count": 3,
"make_detail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "toyota",
"doc_count": 2
},
{
"key": "honda",
"doc_count": 1
}
]
}
},
{
"key": 20000,
"doc_count": 4,
"make_detail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ford",
"doc_count": 2
},
{
"key": "honda",
"doc_count": 2
}
]
}
},
{
"key": 40000,
"doc_count": 0,
"make_detail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
]
}
},
{
"key": 60000,
"doc_count": 0,
"make_detail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
]
}
},
{
"key": 80000,
"doc_count": 1,
"make_detail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bmw",
"doc_count": 1
}
]
}
}
]
}
}
}

We covered installing Kibana earlier; now we use Kibana's charting to draw the query above.

Open Kibana in the browser and configure the data source.

date histogram

The date_histogram aggregation needs to know which field to use as the time axis, the bucket interval, and the format for the date strings in the output:
Request:

GET cars/transactions/_search
{
"size" : 0,
"aggs": {
"sales": {
"date_histogram": {
"field": "sold",
"interval": "month",
"format": "yyyy-MM-dd"
}
}
}
}

Response:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"sales": {
"buckets": [
{
"key_as_string": "2014-01-01",
"key": 1388534400000,
"doc_count": 1
},
{
"key_as_string": "2014-02-01",
"key": 1391212800000,
"doc_count": 1
},
{
"key_as_string": "2014-03-01",
"key": 1393632000000,
"doc_count": 0
},
{
"key_as_string": "2014-04-01",
"key": 1396310400000,
"doc_count": 0
},
{
"key_as_string": "2014-05-01",
"key": 1398902400000,
"doc_count": 1
},
{
"key_as_string": "2014-06-01",
"key": 1401580800000,
"doc_count": 0
},
{
"key_as_string": "2014-07-01",
"key": 1404172800000,
"doc_count": 1
},
{
"key_as_string": "2014-08-01",
"key": 1406851200000,
"doc_count": 1
},
{
"key_as_string": "2014-09-01",
"key": 1409529600000,
"doc_count": 0
},
{
"key_as_string": "2014-10-01",
"key": 1412121600000,
"doc_count": 1
},
{
"key_as_string": "2014-11-01",
"key": 1414800000000,
"doc_count": 2
}
]
}
}
}
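date_histogram works like histogram, except the bucket key is the timestamp truncated to the interval (here, the first day of the month). A sketch of the monthly truncation over the eight sold dates (this sketch skips the empty months that Elasticsearch fills in):

```python
from collections import Counter

sold = ["2014-10-28", "2014-11-05", "2014-05-18", "2014-07-02",
        "2014-08-19", "2014-11-05", "2014-01-01", "2014-02-12"]

# key_as_string is the date truncated to the start of its month
by_month = Counter(d[:7] + "-01" for d in sold)
print(sorted(by_month.items()))
# [('2014-01-01', 1), ('2014-02-01', 1), ('2014-05-01', 1), ('2014-07-01', 1),
#  ('2014-08-01', 1), ('2014-10-01', 1), ('2014-11-01', 2)]
```

Elasticsearch additionally returns each key as epoch milliseconds alongside key_as_string, and emits empty buckets for the months in between.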

query filter

Color distribution of ford cars only

The aggregation above covered every manufacturer's colors. To look at ford only, we can add a query that selects the ford cars first, and then run the same aggregation:

GET cars/transactions/_search
{
"query": { //this query clause is the new part
"match": {
"make": "ford"
}
},
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "color"
}
}
}
}

Response:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"popular_colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "blue",
"doc_count": 1
},
{
"key": "green",
"doc_count": 1
}
]
}
}
}

Global bucket

Normally the scope of an aggregation is the filtered result of the query, but sometimes we need to aggregate over the whole dataset from within the current context. For example, to compare the average price of ford cars with the average price across all manufacturers.
That is what the global bucket is for: it is unaffected by the current query scope and contains all documents.

Average price of ford cars vs. average price across all manufacturers

GET cars/transactions/_search
{
"size": 0,
"query": {
"match": {
"make": "ford"
}
},
"aggs": {
"福特均价": {
"avg": {
"field": "price"
}
},
"全局桶": {
"global": {},
"aggs": {
"全部均价": {
"avg": {
"field": "price"
}
}
}
}
}
}

Response:

{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"福特均价": {
"value": 27500
},
"全局桶": {
"doc_count": 8,
"全部均价": {
"value": 26500
}
}
}
}
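The arithmetic in this response can be verified directly: the query-scoped metric sees only the two ford documents, while the global bucket averages over all eight. In plain Python:

```python
cars = [
    {"price": 10000, "make": "honda"},  {"price": 20000, "make": "honda"},
    {"price": 30000, "make": "ford"},   {"price": 15000, "make": "toyota"},
    {"price": 12000, "make": "toyota"}, {"price": 20000, "make": "honda"},
    {"price": 80000, "make": "bmw"},    {"price": 25000, "make": "ford"},
]

# the query scope: only the ford documents
ford = [d["price"] for d in cars if d["make"] == "ford"]
ford_avg = sum(ford) / len(ford)                      # 27500.0

# the global bucket: ignores the query, sees every document
all_avg = sum(d["price"] for d in cars) / len(cars)   # 26500.0
```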

match and non-scoring queries (filter)

With constant_score, no relevance score is computed at query time (match does compute scores). The query above rewritten in constant_score form:

{
"size": 0,
"query": {
//changed here from match to a constant_score filter
"constant_score": {
"filter": {
"term": {
"make": "ford"
}
}
}
},
//the rest is unchanged
"aggs": {
"福特均价": {
"avg": {
"field": "price"
}
},
"全局桶": {
"global": {},
"aggs": {
"全部均价": {
"avg": {
"field": "price"
}
}
}
}
}
}

Filtering before bucketing

Inside an aggs, you can filter the documents before they are put into buckets. Of course, if we only run a single aggregation, the filter could equally go into the query.
For example, after querying all honda sales, suppose that in the aggs stage we only want half-month statistics over 2014-11-01 to 2014-11-15. We can add a time filter inside the aggs. The advantage of this approach is that its scope covers only itself and its descendants, never its siblings. Below, besides the "honda半月售价情况" (honda half-month price) statistics we also want "honda所有时间段售价情况" (honda all-time price) statistics, which is exactly where an in-aggs filter comes in handy.

{
"size": 0,
"query": {
"match": {
"make": "honda" //without this query there are 8 documents; with it, 3
}
},
"aggs": {
//3 documents in scope here
"honda半月售价情况": {
"filter": {
"range": {
"sold": {
"from": "2014-11-01",
"to": "2014-11-15"
}
//the time filter narrows this down to 2 documents
}
},
//2 documents in scope here
"aggs": {
"honda半月售价分布": {
"terms": {
"field": "price"
}
},
"honda半月销售额": {
"sum": {
"field": "price"
}
}
}
},
//the filter inside "honda半月售价情况" does not affect its sibling "honda所有时间段售价情况", so this still sees 3 documents
"honda所有时间段售价情况": {
"terms": {
"field": "price"
}
}
}
}

Response:

{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"honda所有时间段售价情况": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 20000,
"doc_count": 2
},
{
"key": 10000,
"doc_count": 1
}
]
},
"honda半月售价情况": {
"doc_count": 2,
"honda半月售价分布": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 20000,
"doc_count": 2
}
]
},
"honda半月销售额": {
"value": 40000
}
}
}
}

post filter

The post_filter runs after the query, just before the hits are returned to the user; the order is:
query -> aggs -> post_filter
So post_filter does not affect the aggs results, but it does affect the returned hits.

For example: I want the color distribution of ford cars and, at the same time, the details of only the green ford cars in the hits:

{
"size": 10,
"query": {
"match": {
"make": "ford"
}
},
"post_filter": { //with this post_filter, the hits shrink from the 2 matching documents to 1
"term": {
"color": "green"
}
},
"aggs": { // aggs is unaffected and still shows buckets for all colors
"all_colors": {
"terms": {
"field": "color"
}
}
}
}

{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.2039728,
"hits": [
{
"_index": "cars",
"_type": "transactions",
"_id": "AV7PgBdUSAU1E8xKCaY2",
"_score": 1.2039728,
"_source": {
"price": 30000,
"color": "green",
"make": "ford",
"sold": "2014-05-18"
}
}
]
},
"aggregations": {
"all_colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "blue",
"doc_count": 1
},
{
"key": "green",
"doc_count": 1
}
]
}
}
}

If all you need is filtering, do not use post_filter: it runs just before the results are handed to the user, misses the filter caches, and is inefficient.

Ordering aggregation results

We can use order inside an aggs to choose the sort. By default, after bucketing, Elasticsearch outputs the buckets by doc_count in descending order, which is equivalent to:

{
"size": 0,
"aggs": {
"colors": {
"terms": {
"field": "color",
//adding this block changes nothing; it is the default
"order": {
"_count": "desc" //change desc to asc for ascending order
}
}
}
}
}

order can be used inside terms, histogram and date_histogram, but the individual sort keys differ in where they apply:

  1. _count
    Sort by document count; usable inside terms, histogram and date_histogram. See the example above.
  2. _term
    Sort terms alphabetically; usable inside terms. For example:

    {
    "size": 0,
    "aggs": {
    "colors": {
    "terms": {
    "field": "color",
"order": { //sort the color values alphabetically, ascending
    "_term": "asc"
    }
    }
    }
    }
    }
  3. _key
    Sort bucket keys numerically; usable inside histogram and date_histogram. For example:

    {
    "size": 0,
    "aggs": {
    "sales": {
    "date_histogram": {
    "field": "sold",
    "interval": "month",
    "format": "yyyy-MM-dd",
"order" : { //the default is ascending by date key; here we switch to descending
    "_key": "desc"
    }
    }
    }
    }
    }

Sorting by an aggregated metric

Suppose we aggregate the average price per color and want the buckets output from the lowest to the highest price; we then need to sort by the aggregation result, for example:

{
"size": 0,
"aggs": {
"colors": {
"terms": {
"field": "color",
"order": { //without this order, buckets are sorted by the count per color, i.e. the doc_count field
"avg_price": "asc"
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}

Response:

{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "blue",
"doc_count": 2,
"avg_price": {
"value": 20000
}
},
{
"key": "green",
"doc_count": 2,
"avg_price": {
"value": 21000
}
},
{
"key": "red",
"doc_count": 4,
"avg_price": {
"value": 32500
}
}
]
}
}
}

Now, what if the aggs are nested and we want to sort by a metric computed at an inner level?
order supports paths of this form:

my_bucket>another_bucket>metric

Simply join the bucket names with angle brackets (>). For example: for red and green cars, split the prices into 20,000-wide ranges. By default the ranges are output from the lowest to the highest price key; to sort them by variance instead, we can write:

{
"size": 0,
"aggs": {
"colors": {
"histogram": {
"field": "price",
"interval": 20000,
"order": { //changed to sort by variance
"红绿色系的汽车>统计结果.variance": "asc"
}
},
"aggs": {
"红绿色系的汽车": {
"filter": {
"terms": {
"color": [
"red",
"green"
]
}
},
"aggs": {
"统计结果": {
"extended_stats": {
"field": "price"
}
}
}
}
}
}
}
}

Response:

{
...
"aggregations": {
"colors": {
"buckets": [
{
"key": 80000,
"doc_count": 1,
"红绿色系的汽车": {
"doc_count": 1,
"统计结果": {
"count": 1,
"min": 80000,
"max": 80000,
"avg": 80000,
"sum": 80000,
"sum_of_squares": 6400000000,
"variance": 0,
"std_deviation": 0,
"std_deviation_bounds": {
"upper": 80000,
"lower": 80000
}
}
}
},
{
"key": 0,
"doc_count": 3,
"红绿色系的汽车": {
"doc_count": 2,
"统计结果": {
"count": 2,
"min": 10000,
"max": 12000,
"avg": 11000,
"sum": 22000,
"sum_of_squares": 244000000,
"variance": 1000000,
"std_deviation": 1000,
"std_deviation_bounds": {
"upper": 13000,
"lower": 9000
}
}
}
},
{
"key": 20000,
"doc_count": 4,
"红绿色系的汽车": {
"doc_count": 3,
"统计结果": {
"count": 3,
"min": 20000,
"max": 30000,
"avg": 23333.333333333332,
"sum": 70000,
"sum_of_squares": 1700000000,
"variance": 22222222.22222225,
"std_deviation": 4714.04520791032,
"std_deviation_bounds": {
"upper": 32761.42374915397,
"lower": 13905.242917512693
}
}
}
},
...
]
}
}
}

cardinality

cardinality is hard to translate directly; roughly it means distinct count, as if the values were put into a set and the set's size were counted.
Running cardinality on a field counts the distinct values of that field, for example:

{
"size": 0,
"aggs": {
"销售的汽车都有几种颜色": {
"cardinality": { //change this to terms if you want the sales count per individual color
"field": "color"
}
}
}
}

The result says there are 3:

{
...
"aggregations": {
"销售的汽车都有几种颜色": {
"value": 3
}
}
}

The big-data triangle: you get to choose two of its three corners:

Exact + real time
Your data fits in the RAM of a single machine. The world is your oyster; use any algorithm you want. Results will be 100% accurate and relatively fast.
Big data + exact
A classic Hadoop installation. Can handle petabytes of data and give you exact answers—but it may take a week to give you that answer.
Big data + real time
Approximate algorithms that give you accurate, but not exact, results.

This uses the HyperLogLog++ (HLL) algorithm, an approximation rather than an exact count. You can trade memory against accuracy by setting precision_threshold inside cardinality; memory usage is precision_threshold * 8 bytes, and the maximum allowed value is 40000.
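An exact distinct count is just a set, which is what the sketch below does; HLL trades that set's unbounded memory for a fixed budget of precision_threshold * 8 bytes. A plain-Python sketch of the exact version:

```python
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

# exact distinct count: put every value into a set and take its size
distinct = len(set(colors))
print(distinct)  # 3

# HLL's worst-case memory budget for a cardinality aggregation,
# using an example precision_threshold of 3000
precision_threshold = 3000
memory_bytes = precision_threshold * 8  # 24000 bytes
```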

percentile

percentiles ranks values on a percentile scale, again using an approximate algorithm (TDigest). Supposedly "compression" can tune its accuracy much like precision_threshold, but I could not get that to work on 5.3. The approximation is most accurate toward the two extremes of the distribution, which suits outlier hunting (outliers are, by definition, far from the mainstream).
For example: looking at website latency.

First create a new index, website. You may hit a "Fielddata is disabled" error on the zone field; as with the cars index, add "fielddata": true to the zone field:

POST website/logs/_bulk
{ "index": {}}
{ "latency" : 100, "zone" : "US", "timestamp" : "2014-10-28" }
{ "index": {}}
{ "latency" : 80, "zone" : "US", "timestamp" : "2014-10-29" }
{ "index": {}}
{ "latency" : 99, "zone" : "US", "timestamp" : "2014-10-29" }
{ "index": {}}
{ "latency" : 102, "zone" : "US", "timestamp" : "2014-10-28" }
{ "index": {}}
{ "latency" : 75, "zone" : "US", "timestamp" : "2014-10-28" }
{ "index": {}}
{ "latency" : 82, "zone" : "US", "timestamp" : "2014-10-29" }
{ "index": {}}
{ "latency" : 100, "zone" : "EU", "timestamp" : "2014-10-28" }
{ "index": {}}
{ "latency" : 280, "zone" : "EU", "timestamp" : "2014-10-29" }
{ "index": {}}
{ "latency" : 155, "zone" : "EU", "timestamp" : "2014-10-29" }
{ "index": {}}
{ "latency" : 623, "zone" : "EU", "timestamp" : "2014-10-28" }
{ "index": {}}
{ "latency" : 380, "zone" : "EU", "timestamp" : "2014-10-28" }
{ "index": {}}
{ "latency" : 319, "zone" : "EU", "timestamp" : "2014-10-29" }

Remember that the POST body above must end with a newline.

GET website/logs/_search
{
"size": 0,
"aggs": {
"加载时间分布": {
"percentiles": {
"field": "latency"
}
},
"平均加载时间分布": {
"avg": {
"field": "latency"
}
}
}
}

Response:

{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 12,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"加载时间分布": {
"values": {
"1.0": 75.55,
"5.0": 77.75,
"25.0": 94.75,
"50.0": 101,
"75.0": 289.75,
"95.0": 489.34999999999985,
"99.0": 596.2700000000002
}
},
"平均加载时间分布": {
"value": 199.58333333333334
}
}
}

By default the percentile marks shown are [1, 5, 25, 50, 75, 95, 99]. To adjust them, add a custom list inside percentiles, e.g. "percents": [50, 95.0, 99.0].

percentile_ranks

Now let's look at percentile_ranks, the inverse of percentiles: percentiles takes percentile marks and returns the values at those marks, while percentile_ranks takes values and returns the percentage of observations falling between 0 and each value.

For example, on the EU data only, let's see what percentage of latencies fall within 0-90, 0-101, 0-210 and 0-800:

GET website/logs/_search
{
"size": 10,
"query": {
"constant_score": {
"filter": {
"term": {
"zone": "eu"
}
}
}
},
"_source": [ //only return the latency field to the user; _source can also be used in aggregations, e.g. to pick fields in top_hits
"latency"
],
"sort": [ //sort the returned hits ascending by latency
{
"latency": {
"order": "asc"
}
}
],
"aggs": {
"zones": {
"percentile_ranks": {
"field": "latency",
"values": [
90,
101,
210,
800
]
}
}
}
}

{
...
"hits": {
"total": 6,
"max_score": 1,
"hits": [
{
...
"latency": 100
},
{
...
"latency": 155
},
{
...
"latency": 280
},
{
...
"latency": 319
},
{
...
"latency": 380
},
{
...
"latency": 623
}
]
},
"aggregations": {
"zones": {
"values": {
"90.0": 5.303030303030303, //exactly 0% of latencies are <= 90, but the approximation reports ~5%; compare with the raw data above
"101.0": 8.636363636363637, //exactly 1/6 ≈ 17% are <= 101
"210.0": 31.944444444444443, //exactly 2/6 ≈ 33% are <= 210
"800.0": 100 //6/6 = 100% are <= 800; only this value is exact
}
}
}
}
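The exact (non-approximate) version of percentile_ranks is just the fraction of observations less than or equal to each value; comparing it with the TDigest output above shows the approximation error at 90 and 101. A sketch:

```python
eu_latency = [100, 280, 155, 623, 380, 319]  # the six EU documents above

def percentile_rank(data, value):
    """Exact percentile rank: percentage of observations <= value."""
    return 100.0 * sum(1 for x in data if x <= value) / len(data)

for v in (90, 101, 210, 800):
    print(v, round(percentile_rank(eu_latency, v), 2))
# 90 0.0
# 101 16.67
# 210 33.33
# 800 100.0
```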

Data relationships

{ "tag": [ "search", "nosql" ] }
A field whose value is an array like this cannot be indexed by position (e.g. tag[0]), because arrays stored in Elasticsearch are unordered and may not keep the order you wrote them in.

Multi-level objects

For example, take a multi-level object like this:

{
"tweet": "Elasticsearch is very flexible",
"user": {
"id": "@johnsmith",
"gender": "male",
"age": 26,
"name": {
"full": "John Smith",
"first": "John",
"last": "Smith"
}
}
}

Its mapping is:

{
"gb": {
"tweet": {
"properties": {
"tweet": { "type": "string" },
"user": {
"type": "object",
"properties": {
"id": { "type": "string" },
"gender": { "type": "string" },
"age": { "type": "long" },
"name": {
"type": "object",
"properties": {
"full": { "type": "string" },
"first": { "type": "string" },
"last": { "type": "string" }
}
}
}
}
}
}
}
}

As you can see, the user and name keys are automatically mapped to object.

Lucene has no notion of nested structures; it flattens all object data into a single plane:

{
"tweet": [elasticsearch, flexible, very],
"user.id": [@johnsmith],
"user.gender": [male],
"user.age": [26],
"user.name.full": [john, smith],
"user.name.first": [john],
"user.name.last": [smith]
}

So you can reach multi-level object values via paths like user.name.full.

Arrays of objects

Now suppose the array contains objects, for example:

PUT my_index/blogpost/1
{
"title": "Nest eggs",
"body": "Making your money work...",
"tags": [
"cash",
"shares"
],
"comments": [
{
"name": "John Smith",
"comment": "Great article",
"age": 28,
"stars": 4,
"date": "2014-09-01"
},
{
"name": "Alice White",
"comment": "More like this please",
"age": 31,
"stars": 5,
"date": "2014-10-22"
}
]
}

By default, dynamic mapping treats comments as an object:

"mappings": {
"blogpost": {
"properties": {
"body": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"comments": {
"properties": {
"age": {
"type": "long"
},
"comment": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
...
}
},
"tags": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}

Values under the same key are merged into arrays, so inside Lucene the storage presumably looks like this:

{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ],
"comments.name": [ alice, john, smith, white ],
"comments.comment": [ article, great, like, more, please, this ],
"comments.age": [ 28, 31 ],
"comments.stars": [ 4, 5 ],
"comments.date": [ 2014-09-01, 2014-10-22 ]
}

If we search for blog posts whose comments contain the name Alice with age 28:

{
"query": {
"bool": {
"must": [
{ "match": { "comments.name": "Alice" }},
{ "match": { "comments.age": 28 }}
]
}
}
}

Judging from the raw data, this search should return nothing: Alice is 31, not 28. But after Lucene flattens the structure, the search does return this document.

Not what we expected, is it? Hence Nested Objects.
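The cross-match can be reproduced in a few lines: flattening merges the field values of every comment, so "Alice" and 28 match even though they come from different comments, while nested matching checks each comment as a unit. A sketch:

```python
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]

# Flattened (default object mapping): values are merged across all comments.
flat = {
    "comments.name": [w.lower() for c in comments for w in c["name"].split()],
    "comments.age":  [c["age"] for c in comments],
}
flat_match = "alice" in flat["comments.name"] and 28 in flat["comments.age"]
print(flat_match)    # True -- a false positive across two different comments

# Nested mapping: each comment is matched on its own.
nested_match = any("alice" in c["name"].lower() and c["age"] == 28
                   for c in comments)
print(nested_match)  # False -- no single comment has both Alice and 28
```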

Nested Objects

Before POSTing the data, create the index and set the type of comments to nested; the comments then become nested objects.

PUT my_index
{
"mappings": {
"blogpost": {
"properties": {
"comments": {
"type": "nested", //the only difference is here; the default type is object
"properties": {
"name": { "type": "string" },
"comment": { "type": "string" },
"age": { "type": "short" },
"stars": { "type": "short" },
"date": { "type": "date" }
}
},
"tags": { //we will aggregate on tags later, so set fielddata now
"type": "text",
"fielddata": true
}
}
}
}
}

Then insert the data as before and rerun the earlier query: no results. Even changing it to Alice and 31 still returns nothing!
Why?
After switching to nested objects, queries do not carry over seamlessly; you must explicitly mark the query as nested:

GET my_index/blogpost/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "eggs"
}
},
{
"nested": { //wrap in a nested clause and set path to comments
"path": "comments",
"query": {
"bool": {
"must": [
{
"match": {
"comments.name": "Alice" //the key must be the full path comments.name; comments cannot be omitted
}
},
{
"match": {
"comments.age": 31
}
}
]
}
}
}
}
]
}
}
}

The data comes back as expected.

These nested objects appear to be Elasticsearch's own extension: the nested objects and the outer object are stored as separate Lucene documents, roughly like this:

{
"comments.name": [ john, smith ],
"comments.comment": [ article, great ],
"comments.age": [ 28 ],
"comments.stars": [ 4 ],
"comments.date": [ 2014-09-01 ]
}
{
"comments.name": [ alice, white ],
"comments.comment": [ like, more, please, this ],
"comments.age": [ 31 ],
"comments.stars": [ 5 ],
"comments.date": [ 2014-10-22 ]
}
{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ]
}

With this layout, a nested query lets Elasticsearch search within a single comment object, never across comments. The nested objects and the outer object still belong to the same document.

nested objects have a few characteristics:

  1. You cannot access the nested objects directly.
  2. Adding, deleting or changing any nested object reindexes the whole document (the outer object plus all nested objects).
  3. When a query matches a nested object, the whole document is returned, not the nested object alone.
  4. Scores from matches on nested objects are folded into the document score; score_mode=avg, max, sum or none chooses how, with avg as the default.

Sorting by nested objects

Suppose a query returns many documents and we want to sort them by some field of their nested objects (not just within each document, but across documents, so that the document order is driven by a nested field value). How?

First build the data: on top of the earlier document, insert a second one:

PUT my_index/blogpost/2
{
"title": "Investment secrets",
"body": "What they don't tell you ...",
"tags": [ "shares", "equities" ],
"comments": [
{
"name": "Mary Brown",
"comment": "Lies, lies, lies",
"age": 42,
"stars": 1,
"date": "2014-10-18"
},
{
"name": "John Smith",
"comment": "You're making it up!",
"age": 28,
"stars": 2,
"date": "2014-10-16"
}
]
}

Now blogpost holds two documents. Suppose we want the comments from October 2014, sorted by the stars the commenters gave, to see which articles and remarks got the worst reviews. We can write:

{
"query": {
"nested": {
"path": "comments",
"query": { //the official docs use filter here; on 5.3 that raised an error, so this uses query
"range": {
"comments.date": {
"gte": "2014-10-01",
"lt": "2014-11-01"
}
}
}
}
},
//the query above is easy to follow: it selects every document that contains a comment from October 2014
"sort": {
"comments.stars": {
"order": "asc",
"mode": "min",
"nested_filter": { //why filter again? The query above returns whole documents, and the sort input is also whole documents. Filtering here keeps only the October 2014 nested objects for sorting; otherwise a lower comments.stars from, say, May 2014 could drive the sort, which is clearly not what we want.
"range": {
"comments.date": {
"gte": "2014-10-01",
"lt": "2014-11-01"
}
}
}
}
}
}

Aggregating on nested objects

To aggregate on nested objects, use the nested aggregation and specify its path parameter.
For example, to bucket the comments by month and then look at the rating distribution within each month, we can query like this:

GET my_index/blogpost/_search
{
"size": 0,
"aggs": {
"评论": {
"nested": { //this declares a nested aggregation into comments
"path": "comments"
},
"aggs": {
"不同月份的评分": {
"date_histogram": {
"field": "comments.date", //the field value must carry the comments prefix as part of the path
"interval": "month",
"format": "yyyy-MM"
},
"aggs": {
"评分分布详情": {
"terms": {
"field": "comments.stars"
}
},
"平均评分": {
"avg": {
"field": "comments.stars"
}
}
}
}
}
}
}
}

Response:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"评论": {
"doc_count": 4,
"不同月份的评分": {
"buckets": [
{
"key_as_string": "2014-09",
"key": 1409529600000,
"doc_count": 1,
"评分分布详情": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 4,
"doc_count": 1
}
]
},
"平均评分": {
"value": 4
}
},
{
"key_as_string": "2014-10",
"key": 1412121600000,
"doc_count": 3,
"评分分布详情": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1
},
{
"key": 2,
"doc_count": 1
},
{
"key": 5,
"doc_count": 1
}
]
},
"平均评分": {
"value": 2.6666666666666665
}
}
]
}
}
}
}

reverse_nested

Similar in spirit to the global bucket: once inside an inner (nested) aggregation, how do we step back out to the outer data?
global ignores the current query results and restarts from the whole index or type; reverse_nested ignores the current nested scope and restarts from the root document.
Suppose we bucket the commenters by age range and then want to know which kinds of articles each age group is interested in. The aggregation can be written like this:

GET my_index/blogpost/_search
{
"size": 0,
"aggs": {
"评论": {
"nested": {
"path": "comments"
},
"aggs": {
"年龄分布": {
"histogram": {
"field": "comments.age",
"interval": 10
},
"aggs": {
"博客": {
"reverse_nested": {},
"aggs": {
"文章标签": {
"terms": {
"field": "tags"
}
}
}
}
}
}
}
}
}
}

The response:
{
...
"aggregations": {
"评论": {
"doc_count": 4,
"年龄分布": {
"buckets": [
{
"key": 20, // 2 comments from commenters aged 20-30; the tags they are interested in are shares, cash, equities
"doc_count": 2,
"博客": {
"doc_count": 2,
"文章标签": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "shares",
"doc_count": 2
},
{
"key": "cash",
"doc_count": 1
},
{
"key": "equities",
"doc_count": 1
}
]
}
}
},
{
"key": 30,
"doc_count": 1,
"博客": {
"doc_count": 1,
"文章标签": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cash",
"doc_count": 1
},
{
"key": "shares",
"doc_count": 1
}
]
}
}
},
{
"key": 40,
"doc_count": 1,
"博客": {
"doc_count": 1,
"文章标签": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "equities",
"doc_count": 1
},
{
"key": "shares",
"doc_count": 1
}
]
}
}
}
]
}
}
}
}

Parent/child relationships

The difference between nested objects and parent/child is that with nested objects all the entities live inside the same document, while with parent/child the parent and child entities live in separate documents.
Elasticsearch still guarantees that a parent and its children end up on the same shard.
Compared with nested objects, the advantage is that updating one child does not touch the parent or the other children: a child can be updated on its own without reindexing the whole parent/child set, and child documents can be returned on their own in search results (nested objects are always returned inside the enclosing document).

Let's walk through a three-level parent/child hierarchy: a country has branches, and each branch has employees. The relationships between the types are declared in the mappings when the index is created:

PUT
company
{
"mappings": {
"country": {
"properties": {
"name": {
"type": "string",
"fielddata": true
}
}
},
"branch": {
"_parent": { // the parent of branch is country
"type": "country"
}
},
"employee": {
"_parent": { // the parent of employee is branch
"type": "branch"
},
"properties": {
"hobby": {
"type": "string",
"fielddata": true
}
}
}
}
}

Insert the countries:

POST
company/country/_bulk
{ "index": { "_id": "uk" }}
{ "name": "UK" }
{ "index": { "_id": "france" }}
{ "name": "France" }

When a parent is specified, the routing defaults to the parent's key. Given the shard formula:
shard = hash(routing) % number_of_primary_shards
this places the parent and the child on the same shard.
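That routing formula can be sketched in Python. This is illustrative only: the real implementation hashes the routing value with murmur3, and crc32 here is a stand-in.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # the default number of primary shards in this version

def shard_for(routing: str) -> int:
    # shard = hash(routing) % number_of_primary_shards
    # (Elasticsearch uses murmur3; crc32 is only a stand-in here.)
    return zlib.crc32(routing.encode("utf-8")) % NUMBER_OF_PRIMARY_SHARDS

# A branch indexed with parent "uk" inherits routing "uk", so it is
# routed exactly like the "uk" country document itself:
assert shard_for("uk") == shard_for("uk")
```

Because every document with the same routing value maps to the same shard, setting a child's routing to its parent's key is all it takes to co-locate them.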

Insert the branches, either one at a time (the parent must be specified) or in bulk with the bulk API:

POST
company/branch/_bulk
{ "index": { "_id": "london", "parent": "uk" }}
{ "name": "London Westminster" }
{ "index": { "_id": "liverpool", "parent": "uk" }}
{ "name": "Liverpool Central" }
{ "index": { "_id": "paris", "parent": "france" }}
{ "name": "Champs élysées" }

Insert the employees. They can be inserted one at a time (specify the parent, and set the routing to the grandparent's ID so that all three generations land on the same shard):

PUT
company/employee/1?parent=london&routing=uk
{
"name": "Alice Smith",
"dob": "1970-10-24",
"hobby": "hiking"
}

Or bulk-insert them with the bulk API:

POST
company/employee/_bulk
{ "index": { "_id": 1, "parent": "london", "routing":"uk" }}
{ "name": "Alice Smith", "dob": "1970-10-24", "hobby": "hiking" }
{ "index": { "_id": 2, "parent": "london", "routing":"uk" }}
{ "name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving" }
{ "index": { "_id": 3, "parent": "liverpool", "routing":"uk" }}
{ "name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking" }
{ "index": { "_id": 4, "parent": "paris", "routing":"france" }}
{ "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" }

The has_child query

Use the has_child query to search across a parent/child relationship.
For example, to find the countries whose employees like hiking:

GET
company/country/_search
{
"query": {
"has_child": {
"type": "branch",
"query": {
"has_child": {
"type": "employee",
"query": {
"match": {
"hobby": "hiking"
}
}
}
}
}
}
}

The response:

{
...
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "company",
"_type": "country",
"_id": "uk",
"_score": 1,
"_source": {
"name": "UK"
}
}
]
}
}

As before, score_mode can be set to avg, max, sum or none to choose how the child scores are accumulated; the default is none.
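How these modes fold the scores of the matching children into the parent's score can be sketched as a simple model. This is illustrative only, not Elasticsearch internals, and the constant score of 1.0 for none is an assumption:

```python
def combine_child_scores(child_scores, score_mode="none"):
    """Illustrative model of has_child's score_mode (not ES internals)."""
    if score_mode == "none" or not child_scores:
        return 1.0  # matching parents just get a constant score
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    raise ValueError(f"unknown score_mode: {score_mode}")

print(combine_child_scores([0.5, 1.5], "avg"))  # → 1.0
print(combine_child_scores([0.5, 1.5], "max"))  # → 1.5
```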

has_parent

Having seen has_child, has_parent is naturally its opposite: when querying children, it checks whether a matching parent exists. For example, to find all employees belonging to the London branch:

GET
company/employee/_search
{
"query": {
"has_parent": {
"type": "branch",
"score_mode": "none", // the default score_mode is none; change it to score to have scores computed
"query": {
"match": {
"name": "london"
}
}
}
}
}

Here score_mode can be set to score or none; the default is none. Computing the score is expensive, so keeping the default of none is recommended.

min_children and max_children

Both parameters constrain the number of matching children, i.e. a parent matches when min_children <= children_count <= max_children:

GET
company/branch/_search
{
"query": {
"has_child": {
"type": "employee",
"min_children": 2,
"query": {
"match_all": {}
}
}
}
}

children aggregation

Mirroring nested objects, where the nested aggregation lets us aggregate over nested objects, a parent/child relationship gives us the children keyword to aggregate over children; I have not yet found a way to aggregate over grandchildren, though.
Also, while nested objects have reverse_nested, parent/child relationships have no equivalent: inside a children aggregation there is no way to step back up to the parent level.
Suppose we want the distribution of employee hobbies per branch; the query can be written as:

GET
company/branch/_search
{
"size": 0,
"aggs": {
"分支机构": {
"terms": {
"field": "name"
},
"aggs": {
"佣员列表": {
"children": {
"type": "employee"
},
"aggs": {
"兴趣爱好": {
"terms": {
"field": "hobby"
}
}
}
}
}
}
}
}

The response:

{
...
"aggregations": {
"分支机构": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
...
{
"key": "liverpool",
"doc_count": 1,
"佣员列表": {
"doc_count": 1,
"兴趣爱好": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hiking",
"doc_count": 1
}
]
}
}
},
{
"key": "london", // read: the employees working in London have two hobbies, diving and hiking, one person each.
"doc_count": 1,
"佣员列表": {
"doc_count": 2,
"兴趣爱好": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "diving",
"doc_count": 1
},
{
"key": "hiking",
"doc_count": 1
}
]
}
}
},
...

Another advantage of parent/child is that the children can be returned on their own, whereas nested objects return the whole document. On the other hand, parent/child queries are weaker in some respects, e.g. there is no reverse_nested equivalent, and there is a performance trade-off: if you care about index-time (insert/update) performance more than search-time performance, parent/child is the better choice; if search-time matters, prefer nested objects, whose queries are 5-10x faster than the parent/child equivalents.

Parent/child suits scenarios with few parents and many children, while nested objects suit scenarios with relatively few children.

Parent/child joins are built through global ordinals.

Drawbacks:
The more levels there are, the more time is spent on joins, and every generation of parents must keep its _id values in memory, which can consume a lot of it.

global ordinals

By default global ordinals are built lazily, i.e. only when a query or aggregation first needs them, but you can change this behavior with a per-field mapping setting:

"fielddata": {
"loading": "eager_global_ordinals"
}

so that the field's global ordinals are rebuilt whenever data is updated (under continuous updates they are rebuilt once per refresh_interval, which costs performance).