Elasticsearch: Verifying the Performance of Splitting Data into a Separate Index

Suppose a query against some table uses two main fields in its term clauses, rule_minor_type and rule_major_type. Different value combinations of these fields identify different kinds of data; one particular combination identifies what we will call risk data. Relative to the original table, the risk data is tiny: the original table holds on the order of a hundred million rows, while the risk data is only about one million.

We want to verify whether splitting this risk data out into a separate table ahead of time speeds up queries, and if so, by how much.

The main tool for this experiment is the reindex API; see the official documentation for its usage. You can send RESTful requests with curl, but that is clumsy on the command line. We previously covered installing the Elasticsearch plugin cerebro, so today we will use cerebro to send the requests. My Elasticsearch runs on another machine and had neither remote access nor CORS enabled, so following that earlier article, enable CORS support, then browse to port 9000 on the host running cerebro, enter the Elasticsearch server's address, and open the REST panel:
cerebro
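For reference, the CORS settings live in elasticsearch.yml. A minimal sketch (setting names per the ES 5.x HTTP module docs; "*" is permissive and should be narrowed to the cerebro host in production):

```yaml
# elasticsearch.yml -- allow cross-origin requests so cerebro's REST panel can reach the cluster
http.cors.enabled: true
http.cors.allow-origin: "*"
```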

The original query

The original query covers 7 days of data:

GET
test-2017-09-21,test-2017-09-22,test-2017-09-23,test-2017-09-24,test-2017-09-25,test-2017-09-26,test-2017-09-27/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "record_time": {
                  "gte": 1505923200,
                  "lte": 1506527999
                }
              }
            },
            { "term": { "rule_minor_type": 40 } },
            { "term": { "rule_major_type": 1 } },
            { "term": { "exclude_type": 0 } }
          ]
        }
      }
    }
  },
  "aggs": {
    "total_count": {
      "cardinality": {
        "script": {
          "inline": "doc['src_ip'].value + '|' + doc['src_branch_id'].value",
          "lang": "painless"
        },
        "precision_threshold": 5000
      }
    }
  },
  "size": 0
}
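The record_time bounds 1505923200 and 1506527999 correspond to 2017-09-21 00:00:00 through 2017-09-27 23:59:59, assuming epoch seconds with the day boundary at UTC+8 local midnight. A small sketch that derives the window and the index list for an arbitrary start date:

```python
from datetime import datetime, timedelta, timezone

# Assumption: record_time is epoch seconds and days roll over at UTC+8 midnight.
CST = timezone(timedelta(hours=8))

start = datetime(2017, 9, 21, tzinfo=CST)
gte = int(start.timestamp())      # 1505923200, first second of day 1
lte = gte + 7 * 86400 - 1         # 1506527999, last second of day 7
indices = ",".join(
    (start + timedelta(days=d)).strftime("test-%Y-%m-%d") for d in range(7)
)
print(gte, lte)
print(indices)
```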

The response:

{
  "took": 663,
  "timed_out": false,
  "_shards": {
    "total": 7,
    "successful": 7,
    "failed": 0
  },
  "hits": {
    "total": 2826534,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "total_count": {
      "value": 4688
    }
  }
}
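Pulling the interesting numbers out of a response shaped like this is straightforward; a sketch, with the values copied from the output above:

```python
# Response dict with the shape (and values) of the search response above.
resp = {
    "took": 663,
    "hits": {"total": 2826534, "max_score": 0, "hits": []},
    "aggregations": {"total_count": {"value": 4688}},
}

took_ms = resp["took"]
risk_docs = resp["hits"]["total"]                        # in ES 5.x, total is a plain integer
distinct = resp["aggregations"]["total_count"]["value"]  # approximate distinct src_ip|src_branch_id pairs
print(took_ms, risk_docs, distinct)
```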

This timing is not reliable, though, because of caching.

First, let's see how much data these indices hold in total:

GET
test-2017-09-21,test-2017-09-22,test-2017-09-23,test-2017-09-24,test-2017-09-25,test-2017-09-26,test-2017-09-27/_search
{
  "query": {
    "match_all": {}
  }
}

The response:

{
  "took": 637,
  "timed_out": false,
  "_shards": {
    "total": 7,
    "successful": 7,
    "failed": 0
  },
  "hits": {
    "total": 608212533,
    "max_score": 1,
    "hits":
...

So we have about 600 million documents in total (608,212,533), of which about 2.8 million (2,826,534) are risk data.
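In other words, the risk subset is well under half a percent of the total, which is what makes a dedicated index attractive. The arithmetic:

```python
total = 608_212_533   # hits.total across the 7 original indices
risk = 2_826_534      # hits.total with the risk-type filters applied

share = risk / total
print(f"risk data is {share:.2%} of the total")
```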

Reindexing

We reindex one index at a time. First, it's worth checking how many risk documents a single index holds:

GET
test-2017-09-22/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "rule_minor_type": 40 } },
        { "term": { "rule_major_type": 1 } }
      ]
    }
  }
}

The response:

{
  "took": 152,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 381551,
    "max_score": 2,
    "hits": [
      {
        "_index": "test-2017-09-22",
...

So this one index holds about 380 thousand risk documents (381,551). With this number in hand, we can compare counts after the reindex finishes to verify that the new index was built correctly.

Before calling reindex, we can create the destination index first and set its parameters, for example:

PUT
test-104-2017-11-11
{
  "mappings": {
    "access": {
      "include_in_all": false
    }
  },
  "settings": {
    "index": {
      "refresh_interval": "60s",
      "number_of_shards": "1",
      "store": {
        "type": "mmapfs"
      },
      "number_of_replicas": "0"
    }
  }
}

Since a template for test-* was registered earlier, any newly created index whose name matches picks up the template automatically. If you have nothing to change in the settings, this step can be skipped entirely, and you can reindex straight from the original index with a filter:

POST
_reindex
{
  "source": {
    "index": "test-2017-09-27",
    "query": {
      "bool": {
        "must": [
          { "term": { "rule_minor_type": 40 } },
          { "term": { "rule_major_type": 1 } }
        ]
      }
    }
  },
  "dest": {
    "index": "test-104-2017-09-27"
  }
}

The response:

{
  "took": 410,
  "timed_out": false,
  "total": 1250,
  "updated": 0,
  "created": 1250,
  "deleted": 0,
  "batches": 2,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": []
}
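The things to check in each reindex response are that every source document was written and that nothing failed or conflicted. Assuming the response has been parsed into a dict, a sanity check might look like:

```python
def reindex_ok(resp: dict) -> bool:
    """True when every source doc was written and nothing failed or conflicted."""
    return (
        not resp["failures"]
        and resp["version_conflicts"] == 0
        and resp["created"] + resp["updated"] == resp["total"]
    )

# Values copied from the _reindex response above.
resp = {
    "total": 1250, "created": 1250, "updated": 0,
    "version_conflicts": 0, "failures": [],
}
print(reindex_ok(resp))
```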

Repeat this for each day until every index has been reindexed:

# curl http://127.0.0.1:9200/_cat/indices?v 2>/dev/null | grep "test-.*2017-09-2[1-7]" | sort -k 3
green open test-104-2017-09-22 b_9xSxHgTNO49t4Rjn62Pw 1 0 381551 0 138.6mb 138.6mb
green open test-104-2017-09-23 tOopBSYdTYGyv9BdsuITrA 1 0 32800 0 12.6mb 12.6mb
green open test-104-2017-09-24 2_WMS_qwQ0qKWKTiN80xig 1 0 939358 0 335mb 335mb
green open test-104-2017-09-25 04DGhnIKSAKteI1wGX4KXg 1 0 836475 0 299.8mb 299.8mb
green open test-104-2017-09-26 UqWD6MSFS_ytTUm69JYKHA 1 0 635100 0 228.3mb 228.3mb
green open test-104-2017-09-27 8lADVH3NTMmZ28Qq-xq75w 1 0 1250 0 521.3kb 521.3kb
green open test-2017-09-21 oUN09yl7SGq-Gv9C71ZtsQ 1 0 178000 0 63.9mb 63.9mb
green open test-2017-09-22 lwTULw_YSiSsY6MtVM4p_A 1 0 80708969 0 26.9gb 26.9gb
green open test-2017-09-23 g60GbCCcT1mqY6AhTBM6zQ 1 0 6098400 0 1.9gb 1.9gb
green open test-2017-09-24 wV-Jr-UaSYGpiBj-BJPbVw 1 0 208154658 0 70.4gb 70.4gb
green open test-2017-09-25 SMaa0TDASJeDd_nESROOjQ 1 0 176896598 0 60.3gb 60.3gb
green open test-2017-09-26 C3pWVq2CTuiF1UnV-wjcSw 1 0 136174658 0 45.1gb 45.1gb
green open test-2017-09-27 4GojoCCHTC6wbDxFbog7cw 1 0 1250 0 565.9kb 565.9kb
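The docs.count column of _cat/indices gives another consistency check: the test-104-* document counts should add up to the risk total from the original query. A sketch, with the sample lines copied from the output above:

```python
# Lines copied from the _cat/indices output above.
cat_output = """\
green open test-104-2017-09-22 b_9xSxHgTNO49t4Rjn62Pw 1 0 381551 0 138.6mb 138.6mb
green open test-104-2017-09-23 tOopBSYdTYGyv9BdsuITrA 1 0 32800 0 12.6mb 12.6mb
green open test-104-2017-09-24 2_WMS_qwQ0qKWKTiN80xig 1 0 939358 0 335mb 335mb
green open test-104-2017-09-25 04DGhnIKSAKteI1wGX4KXg 1 0 836475 0 299.8mb 299.8mb
green open test-104-2017-09-26 UqWD6MSFS_ytTUm69JYKHA 1 0 635100 0 228.3mb 228.3mb
green open test-104-2017-09-27 8lADVH3NTMmZ28Qq-xq75w 1 0 1250 0 521.3kb 521.3kb
"""

# Columns: health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
total_docs = sum(
    int(line.split()[6])
    for line in cat_output.splitlines()
    if line.split()[2].startswith("test-104-")
)
print(total_docs)  # should match the 2826534 risk hits from the original query
```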

The new query

Now query the data from the newly created risk indices:

GET
test-104-2017-09-22,test-104-2017-09-23,test-104-2017-09-24,test-104-2017-09-25,test-104-2017-09-26,test-104-2017-09-27/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "record_time": {
                  "gte": 1505923200,
                  "lte": 1506527999
                }
              }
            },
            { "term": { "exclude_type": 0 } }
          ]
        }
      }
    }
  },
  "aggs": {
    "total_count": {
      "cardinality": {
        "script": {
          "inline": "doc['src_ip'].value + '|' + doc['src_branch_id'].value",
          "lang": "painless"
        },
        "precision_threshold": 5000
      }
    }
  },
  "size": 0
}

The response:

{
  "took": 577,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 2826534,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "total_count": {
      "value": 4688
    }
  }
}
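Both runs return identical hit and cardinality counts, so the split index gives the same answer. Comparing the two cache-warm took values from the responses above (663 ms vs 577 ms) gives a rough sense of the steady-state difference:

```python
old_took, new_took = 663, 577   # ms, "took" from the two cache-warm responses above

reduction = (old_took - new_took) / old_took
print(f"warm-cache query time reduced by {reduction:.1%}")
```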

Conclusion

When only a small amount of data has to be read from disk, the two queries perform about the same. The first (cold) query, however, has to load a large amount of data from disk, so there the difference is dramatic:
result-chart