文件替换导致进程内存损坏

某客户的一台机器,无法访问文件上传端口(端口4430),但是可以访问webui端口(端口443),两者都是Apache提供的服务,只是运行时指定了不同的配置文件。

疑问1:为何443端口可以访问而4430端口无法访问?

其运行方式如:

1
httpd -f 无法访问配置.conf

排除-网络原因

先在本机访问4430,连接上了便没了下文:

1
2
3
~ $ wget --no-check-certificate https://127.0.0.1:4430/
--2018-12-19 16:36:04-- https://127.0.0.1:4430/
Connecting to 127.0.0.1:4430... connected.

由于本机也不正常,基本可排除网络方面导致的原因,例如防火墙等。

定位-发生进程

查看4430关联的进程:

1
2
~ $ netstat -nap | grep 4430
tcp6 129 0 :::4430 :::* LISTEN 652/httpd

关联进程652,其启动时间为2018-12-04 09:15:09,最开始出现SIGSEGV是在2018-12-10 11:50:24,当前时间为2018-12-20,说明进程运行了很多天了。

查看此apache进程的error日志,不停出现类似错误:

1
2
3
4
5
[Mon Dec 10 11:50:24 2018] [notice] child pid 24485 exit signal Segmentation fault (11)
[Mon Dec 10 11:50:26 2018] [notice] child pid 24487 exit signal Segmentation fault (11)
[Mon Dec 10 11:55:26 2018] [notice] child pid 29896 exit signal Segmentation fault (11)
[Mon Dec 10 12:07:08 2018] [notice] child pid 36650 exit signal Segmentation fault (11)
[Mon Dec 10 12:56:40 2018] [notice] child pid 6655 exit signal Segmentation fault (11)

通过strace -FfTt关联,发现也是不停输出SIGSEGV。

没有coredump,所以通过gdb调试此进程,由于出core的是在子进程,需要“set follow-fork-mode child”,以便调试子进程。
发现没有glibc调试符号,安装:

1
debuginfo-install glibc

进入gdb:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
~ $ gdb -p 652
(gdb) set follow-fork-mode child
(gdb) c
Continuing.
[New process 5685]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f43a0379740 (LWP 5685)]
(gdb) bt full n
#0 strcmp () at ../sysdeps/x86_64/strcmp.S:135
No locals.
#1 0x00007f43a0171b13 in _dl_map_object (loader=0x7f43a037d4c8, name=name@entry=0x7fff36cbba80 "libnss_sss.so.2", type=type@entry=2,
trace_mode=trace_mode@entry=0, mode=mode@entry=-1879048191, nsid=0) at dl-load.c:2111
soname = <optimized out>
fd = <optimized out>
realname = 0x5 <Address 0x5 out of bounds>
l = 0x2715810
fb = {len = 139928387339192,
buf = "H\263\313\066\377\177\000\000\340\370h\002\000\000\000\000\004\000\000\000\000\000\000\000\224%\027\240C\177\000\000\000\000\000\000\000\000\000\000\030\215\274\236C\177\000\000\t\000\000\000\000\000\000\000E\000\000\000\000\000\000\000\207\360\226|\000\000\000\000\313,\027\240C\177\000\000\000\000\000\000\000\000\000\000`\263\313\066\377\177\000\000\350\214\274\236C\177\000\000\350[\275\236C\177\000\000p\264\313\066\377\177\000\000`\264\313\066\377\177\000\000\a\000\000\000\000\000\000\000\000\232\017\236C\177\000\000\000\000\000\000\000\000\000\000\210\344d\002\000\000\000\000\310\324\067\240C\177\000\000]S\327\232C\177\000\000\b]\275\236C\177\000\000\360E\327\232C\177\000\000\000\000\000\000\001\000\000\000"...}
found_other_class = 96
stack_end = 0x7f439de9dee6
#2 0x00007f43a017c891 in dl_open_worker (a=a@entry=0x7fff36cbb848) at dl-open.c:228
args = 0x7fff36cbb848
file = 0x7fff36cbba80 "libnss_sss.so.2"
mode = -2147483647
call_map = <optimized out>
dst = <optimized out>
new = <optimized out>
r = <optimized out>
reloc_mode = <optimized out>
nmaps = <optimized out>
l = <optimized out>
maps = <optimized out>
relocation_in_progress = <optimized out>
any_tls = <optimized out>
first_static_tls = <optimized out>
#3 0x00007f43a01782f4 in _dl_catch_error (objname=objname@entry=0x7fff36cbb838, errstring=errstring@entry=0x7fff36cbb840,
mallocedp=mallocedp@entry=0x7fff36cbb830, operate=operate@entry=0x7f43a017c770 <dl_open_worker>, args=args@entry=0x7fff36cbb848) at dl-error.c:177
errcode = <optimized out>
old = 0x7fff36cbb930
c = {objname = 0x7f439ad75070 "", errstring = 0x0, malloced = true, env = {{__jmpbuf = {-2, -962239387103018190, 2147483649, 140734112709248,
140734112710472, 4, -962239386769571022, -874721043931610318}, __mask_was_saved = -1609093201, __saved_mask = {__val = {0, 0, 2, 0, 0,
40169120, 140734112708640, 139928402693400, 9, 69, 0, 40170024, 140734112708640, 140734112708624, 2098210535, 140734112709232}}}}}
catchp = 0x7f43a0379738
#4 0x00007f43a017c21b in _dl_open (file=0x7fff36cbba80 "libnss_sss.so.2", mode=-2147483647, caller_dlopen=<optimized out>, nsid=-2, argc=4,
argv=0x7fff36cbbf48, env=0x7fff36cbbf70) at dl-open.c:656
args = {file = 0x7fff36cbba80 "libnss_sss.so.2", mode = -2147483647, caller_dlopen = 0x7f439eccd259 <__GI___nss_lookup_function+649>,
caller_dl_open = 0x7f439ecf5962 <do_dlopen+66>, map = 0x0, nsid = 0, argc = 4, argv = 0x7fff36cbbf48, env = 0x7fff36cbbf70}
objname = 0x7f43a0172ccb <do_lookup_x+1803> "H\205\300L\213L$\030L\213\\$(D\213D$0\017\205\217\372\377\377H\213T$\020\213\n\353\204I\213\003\276\377\377\377\377H9\360\017\204\353"
errstring = 0x7fff36cbb8a0 "\020\266\211\002"
malloced = 70
errcode = <optimized out>
#5 0x00007f439ecf5962 in do_dlopen (ptr=ptr@entry=0x7fff36cbba50) at dl-libc.c:87
args = 0x7fff36cbba50
#6 0x00007f43a01782f4 in _dl_catch_error (objname=0x7fff36cbba30, errstring=0x7fff36cbba40, mallocedp=0x7fff36cbba20,
operate=0x7f439ecf5920 <do_dlopen>, args=0x7fff36cbba50) at dl-error.c:177
errcode = <optimized out>
old = 0x0
c = {objname = 0x20 <Address 0x20 out of bounds>, errstring = 0x0, malloced = 3, env = {{__jmpbuf = {40167456, -962239387168029902, 42579472,
40167504, 140734112709264, 42579504, -962239387105115342, -874721043931610318}, __mask_was_saved = 119, __saved_mask = {__val = {24,
18446603339596842481, 139925739536386, 0, 139925739536432, 0, 140734112709136, 0, 472446402651, 0, 0, 532575944823, 140734112709135,
0, 40167496, 139928403202444}}}}}
catchp = 0x7f43a0379738
#7 0x00007f439ecf5a22 in dlerror_run (args=0x7fff36cbba50, operate=0x7f439ecf5920 <do_dlopen>) at dl-libc.c:46
objname = 0x7fff36cbba80 "libnss_sss.so.2"
last_errstring = 0x0
malloced = false
result = <optimized out>
#8 __GI___libc_dlopen_mode (name=name@entry=0x7fff36cbba80 "libnss_sss.so.2", mode=mode@entry=-2147483647) at dl-libc.c:163
args = {name = 0x7fff36cbba80 "libnss_sss.so.2", mode = -2147483647, caller_dlopen = 0x7f439eccd259 <__GI___nss_lookup_function+649>,
map = 0x7f439ec4518c <__GI___libc_malloc+92>}
#9 0x00007f439eccd259 in nss_load_library (ni=0x264e820) at nsswitch.c:358
shlen = <optimized out>
saved_errno = 115
shlib_name = 0x7fff36cbba80 "libnss_sss.so.2"
ni = 0x264e820
#10 __GI___nss_lookup_function (ni=ni@entry=0x264e820, fct_name=fct_name@entry=0x7f439ed4134d "initgroups_dyn") at nsswitch.c:455
known = 0x289b610
found = <optimized out>
result = <optimized out>
#11 0x00007f439ec7f2ca in internal_getgrouplist (user=user@entry=0x262b188 "daemon", group=group@entry=2, size=size@entry=0x7fff36cbbb60,
groupsp=groupsp@entry=0x7fff36cbbb70, limit=65536) at initgroups.c:107
prev_start = 1
fct = <optimized out>
cnt = <optimized out>
status = <optimized out>
no_more = 0
start = 1
nip = 0x264e820
#12 0x00007f439ec7f5da in initgroups (user=user@entry=0x262b188 "daemon", group=2) at initgroups.c:220
size = 64
groups = 0x289b480
ngroups = <optimized out>
result = <optimized out>
limit = <optimized out>
#13 0x0000000000466211 in set_group_privs () at unixd.c:103
name = 0x262b188 "daemon"
#14 unixd_setup_child () at unixd.c:117
No locals.
#15 0x0000000000464815 in child_main (child_num_arg=child_num_arg@entry=0) at prefork.c:514
ptrans = 0x2899498
allocator = 0x2897390
status = <optimized out>
i = <optimized out>
lr = <optimized out>
pollset = 0x7fff36cbbb00
sbh = 0x1
bucket_alloc = <optimized out>
last_poll_idx = 0
#16 0x0000000000464ef4 in make_child (s=0x25ec158, slot=<optimized out>) at prefork.c:768
pid = 0
slot = 0
s = 0x25ec158
#17 0x0000000000465dfa in perform_idle_server_maintenance (p=<optimized out>) at prefork.c:903
i = <optimized out>
idle_count = <optimized out>
ws = <optimized out>
free_length = <optimized out>
free_slots = {0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 0 <repeats 16 times>}
last_non_dead = <optimized out>
total_non_dead = <optimized out>
#18 ap_mpm_run (_pconf=_pconf@entry=0x25e5138, plog=<optimized out>, s=s@entry=0x25ec158) at prefork.c:1107
status = 11
pid = {pid = -1, in = 0x7f439c4579c0 <php_post_entries+32>, out = 0x25ec158, err = 0x25e5138}
child_slot = <optimized out>
exitwhy = APR_PROC_SIGNAL
processed_status = <optimized out>
index = <optimized out>
remaining_children_to_start = 0
rv = <optimized out>
#19 0x0000000000425f3b in main (argc=4, argv=0x7fff36cbbf48) at main.c:753
c = 68 'D'
configtestonly = <optimized out>
confname = 0x7fff36cbcab2 "input.httpd.conf"
def_server_root = 0x46f348 "/environment/build/apache"
temp_error_log = <optimized out>
error = <optimized out>
process = 0x25e3218
server_conf = 0x25ec158
pglobal = <optimized out>
pconf = 0x25e5138
plog = 0x25f1198
ptemp = 0x25ef188
pcommands = 0x25e7148
opt = 0x25e7238
rv = <optimized out>
mod = <optimized out>
optarg = 0x7fff36cbcaf6 "FOREGROUND"
signal_server = <optimized out>

进程死在initgroups里面,看一下这个代码:
http://lcs.ios.ac.cn/~xuzb/canalyze/reports/httpd-2.4.4/inline_report/D1010-103-1.html

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
static int set_group_privs(void)
{
if (!geteuid()) {
const char *name;
/* Get username if passed as a uid */
if (ap_unixd_config.user_name[0] == '#') {
struct passwd *ent;
uid_t uid = atol(&ap_unixd_config.user_name[1]);
if ((ent = getpwuid(uid)) == NULL) {
ap_log_error(APLOG_MARK, APLOG_ALERT, errno, NULL, APLOGNO(02155)
"getpwuid: couldn't determine user name from uid %ld, "
"you probably need to modify the User directive",
(long)uid);
return -1;
}
name = ent->pw_name;
}
else
name = ap_unixd_config.user_name;
#if !defined(OS2)
/*
* Set the GID before initgroups(), since on some platforms
* setgid() is known to zap the group list.
*/
if (setgid(ap_unixd_config.group_id) == -1) {
ap_log_error(APLOG_MARK, APLOG_ALERT, errno, NULL, APLOGNO(02156)
"setgid: unable to set group id to Group %ld",
(long)ap_unixd_config.group_id);
return -1;
}
// in initgroups (user=user@entry=0x262b188 "daemon", group=2)
// user如果为NULL是可能导致异常的,但这里两个参数都是正确的。
if (initgroups(name, ap_unixd_config.group_id) == -1) {
ap_log_error(APLOG_MARK, APLOG_ALERT, errno, NULL, APLOGNO(02157)
"initgroups: unable to set groups for User %s "
"and Group %ld", name, (long)ap_unixd_config.group_id);
return -1;
}
#endif
}
return 0;
}

载入的so文件libnss_sss.so.2我也和正常环境对比了,两者并没有什么差别。

上面的堆栈,只有一个地方又疑点:

1
2
3
realname = 0x5 <Address 0x5 out of bounds>
l = 0x2715810
fb = {len = 139928387339192,

这里realname显然不是一个合法值,len看上去也不像,有点像踩内存。

第一反应是,作为apache进程,我们和他没有直接交集,应该不至于踩内存,网上也搜索不到死在initgroups里的样例。

疑问2:我们都是PHP或HTML代码,为何会导致Apache看上去有踩内存的现象?

重现

为了不破坏现场,我重现修改了一份conf文件,监听4433端口,再用相同命令启动,发现运行得很好。
并不能重现,所以原本的现场更不能破坏。

推理起来就有点不合逻辑了,进程启动了好几天,刚开始没事,中间某个时间点开始出故障,新启动的进程又没有这个问题?

环境变量

如果经常是人为启动的,可能会导致差异性,看了一下环境变量,并无多少差别:

1
cat /proc/父进程和子进程/environ | tr '\0' '\n'

黑盒查验

通过黑盒发现两个重要信息,其一,在2018-12-10这天,11:10分有用户登录过:

1
2
3
root root 14 1970 1 0.0 0.0 103796 3740 3740 ? Ss Nov 27 00:00:00 /usr/sbin/sshd -D
root root 15 17396 1970 0.0 0.0 148156 5688 5688 ? Ss 11:10:38 00:00:00 \_ sshd: admin@pts/0
root root 5 17399 17396 0.0 0.0 11876 2036 2036 pts/0 Ss+ 11:10:38 00:00:00 \_ -bash

而第一次出现SIGSEGV的时间是在2018-12-10 11:50:24。

两者时间相隔很近,而且此用户登出的时间在12点以后。

一个关键结论:在此用户操作期间,出现了Apache SIGSEGV。

我们再看另外一个443端口的黑盒日志,其加入了心跳监控机制,如果长时间不响应,会自动拉起,从黑盒日志看,443端口所在进程,在2018-12-10 12:02:16自动拉起过。

1
Mon Dec 10 12:02:16 CST 2018 Serivce httpd exit innormally restart by wget suspended ...

这下可以解释:

疑问1:为何443端口可以访问而4430端口无法访问?
因为443端口有监控机制,当天发现出问题后就重启了,之前我们也验证了,重新开进程是不会有错误现象的。

这下的重点放到查此用户在登陆期间做了啥?

由于没有history,不是很方便,还好能问人,两个人一个替换了PHP,一个替换了一个非关联模块。

查看dmesg发现如下日志:

1
2
3
[1102467.408054] app1[3664]: segfault at 2e77520 ip 00007fd766df9ad0 sp 00007fff2a2bd7d8 error 4 in libfoo.so[7fd766df0000+17000]
[1102469.849633] app2[10401]: segfault at 6a06 ip 0000000000006a06 sp 00007fff6cf59a70 error 14 in app2 (deleted)[400000+2e000]
^Mon Dec 10 11:21:03 CST 2018

时间上也比较吻合。

此日志说明推测,由于替换了libfoo.so,导致app1, app2进程挂掉。

由此找到关键点,是否因为替换了此文件,导致Apache进程异常?

问题验证

之前还查看过apache加载的mod,mod目录没有文件替换。
是我们自己开发的一个插件间接依赖的模块。

使用info sharedlibrary查看进程当前加载的动态库,发现的确有libfoo.so存在。
子进程内存布局是继承的父进程里老的libfoo.so的布局,而实际去取文件代码段的时候,就会映射成错误代码,导致代码或内存corrupt。

于是写了一段代码,查看都有哪些进程会出现载入动态库比进程启动时间更新:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#!/bin/bash
for pid in $(ps ax --no-headers -o pid); do
# 进程启动时间
startTime=$(date +"%s" --date="$(ps --no-headers -o lstart -p $pid)")
startTimePrint="$(date +"%F %T" --date=@$startTime)"
# 获取进程加载的so,用gdb info sharedlibrary要准确一些
for i in $(lsof -p $pid | grep '\.so' | awk '{print $NF}' | sort -k 1); do
if [ -f "$i" ]; then
# 文件更改时间
fileTime=$(stat -c "%Y" $i)
if [ $fileTime -ge $startTime ]; then
fileTimePrint="$(date +"%F %T" --date=@$fileTime)"
cmdline="$(ps --no-headers -o args -p $pid)"
echo "Process $startTimePrint < File $fileTimePrint $pid $i IN $cmdline"
fi
fi
done
done

执行后发现还有个进程也有此现象:

1
2
3
4
5
6
Process 2018-12-04 09:15:09 < File 2018-12-10 11:20:19 639 /usr/lib64/libfoo.so IN httpd -f sec_workorder/input.httpd.conf -DFOREGROUND
Process 2018-12-04 09:15:09 < File 2018-12-10 11:20:19 640 /usr/lib64/libfoo.so IN httpd -f seclib/input.httpd.conf -DFOREGROUND
Process 2018-12-04 09:15:09 < File 2018-12-10 11:20:19 652 /usr/lib64/libfoo.so IN httpd -f ngfw/input.httpd.conf -DFOREGROUND
Process 2018-12-04 09:15:09 < File 2018-12-10 11:20:19 791 /usr/lib64/libfoo.so IN httpd -f sec_workorder/input.httpd.conf -DFOREGROUND
Process 2018-12-04 09:15:09 < File 2018-12-10 11:20:19 792 /usr/lib64/libfoo.so IN httpd -f sec_workorder/input.httpd.conf -DFOREGROUND
Process 2018-12-10 01:01:44 < File 2018-12-10 11:20:19 38826 /usr/lib64/libfoo.so IN php -c php.ini index.php get_data

用wget验证,现象相同:

1
2
3
4
~ $ wget --no-check-certificate https://127.0.0.1:4488/
--2018-12-19 15:33:34-- https://127.0.0.1:4488/
Connecting to 127.0.0.1:4488... connected.
Unable to establish SSL connection.

这下可以解释:

疑问2:我们都是PHP或HTML代码,为何会导致Apache看上去有踩内存的现象?
因为libfoo.so最终间接加载进入了Apache进程,导致内存错乱,的确并非由PHP或HTML代码引起。