Go Cmd 服务无法退出的小坑-清一色财经

上家公司的案例。先说下使用背景，服务在每台服务器上启动 agent, 用户会在指定机器上执行任务，并将结果返回到网页上。执行任务由用户自定义脚本，一般也都是 shell 或是 python，会不断的产生子进程，孙进程，直到执行完毕或是超时被 kill。

[[409900]]

本文转载自微信公众号「董泽润的技术笔记」，作者董泽润。转载本文请联系董泽润的技术笔记公众号。

问题

最近发现经常有任务，一直处于运行中，但实际上己经超时被 kill，并未将输出写到系统，看不到任务的执行情况

登录机器，发现执行脚本进程己经杀掉，但是有子脚本卡在某个 http 调用。再看下这个脚本，python requests 默认没有设置超时...

总结一下现象：agent 用 go cmd 启动子进程，子进程还会启动孙进程，孙进程因某种原因阻塞。此时，如果子进程因超时被 agent kill 杀掉, agent 却仍然处于 wait 状态

复现

环境是 go version go1.16.5 linux/amd64, agent 使用 exec.CommandContext 启动任务，设置 ctx 超时 30s，并将结果写到 bytes.Buffer, 最后打印。简化例子如下：

~/zerun.dong/code/gotest#catwait.go
packagemain
import(
"bytes"
"context"
"fmt"
"os/exec"
"time"
)
funcmain(){
ctx,cancelFn:=context.WithTimeout(context.Background(),time.Second*30)
defercancelFn()
cmd:=exec.CommandContext(ctx,"./sleep.sh")
varbbytes.Buffer
cmd.Stdout=&b//剧透，坑在这里
cmd.Stderr=&b
cmd.Start()
cmd.Wait()
fmt.Println("recive:",b.String())
}

这个是 sleep.sh，模拟子进程

#!/bin/sh
echo"insleep"
sh./sleep1.sh

这是 sleep1.sh 模拟孙进程，sleep 1000 阻塞在这里

#!/bin/sh
sleep1000

###现象启动测试 wait 程序，查看 ps axjf | less查看

ppidpidpgid
2468326903269032690?-1Ss00:00\_sshd:root@pts/6
32690328183281832818pts/628746Ss00:00|\_-bash
32818285312853132818pts/628746S00:00|\_strace./wait
28531285432853132818pts/628746Sl00:00||\_./wait
28543285592853132818pts/628746S00:00||\_/bin/sh/root/dongzerun/sleep.sh
28559285602853132818pts/628746S00:00||\_sh/root/dongzerun/sleep1.sh
28560285632853132818pts/628746S00:00||\_sleep1000

等过了 30s，通过 ps axjf | less 查看

2468326903269032690?-1Ss00:00\_sshd:root@pts/6
32690328183281832818pts/636192Ss00:00|\_-bash
32818285312853132818pts/636192S00:00|\_strace./wait
28531285432853132818pts/636192Sl00:00||\_./wait
1285602853132818pts/636192S00:00sh/root/dongzerun/sleep1.sh
28560285632853132818pts/636192S00:00\_sleep1000

通过上面的 case，可以看到 sleep1.sh 成了孤儿进程，被 init 1 认领，但是 28543 wait 并没有退出，那他在做什么???

分析

这个时候僵住了，祭出我们的 strace 大法，查看 wait 程序

epoll_ctl(4,EPOLL_CTL_DEL,6,{0,{u32=0,u64=0}})=0
close(6)=0
futex(0xc420054938,FUTEX_WAKE,1)=1
waitid(P_PID,28559,{si_signo=SIGCHLD,si_code=CLD_KILLED,si_pid=28559,si_status=SIGKILL,si_utime=0,si_stime=0},WEXITED|WNOWAIT,NULL)=0
卡在这里约30s
---SIGCHLD{si_signo=SIGCHLD,si_code=CLD_KILLED,si_pid=28559,si_status=SIGKILL,si_utime=0,si_stime=0}---
rt_sigreturn()=0
futex(0x9a0378,FUTEX_WAKE,1)=1
futex(0x9a02b0,FUTEX_WAKE,1)=1
wait4(28559,[{WIFSIGNALED(s)&&WTERMSIG(s)==SIGKILL}],0,{ru_utime={0,0},ru_stime={0,0},...})=28559
futex(0x9a0b78,FUTEX_WAIT,0,NULL

通过 go 源码可以看到 go exec wait 时，会先执行 waitid, 阻塞在这里，然后再来一次 wait4 等待最终退出结果

不太明白为什么两次 wait... 但是最后卡在了 futex 这里，看着像在等待什么资源???

打开 golang pprof, 再次运行程序，并 pprof

gofunc(){
err:=http.ListenAndServe(":6060",nil)
iferr!=nil{
fmt.Printf("failedtostartpprofmonitor:%s",err)
}
}()

curlhttp://127.0.0.1:6060/debug/pprof/goroutine?debug=2
goroutine1[chanreceive]:
os/exec.(*Cmd).Wait(0xc42017a000,0x7c3d40,0x0)
/usr/local/go/src/os/exec/exec.go:454+0x135
main.main()
/root/dongzerun/wait.go:32+0x167

程序没有退出，并不可思议的卡在了 exec.go:454 行代码，查看源码：

//WaitreleasesanyresourcesassociatedwiththeCmd.
func(c*Cmd)Wait()error{
......
state,err:=c.Process.Wait()
ifc.waitDone!=nil{
close(c.waitDone)
}
c.ProcessState=state
varcopyErrorerror
forrangec.goroutine{
//卡在了这里
iferr:=<-c.errch;err!=nil&&copyError==nil{
copyError=err
}
}
c.closeDescriptors(c.closeAfterWait)
......
returncopyError
}

通过源代码分析，程序 wait 卡在了 <-c.errch 获取 chan 数据。那么 errch 是如何生成的呢?

查看 cmd.Start 源码，go 将 cmd.Stdin, cmd.Stdout, cmd.Stderr 组织成 *os.File，并依次写到数组childFiles 中，这个数组索引就对应子进程的 0，1，2 文描术符，即子进程的标准输入，输出，错误

typeFfunc(*Cmd)(*os.File,error)
for_,setupFd:=range[]F{(*Cmd).stdin,(*Cmd).stdout,(*Cmd).stderr}{
fd,err:=setupFd(c)
iferr!=nil{
c.closeDescriptors(c.closeAfterStart)
c.closeDescriptors(c.closeAfterWait)
returnerr
}
c.childFiles=append(c.childFiles,fd)
}
c.childFiles=append(c.childFiles,c.ExtraFiles...)
varerrerror
c.Process,err=os.StartProcess(c.Path,c.argv(),&os.ProcAttr{
Dir:c.Dir,
Files:c.childFiles,
Env:dedupEnv(c.envv()),
Sys:c.SysProcAttr,
})

在执行 setupFd 时，会有一个关键的操作，打开 pipe 管道，封装一个匿名 func，功能就是将子进程的输出结果写到 pipe 或是将 pipe 数据写到子进程标准输入，最后关闭 pipe

这个匿名函数最终在 Start 时执行

func(c*Cmd)stdin()(f*os.File,errerror){
ifc.Stdin==nil{
f,err=os.Open(os.DevNull)
iferr!=nil{
return
}
c.closeAfterStart=append(c.closeAfterStart,f)
return
}
iff,ok:=c.Stdin.(*os.File);ok{
returnf,nil
}
pr,pw,err:=os.Pipe()
iferr!=nil{
return
}
c.closeAfterStart=append(c.closeAfterStart,pr)
c.closeAfterWait=append(c.closeAfterWait,pw)
c.goroutine=append(c.goroutine,func()error{
_,err:=io.Copy(pw,c.Stdin)
ifskip:=skipStdinCopyError;skip!=nil&&skip(err){
err=nil
}
iferr1:=pw.Close();err==nil{
err=err1
}
returnerr
})
returnpr,nil
}

重新运行测试 case，并用 lsof 查看进程打开了哪些资源

root@nb1963:~/dongzerun#psaux|grepwait
root45310.00.01221806520pts/6Sl17:240:00./wait
root47260.00.0104842144pts/6S+17:240:00grep--color=autowait
root@nb1963:~/dongzerun#
root@nb1963:~/dongzerun#psaux|grepsleep
root45430.00.04456688pts/6S17:240:00/bin/sh/root/dongzerun/sleep.sh
root45480.00.04456760pts/6S17:240:00sh/root/dongzerun/sleep1.sh
root45500.00.05928748pts/6S17:240:00sleep1000
root47840.00.0104802188pts/6S+17:240:00grep--color=autosleep
root@nb1963:~/dongzerun#
root@nb1963:~/dongzerun#lsof-p4531
COMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAME
wait4531root0wCHR1,30t01029/dev/null
wait4531root1wREG8,1943714991345/root/dongzerun/nohup.out
wait4531root2wREG8,1943714991345/root/dongzerun/nohup.out
wait4531root3uIPv620055682150t0TCP*:6060(LISTEN)
wait4531root4u00000,1009076anon_inode
wait4531root5rFIFO0,90t02005473170pipe
root@nb1963:~/dongzerun#lsof-p4543
COMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAME
sleep.sh4543root0rCHR1,30t01029/dev/null
sleep.sh4543root1wFIFO0,90t02005473170pipe
sleep.sh4543root2wFIFO0,90t02005473170pipe
sleep.sh4543root10rREG8,1554993949/root/dongzerun/sleep.sh
root@nb1963:~/dongzerun#lsof-p4550
COMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAME
sleep4550rootmemREG8,116076649179617/usr/lib/locale/locale-archive
sleep4550root0rCHR1,30t01029/dev/null
sleep4550root1wFIFO0,90t02005473170pipe
sleep4550root2wFIFO0,90t02005473170pipe

原因总结

孙进程启动后，默认会继承父进程打开的文件描述符，即 node 2005473170 的 pipe

那么当父进程被 kill -9 后会清理资源，关闭打开的文件，但是 close 只是引用计数减 1。实际上孙进程仍然打开着 pipe。回头看 agent 代码

c.goroutine=append(c.goroutine,func()error{
_,err:=io.Copy(pw,c.Stdin)
ifskip:=skipStdinCopyError;skip!=nil&&skip(err){
err=nil
}
iferr1:=pw.Close();err==nil{
err=err1
}
returnerr
})

那么当子进程执行结束后，go cmd 执行这个匿名函数的 io.Copy 来读取子进程输出数据，永远没有数据可读，也没有超时，阻塞在 copy 这里

解决方案

原因找到了，解决方法也就有了。

子进程启动孙进程时，增加 CloseOnEec 标记，但不现实，还要看孙进程的输出日志
io.Copy 改写，增加超时调用，理论上可行，但是要改源码
超时 kill, 不单杀子进程，而是杀掉进程组，此时 pipe 会被真正的关闭，触发 io.Copy 返回

最终采用方案 3，简化代码如下，主要改动点有两处：

SysProcAttr 配置 Setpgid，让子进程与孙进程，拥有独立的进程组id，即子进程的 pid

Syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL) 杀进程时指定进程组

funcRun(instancestring,envmap[string]string)bool{
var(
cmd*exec.Cmd
proc*Process
sysProcAttr*syscall.SysProcAttr
)
t:=time.Now()
sysProcAttr=&syscall.SysProcAttr{
Setpgid:true,//使子进程拥有自己的pgid，等同于子进程的pid
Credential:&syscall.Credential{
Uid:uint32(uid),
Gid:uint32(gid),
},
}
//超时控制
ctx,cancel:=context.WithTimeout(context.Background(),time.Duration(j.Timeout)*time.Second)
defercancel()
ifj.ShellMode{
cmd=exec.Command("/bin/bash","-c",j.Command)
}else{
cmd=exec.Command(j.cmd[0],j.cmd[1:]...)
}
cmd.SysProcAttr=sysProcAttr
varbbytes.Buffer
cmd.Stdout=&b
cmd.Stderr=&b
iferr:=cmd.Start();err!=nil{
j.Fail(t,instance,fmt.Sprintf("%s\n%s",b.String(),err.Error()),env)
returnfalse
}
waitChan:=make(chanstruct{},1)
deferclose(waitChan)
//超时杀掉进程组或正常退出
gofunc(){
select{
case<-ctx.Done():
log.Warnf("timeoutkilljob%s-%s%sppid:%d",j.Group,j.ID,j.Name,cmd.Process.Pid)
syscall.Kill(-cmd.Process.Pid,syscall.SIGKILL)
case<-waitChan:
}
}()
iferr:=cmd.Wait();err!=nil{
j.Fail(t,instance,fmt.Sprintf("%s\n%s",b.String(),err.Error()),env)
returnfalse
}
returntrue
}

但这种方式，也有个局限，目前只适用于类 linux 平台

小结

大家也可以看到，只要权限足够，问题稳定复现，没有查不出来的问题。套路也都差不多，回归问题开始，python request 库不写 timeout 的比比皆是 ...

©本文为清一色官方代发，观点仅代表作者本人，与清一色无关。清一色对文中陈述、观点判断保持中立，不对所包含内容的准确性、可靠性或完整性提供任何明示或暗示的保证。本文不作为投资理财建议，请读者仅作参考，并请自行承担全部责任。文中部分文字/图片/视频/音频等来源于网络，如侵犯到著作权人的权利，请与我们联系（微信/QQ:1074760229）。转载请注明出处：清一色财经