Hi,All:

很抱歉,昨天(3.28)晚上8点多clan以及相关业务上线,上线测试回归,一切正常。晚上10点过几分,用户反馈系统出错,排查之后发现线上两台clan均挂掉。10:10:34重启暂时恢复线上业务。

查询日志发现,在2017-03-28 21:59:54.793的时候,dubbo provider不能注册服务。此时,tomcat的进程已经down掉了。后续的日志都是一些不能注册或者不能订阅的错误,甚至是调用方法时service没有实例化的错误,应该属于tomcat down掉后,dubbo的进程还没有shutdown(从时间上可以看出,具体看后面的两段日志[1])。

后续从日志中,早于服务挂掉的时间,发现有dubbo线程池耗尽的警告。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
[2017-03-23 17:09:35.432] clan WARN   [DUBBO] An exception was thrown by a user handler while handling an exception event ([id: 0xa9483a99, /10.160.37.164:36657 => /10.25.0.15:31100] EXCEPTION: com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process received event .), dubbo version: 2.8.4, current host: 10.25.0.15 [New I/O worker #8] org.jboss.netty.channel.DefaultChannelPipeline.warn(62)
com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process caught event .
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:67)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractPeer.caught(AbstractPeer.java:127)
at com.alibaba.dubbo.remoting.transport.netty.NettyHandler.exceptionCaught(NettyHandler.java:112)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.exceptionCaught(NettyCodecAdapter.java:165)
at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:48)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.messageReceived(NettyCodecAdapter.java:148)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-10.25.0.15:31100, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200), Task: 4783 (completed: 4583), Executor status:(isShutdown:false, isTerminated:false, isTerminating:false), in dubbo://10.25.0.15:31100!
at com.alibaba.dubbo.common.threadpool.support.AbortPolicyWithReport.rejectedExecution(AbortPolicyWithReport.java:53)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:65)
... 19 common frames omitted

线程池耗尽,内存溢出,错误也很明显。

1
2
3
4
23-Mar-2017 18:56:31.121 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [clan] appears to have started a thread named [DubboResponseTimeoutScanTimer] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
java.lang.Thread.sleep(Native Method)
com.alibaba.dubbo.remoting.exchange.support.DefaultFuture$RemotingInvocationTimeoutScan.run(DefaultFuture.java:300)
java.lang.Thread.run(Thread.java:745)

查询资料,通过jstat查询GC,因为重启之后服务正常,短时间未能从YGC和FGC中获取到与错误相关的信息。从dubbo的监控可以得知,调用clan最大的并发是135左右,并没有达到200的缺省配置。随后,空空笑和小胡子均提出了一个疑问,dubbo中调用其他业务的超时配置为何是1小时?

dubbo缺省配置超时1000ms,也就是说,线程池中的每个线程持有时间最大是1s*3(3次重试),之后应该释放该线程占有的资源。但是clan配置了1h,那么,线程持有时间最小应该有1h,尽管线程的事情做完了,但依旧占用资源。所以,请求量增长的时候,线程池中的持有1h的这线线程越来越多,最后将整个资源占用。虽然有两台服务,但是遭遇了同样的情况,负载压力都很大,甚至连雪崩都没来得及已经挂了(查看第二段日志时间,shutdown hook的时间差约为26.5s[2])。

上面的相关信息仅从日志中分析得知,暂时没有去验证,目前解决的方法是配置clan中调用其他业务超时时间为缺省值,即1s。本着解决问题的态度写了此报告,上述分析必有错误,恳请大家给我提出意见。关于dubbo线程池耗尽的异常,后面有空我在本地环境验证一番。

最后,请大家以此事故为鉴,做好项目参数的配置。

[附录1]

  • ① duubo run shutdown【开始关闭dubbo进程】

首先关闭所有的注册服务。

1
2
[2017-03-28 21:59:54.451] clan INFO   [DUBBO] Run shutdown hook now., dubbo version: 2.8.4, current host: 10.25.0.15 [DubboShutdownHook] com.alibaba.dubbo.config.AbstractConfig.info(42)
[2017-03-28 21:59:54.452] clan INFO [DUBBO] Close all registries [zookeeper://10.175.206.177:2182/com.alibaba.dubbo.registry.RegistryService?application=clan&backup=10.117.18.169:2182,10.160.4.48:2182&client=curator&dubbo=2.8.4&interface=com.alibaba.dubbo.registry.RegistryService&organization=dongjia&owner=dongjia&pid=14956&timestamp=1490703425648], dubbo version: 2.8.4, current host: 10.25.0.15 [DubboShutdownHook] com.alibaba.dubbo.registry.support.AbstractRegistryFactory.info(42)
  • ② shutdown worker【最终关闭dubbo进程】
1
2
3
4
5
6
7
8
9
10
11
[2017-03-28 21:59:54.853] clan WARN   [DUBBO] An exception was thrown by an exception handler., dubbo version: 2.8.4, current host: 10.25.0.15 [DubboClientReconnectTimer-thread-1] org.jboss.netty.channel.DefaultChannelPipeline.warn(62)
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:115)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:73)
at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:57)
at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
at org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
at org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496)
at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:46)
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54)

[附录2]

  • ① 先挂掉的服务
1
[2017-03-28 21:59:54.451] clan INFO   [DUBBO] Run shutdown hook now., dubbo version: 2.8.4, current host: 10.25.0.15 [DubboShutdownHook] com.alibaba.dubbo.config.AbstractConfig.info(42)

○ shutdown hook的时间是 2017-03-28 21:59:54.451。

  • ②后挂掉的服务
1
[2017-03-28 22:00:21.146] clan INFO   [DUBBO] Run shutdown hook now., dubbo version: 2.8.4, current host: 10.252.111.151 [DubboShutdownHook] com.alibaba.dubbo.config.AbstractConfig.info(42)

○ shutdown hook的时间是 2017-03-28 22:00:21.146。