最近版本自动化测试出现一个低概率问题,表现为设置数据rpc不返回,触发了rpc挂死保护机制,单板自动重启;记录下分析定位问题的思路。
排查日志
通过初步分析出问题时日志,初步定位如下
- 发现是在设置机框功率封顶时,rpc调用挂了20分钟没有返回;
- 触发了rpc挂死保护机制,给主进程发送了
kill -11
信号,并产生了core文件;
思路确定
看到这,基本感觉应该是死锁导致了rpc挂死,一般死锁有两种可能
- A函数使用锁后没有释放,B函数再次尝试获取该锁时,一直等待,导致rpc挂死
- 解析core文件
- 确定挂死线程
- 分析是挂在哪把锁上
- 在代码中分析该锁还在哪里被用到,分别排查
- A函数获取E锁成功后,尝试获取F锁;同时B函数获取F锁成功,尝试获取E锁;导致AB两个函数相互挂死(这种情况只出现在AB函数分别在两个线程中)
- 解析core文件
- 确定挂死线程
- 分析是挂在锁E上,及该挂死线程的调用栈中是否还获取了F锁没有释放
- 搜索该core文件的所有线程调用栈,看看是否有别的线程的调用栈栈顶是在等待锁,逐个分析
分析
思路基本已经确定,下面就开始实际分析了
- 让gdb加载对应版本的符号表,解析core文件
- 通过
bt
获取挂死线程调用栈
Thread 1 (LWP 1395):
#0 0x1f3351a4 in ?? () from /lib/libpthread.so.0
#1 0x1f32ed7c in pthread_mutex_lock () from /lib/libpthread.so.0
#2 0x2043d6a4 in CSystem::GetRedundancyState (this=0x2096a114 <g_muSmmCfgFile>) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/CSystem.cpp:10517
#3 0x203daac4 in Config::WriteCfgToFile (this=0x2096a114 <g_muSmmCfgFile>) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/Config.cpp:2291
#4 0x2044d378 in CSystem::SetShelfPowerCapping (this=0x2118e768, ulpowercapping=20000) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/CSystem.cpp:3157
#5 0x205508a4 in smmapi_set_shelfpowercapping_ipmi (ulpowercapping=20000) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/smm_api_ipmi.cpp:20430
#6 0x206052f0 in rpccall_smmapi_set_shelf_powercapping (username=0x8bac423c "root", userIP=0x8bac443c "172.190.1.31", usermode=0x8bac447c "", chassisnumber=0, input=<optimized out>,
inputlen=<optimized out>, output=<optimized out>, outputlen=<optimized out>) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/rpc_api.c:8482
#7 0x2061eea0 in all_api_in_one_1_svc (input_parameter=0x8bac4238, q=<optimized out>) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/rpc_server.c:1249
#8 0x2061d464 in rpcapiprog_1 (rqstp=0x8bac44e8, transp=0x21260218) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/rpc_svc.c:434
#9 0x1e72f1c4 in svc_getreq_common () from /lib/libc.so.6
#10 0x1e72f6a0 in svc_getreq_poll () from /lib/libc.so.6
#11 0x1e72ff1c in svc_run () from /lib/libc.so.6
#12 0x2061dc48 in rpc_main () at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/rpc_svc.c:608
#13 0x205b4e00 in rpc_init_task (ulPara1=<optimized out>, ulPara2=<optimized out>, ulPara3=<optimized out>, ulPara4=<optimized out>)
at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/ipmi_main.c:1022
#14 0x2026c938 in VOS_TaskEntry (pulArg=0x1007c) at ipmi_v_task.c:3182
#15 0x1f32c2cc in ?? () from /lib/libpthread.so.0
#16 0x1e702198 in clone () from /lib/libc.so.6
简单分析发现该线程挂死在了Config::WriteCfgToFile
函数中等待锁(记为E锁),而且成员函数CSystem::SetShelfPowerCapping
中,他首先以写的方式获取了一把读写锁(该锁用于成员变量访问及修改的互斥,记为F锁)
thread apply all bt
打印出所有线程调用栈信息,分析所有调用栈顶是在等待锁的线程,找到了如下可以对象
Thread 82 (LWP 1328):
#0 0x1f330670 in pthread_rwlock_rdlock () from /lib/libpthread.so.0
#1 0x20270fec in thread_timed_mutex_lock (id=0x2118e3a8, timeout_rw=1) at ipmi_sm_low.c:412
#2 0x20265ef8 in VOS_SmRdP (Sm_ID=555279272) at ipmi_v_sm.c:329
#3 0x203b4298 in CObItem::CRWMux::CRWMux (line=<optimized out>, ulLockType=<optimized out>, obj=..., this=<optimized out>, file=<optimized out>)
at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/COtlistDef.h:366
#4 CObItem::InvalidateData (this=0x2118e768, IDString=0x207f8f38 <SYN_SYSTEM_IDSTRING> "System Data") at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/COtlistDef.h:253
#5 0x203dac30 in Config::WriteCfgToFile (this=<optimized out>) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/Config.cpp:2370
#6 0x2077a258 in CPortconnection_Manager::set_indexnum_to_system (pSystem=<optimized out>) at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/CPortconnection_manager.cpp:249
#7 0x2077ec84 in CPortconnection_Manager::check_PortConnectionInfo_loop_task (ulPara1=555280232, ulPara2=555297112, ulPara3=<optimized out>, ulPara4=<optimized out>)
at /usr1/code/HMM_code_Build/code/current/iMM/SMM/src/IPMI/src/ipmi/CPortconnection_manager.cpp:336
#8 0x2026c938 in VOS_TaskEntry (pulArg=0x10073) at ipmi_v_task.c:3182
#9 0x1f32c2cc in ?? () from /lib/libpthread.so.0
#10 0x1e702198 in clone () from /lib/libc.so.6
该线程也有过执行Config::WriteCfgToFile
,说明它已经先获取到了E锁;在执行CObItem::CRWMux::CRWMux
希望以读的方式获取F锁时,卡住了
- 至此,问题已经基本确认清楚,剩下就是修改了
总结
- 创建或者修改线程,忌讳一个线程占用2个及以上的锁,如果无法避免就一定要分析用到的锁在别的线程的使用情况