Linux内核:crash浅析tasklist_lock与进程释放
本文通过crash分析一个hang在tasklist_lock上的soft lockup,透过表象,来分析内核在有大量进程释放时的表现
栈是在tasklist_read_lock上
crash> bt
PID: 27250 TASK: ffff910d0c4c3f40 CPU: 4 COMMAND: "ZMQbg/1"
#0 [ffff91339f303d38] machine_kexec at ffffffffb8a60b2a
#1 [ffff91339f303d98] __crash_kexec at ffffffffb8b13402
#2 [ffff91339f303e68] panic at ffffffffb9107a75
#3 [ffff91339f303ee8] watchdog_timer_fn at ffffffffb8b3f5f1
#4 [ffff91339f303f20] __hrtimer_run_queues at ffffffffb8abeff6
#5 [ffff91339f303f78] hrtimer_interrupt at ffffffffb8abf58f
#6 [ffff91339f303fc0] local_apic_timer_interrupt at ffffffffb8a5782b
#7 [ffff91339f303fd8] smp_apic_timer_interrupt at ffffffffb91240a3
#8 [ffff91339f303ff0] apic_timer_interrupt at ffffffffb91207f2
--- <IRQ stack> ---
#9 [ffff9134e2a0bda8] apic_timer_interrupt at ffffffffb91207f2
[exception RIP: tasklist_read_lock+34]
RIP: ffffffffb8a8dcd2 RSP: ffff9134e2a0be50 RFLAGS: 00000206
RAX: 0000000000000050 RBX: ffffffffb8abe870 RCX: ffff910d8eb98000
RDX: ffff910d0ce76c00 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff9134e2a0be50 R8: ffff910d8eb98028 R9: a000000000000000
R10: b3a8000000000000 R11: 00011a69df917675 R12: 000ba4a5b60d3a4b
R13: 0000000000000000 R14: 0000000000000000 R15: ffff9134e2a0bda8
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff9134e2a0be58] do_wait at ffffffffb8a9703b
#11 [ffff9134e2a0bec0] sys_wait4 at ffffffffb8a98290
#12 [ffff9134e2a0bf50] system_call_fastpath at ffffffffb911f7d5
RIP: 00007f11052ef199 RSP: 00007ffffee17100 RFLAGS: 00000202
RAX: 000000000000003d RBX: 00007f11056770d8 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 00007ffffee16f94 RDI: 0000000000007895
RBP: 0000558346cefd70 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffffee16f94
R13: 0000000000007895 R14: 0000000000000001 R15: 000055834721d068
ORIG_RAX: 000000000000003d CS: 0033 SS: 002b
找到对应的代码,是在等tasklist_waiters

看到tasklist_waiters是80,所以会导致进入cpu_relax太久,进而触发soft lockup
crash> p tasklist_waiters
tasklist_waiters = $1 = {
counter = 80
}
看到大量进程都是21845起来的(由于涉及业务进程相关信息,不放截图了)

栈都在释放proc下的dentry上重复释放
crash> bt 6840
PID: 6840 TASK: ffff910d83a1bf40 CPU: 17 COMMAND: "xxx"
#0 [ffff910d03933b18] __schedule at ffffffffb91128d4
#1 [ffff910d03933ba8] __cond_resched at ffffffffb8acb9e6
#2 [ffff910d03933bc0] _cond_resched at ffffffffb91131ea
#3 [ffff910d03933bd0] dput at ffffffffb8c33261
#4 [ffff910d03933c00] proc_flush_task at ffffffffb8c95f24
#5 [ffff910d03933c78] release_task at ffffffffb8a95fff
#6 [ffff910d03933ce0] do_exit at ffffffffb8a97b0c
#7 [ffff910d03933d78] do_group_exit at ffffffffb8a97fcf
#8 [ffff910d03933da8] get_signal_to_deliver at ffffffffb8aa887e
#9 [ffff910d03933e40] do_signal at ffffffffb8a2a527
#10 [ffff910d03933f30] do_notify_resume at ffffffffb8a2ac22
#11 [ffff910d03933f50] int_signal at ffffffffb911fb1d
RIP: 00007f23194019f5 RSP: 00007f1afeff5720 RFLAGS: 00000246
RAX: fffffffffffffe00 RBX: 00007f1afeff57c0 RCX: ffffffffffffffff
RDX: 0000000000000001 RSI: 0000000000000080 RDI: 00007f1afeff57ec
RBP: 00007f1afeff57e8 R8: 00007f1afeff5700 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f115ef4e0b0
R13: 00007f115ef4e0a8 R14: 00007f115ef4e0a0 R15: 00007f115ef4e088
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
crash> bt 6844
PID: 6844 TASK: ffff910d83a1eeb0 CPU: 58 COMMAND: "xxx"
#0 [ffff910d08babb18] __schedule at ffffffffb91128d4
#1 [ffff910d08babba8] __cond_resched at ffffffffb8acb9e6
#2 [ffff910d08babbc0] _cond_resched at ffffffffb91131ea
#3 [ffff910d08babbd0] dput at ffffffffb8c33261
#4 [ffff910d08babc00] proc_flush_task at ffffffffb8c95f24
#5 [ffff910d08babc78] release_task at ffffffffb8a95fff
#6 [ffff910d08babce0] do_exit at ffffffffb8a97b0c
#7 [ffff910d08babd78] do_group_exit at ffffffffb8a97fcf
#8 [ffff910d08babda8] get_signal_to_deliver at ffffffffb8aa887e
#9 [ffff910d08babe40] do_signal at ffffffffb8a2a527
#10 [ffff910d08babf30] do_notify_resume at ffffffffb8a2ac22
#11 [ffff910d08babf50] int_signal at ffffffffb911fb1d
RIP: 00007f23194019f5 RSP: 00007f10148f5720 RFLAGS: 00000246
RAX: fffffffffffffe00 RBX: 00007f10148f57c0 RCX: ffffffffffffffff
RDX: 0000000000000001 RSI: 0000000000000080 RDI: 00007f10148f57ec
RBP: 00007f10148f57e8 R8: 00007f10148f5700 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f102fbfb530
R13: 00007f102fbfb528 R14: 00007f102fbfb520 R15: 00007f102fbfb508
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
而21845在do_wait,这里即使不分析代码,直觉也会告诉我们,children太多,根本等不过来
crash> bt 21845
PID: 21845 TASK: ffff910d3eef1fa0 CPU: 4 COMMAND: "sh"
#0 [ffff910d039dbdb8] __schedule at ffffffffb91128d4
#1 [ffff910d039dbe48] schedule at ffffffffb9112f49
#2 [ffff910d039dbe58] do_wait at ffffffffb8a97176
#3 [ffff910d039dbec0] sys_wait4 at ffffffffb8a98290
#4 [ffff910d039dbf50] system_call_fastpath at ffffffffb911f7d5
RIP: 00007f12dad4f41c RSP: 00007ffecd89efe0 RFLAGS: 00010246
RAX: 000000000000003d RBX: 0000000000000000 RCX: 00007f12dacc05e0
RDX: 0000000000000000 RSI: 00007ffecd89ee50 RDI: ffffffffffffffff
RBP: 00000000012bdf90 R8: 00000000012bdf90 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000001 R14: 00000000012bfe60 R15: 0000000000000000
ORIG_RAX: 000000000000003d CS: 0033 SS: 002b
tasklist_lock的情况一样是不容乐观:
crash> p tasklist_lock
tasklist_lock = $1 = {
raw_lock = {
cnts = {
counter = 255
},
lock = {
val = {
counter = 34144513
}
}
}
}
为什么21845会起这么多任务,需要相关业务自查
到这里对问题的定位已经完成了
后面分析一下起的这些线程的表现来解析tasklist_lock和任务释放:
先看看这个栈,从他的状态可以得知他已经进入了do_exit(), 但是在dput proc里的dentry里被调度走了?
crash> bt 6840
PID: 6840 TASK: ffff910d83a1bf40 CPU: 17 COMMAND: "xxx"
#0 [ffff910d03933b18] __schedule at ffffffffb91128d4
#1 [ffff910d03933ba8] __cond_resched at ffffffffb8acb9e6
#2 [ffff910d03933bc0] _cond_resched at ffffffffb91131ea
#3 [ffff910d03933bd0] dput at ffffffffb8c33261
#4 [ffff910d03933c00] proc_flush_task at ffffffffb8c95f24
#5 [ffff910d03933c78] release_task at ffffffffb8a95fff
#6 [ffff910d03933ce0] do_exit at ffffffffb8a97b0c
#7 [ffff910d03933d78] do_group_exit at ffffffffb8a97fcf
#8 [ffff910d03933da8] get_signal_to_deliver at ffffffffb8aa887e
#9 [ffff910d03933e40] do_signal at ffffffffb8a2a527
#10 [ffff910d03933f30] do_notify_resume at ffffffffb8a2ac22
#11 [ffff910d03933f50] int_signal at ffffffffb911fb1d
RIP: 00007f23194019f5 RSP: 00007f1afeff5720 RFLAGS: 00000246
RAX: fffffffffffffe00 RBX: 00007f1afeff57c0 RCX: ffffffffffffffff
RDX: 0000000000000001 RSI: 0000000000000080 RDI: 00007f1afeff57ec
RBP: 00007f1afeff57e8 R8: 00007f1afeff5700 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f115ef4e0b0
R13: 00007f115ef4e0a8 R14: 00007f115ef4e0a0 R15: 00007f115ef4e088
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
找下对应的代码是哪里?
/usr/src/debug/kernel-3.10.0-862.el7/linux-3.10.0-862.el7.x86_64/fs/proc/base.c: 3038
0xffffffffb8c95f1c <proc_flush_task+364>: mov %r14,%rdi
0xffffffffb8c95f1f <proc_flush_task+367>: callq 0xffffffffb8c33240 <dput>
/usr/src/debug/kernel-3.10.0-862.el7/linux-3.10.0-862.el7.x86_64/fs/proc/base.c: 3077
0xffffffffb8c95f24 <proc_flush_task+372>: mov -0x58(%rbp),%rax

可以看到,是dput(leader)没成功?
从栈上找一下dentry,看看leader是什么
crash> dis dput
0xffffffffb8c33240 <dput>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffb8c33245 <dput+5>: push %rbp
0xffffffffb8c33246 <dput+6>: test %rdi,%rdi
0xffffffffb8c33249 <dput+9>: mov %rsp,%rbp
0xffffffffb8c3324c <dput+12>: push %r14
0xffffffffb8c3324e <dput+14>: push %r13
0xffffffffb8c33250 <dput+16>: push %r12
0xffffffffb8c33252 <dput+18>: push %rbx
0xffffffffb8c33253 <dput+19>: mov %rdi,%rbx
#3 [ffff910d03933bd0] dput at ffffffffb8c33261
ffff910d03933bd8: 0000000000000000 ffff910d03933c33
ffff910d03933be8: 0000000000001ab8 ffff91346198efc0
ffff910d03933bf8: ffff910d03933c70 ffffffffb8c95f24
crash> dentry ffff91346198efc0
struct dentry {
d_flags = 16777356,
d_seq = {
sequence = 2
},
d_hash = {
next = 0x0,
pprev = 0xffffb09559b1ac20
},
d_parent = 0xffff910d7f810180,
d_name = {
{
{
hash = 892875108,
len = 5
},
hash_len = 22367711588
},
name = 0xffff91346198eff8 "21852"
crash> inode.i_nlink,__i_nlink 0xffff91351dd6b380
i_nlink = 9
__i_nlink = 9
可以看到,他的leader是21852
至于为什么要释放leader,源码的注释里就给出了答案,是tgid下的dentry:

看下leader 21852的栈:
crash> bt 21852
PID: 21852 TASK: ffff91347f28dee0 CPU: 23 COMMAND: "xxx"
#0 [ffff910d0767bbe8] __schedule at ffffffffb91128d4
#1 [ffff910d0767bc70] __cond_resched at ffffffffb8acb9e6
#2 [ffff910d0767bc88] _cond_resched at ffffffffb91131ea
#3 [ffff910d0767bc98] down_read at ffffffffb9112062
#4 [ffff910d0767bcb0] exit_signals at ffffffffb8aa8cc5
#5 [ffff910d0767bce0] do_exit at ffffffffb8a975d4
#6 [ffff910d0767bd78] do_group_exit at ffffffffb8a97fcf
#7 [ffff910d0767bda8] get_signal_to_deliver at ffffffffb8aa887e
#8 [ffff910d0767be40] do_signal at ffffffffb8a2a527
#9 [ffff910d0767bf30] do_notify_resume at ffffffffb8a2ac22
#10 [ffff910d0767bf50] int_signal at ffffffffb911fb1d
RIP: 00007f23179d4337 RSP: 00007ffd18584c68 RFLAGS: 00000206
RAX: 0000000000000000 RBX: 00007f1e942feec8 RCX: ffffffffffffffff
RDX: 0000000000000006 RSI: 000000000000555c RDI: 000000000000555c
RBP: 00007f2317d66868 R8: 00007f2324ea43d0 R9: 00007f2324ea3ac0
R10: 0000000000000008 R11: 0000000000000206 R12: 00007f1e942fef00
R13: 00007ffd18584e20 R14: 00007ffd18585108 R15: 00007ffd18585107
ORIG_RAX: 00000000000000ea CS: 0033 SS: 002b
21582已经到exit_signals了,但是parent 在sys_wait太多所以处理不过来了?
先看看这个锁是什么东西
0xffffffffb8aa8c9c <exit_signals+12>: mov %rdi,%rbx
0xffffffffb8aa8c9f <exit_signals+15>: sub $0x10,%rsp
0xffffffffb8aa8ca3 <exit_signals+19>: mov %gs:0x28,%rax
0xffffffffb8aa8cac <exit_signals+28>: mov %rax,-0x18(%rbp)
0xffffffffb8aa8cb0 <exit_signals+32>: xor %eax,%eax
0xffffffffb8aa8cb2 <exit_signals+34>: mov 0x768(%rdi),%rax
0xffffffffb8aa8cb9 <exit_signals+41>: lea 0x3f0(%rax),%rdi
0xffffffffb8aa8cc0 <exit_signals+48>: callq 0xffffffffb9112050 <down_read>
0xffffffffb9112050 <down_read>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffb9112055 <down_read+5>: push %rbp
0xffffffffb9112056 <down_read+6>: mov %rsp,%rbp
0xffffffffb9112059 <down_read+9>: push %rbx
#3 [ffff910d0767bc98] down_read at ffffffffb9112062
ffff910d0767bca0: ffff91347f28dee0 ffff910d0767bcd8
ffff910d0767bcb0: ffffffffb8aa8cc5
crash> eval 0xffff91347f28dee0 + 0x768
hexadecimal: ffff91347f28e648
decimal: 18446622253390554696 (-121820318996920)
octal: 1777774423217712163110
binary: 1111111111111111100100010011010001111111001010001110011001001000
crash> rd ffff91347f28e648
ffff91347f28e648: ffff9134f8bb1680 ....4...
crash> eval 0xffff9134f8bb1680 + 0x3f0
hexadecimal: ffff9134f8bb1a70
decimal: 18446622255430179440 (-121818279372176)
octal: 1777774423237056615160
binary: 1111111111111111100100010011010011111000101110110001101001110000
crash> rw_semaphore ffff9134f8bb1a70
struct rw_semaphore {
{
count = {
counter = 0
},
__UNIQUE_ID_rh_kabi_hide4 = {
count = 0
},
{<No data fields>}
},
wait_lock = {
raw_lock = {
val = {
counter = 0
}
}
},
osq = {
tail = {
counter = 0
}
},
wait_list = {
next = 0xffff9134f8bb1a80
},
owner = 0x1
}
看看有多少人在等这个锁
crash> list -l rwsem_waiter.list -s rwsem_waiter -h 0xffff9134f8bb1a80
ffff9134f8bb1a80
struct rwsem_waiter {
list = {
next = 0xffff9134f8bb1a80,
prev = 0x1
},
task = 0x0,
type = RWSEM_WAITING_FOR_READ
}
并没有人在等这个rwsem!

crash> task -R signal 21852
PID: 21852 TASK: ffff91347f28dee0 CPU: 23 COMMAND: "xxx"
signal = 0xffff9134f8bb1680,
crash> signal_struct.group_rwsem 0xffff9134f8bb1680
group_rwsem = {
{
count = {
counter = 0
},
__UNIQUE_ID_rh_kabi_hide1 = {
count = 0
},
{<No data fields>}
},
wait_lock = {
raw_lock = {
val = {
counter = 0
}
}
},
osq = {
tail = {
counter = 0
}
},
wait_list = {
next = 0xffff9134f8bb1a80
},
owner = 0x1
}
确认和task_struct里也是一个锁
从代码里,我们知道down_read是might_sleep的:

所以他这里只是单纯的被调度走了,因为进程太多。
同时注意到,exit_signals函数后面也是要拿tasklist_lock的,他必拿不到,所以即使调度回来也没用,还是会导致前面的6840进程释放不了dentry。

所以进程太多,都在退出,且这些进程是有关系的(在一个group里)情况下,释放的时候是有依赖的,会存在忙不过而无法成功释放资源(释放特别慢)的现象。
最后分析一下他们共同的进程parent 21845在sys_wait4的问题:
crash> bt 21845
PID: 21845 TASK: ffff910d3eef1fa0 CPU: 4 COMMAND: "sh"
#0 [ffff910d039dbdb8] __schedule at ffffffffb91128d4
#1 [ffff910d039dbe48] schedule at ffffffffb9112f49
#2 [ffff910d039dbe58] do_wait at ffffffffb8a97176
#3 [ffff910d039dbec0] sys_wait4 at ffffffffb8a98290
#4 [ffff910d039dbf50] system_call_fastpath at ffffffffb911f7d5
RIP: 00007f12dad4f41c RSP: 00007ffecd89efe0 RFLAGS: 00010246
RAX: 000000000000003d RBX: 0000000000000000 RCX: 00007f12dacc05e0
RDX: 0000000000000000 RSI: 00007ffecd89ee50 RDI: ffffffffffffffff
RBP: 00000000012bdf90 R8: 00000000012bdf90 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000001 R14: 00000000012bfe60 R15: 0000000000000000
ORIG_RAX: 000000000000003d CS: 0033 SS: 002b
/usr/src/debug/kernel-3.10.0-862.el7/linux-3.10.0-862.el7.x86_64/kernel/exit.c: 1569
0xffffffffb8a97171 <do_wait+497>: callq 0xffffffffb9112f20 <schedule>
/usr/src/debug/kernel-3.10.0-862.el7/linux-3.10.0-862.el7.x86_64/kernel/exit.c: 1570
0xffffffffb8a97176 <do_wait+502>: jmpq 0xffffffffb8a96ff3 <do_wait+115>

逻辑也比较简单,没事干的时候就调度走,回来就是repeat逻辑回收children,但是由于回收的时候也需要tasklist_lock,所以并不能回收成功。
所以到这里总结一下,说白了就是,进程太多了,需要回收的也太多了,但是得排队等tasklist_lock锁,导致回收不过来了。然后这时候如果有新的进程起来,情况就更加恶化,甚至如果打到内存水位线,导致都去回收内存了,那就是死局了。