Details
Description
One node is failed over automatically, due to what looks like some kernel or driver bug. That's OK. But then right after that, a 2nd node in the cluster which otherwise looks healthy has CPU spike very high (seen in the AWS administration console), becomes unresponsive, and has to be rebooted.
This happened more than once. On a previous time, a process logger caught beam.smp at 100% CPU, AND memcached nearly 100%.
Logs will be provided in a private comment.
On the latest one, the first failing node has dmesg ouput like this:
[302338.512230] INFO: task xfssyncd:8805 blocked for more than 120 seconds.
[302338.512241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[302338.512246] xfssyncd D ffffffff80547400 0 8805 2 0x00000000
[302338.512252] ffff880444055be0 0000000000000246 ffff880443d1bb60 ffff880444055b60
[302338.512259] ffff8804423677e8 ffff880444055ba8 000000000000a380 ffff880443becb28
[302338.512265] 000000000000a380 ffff880444055b90 000000000000a380 000000000000a380
[302338.512271] Call Trace:
[302338.512280] [<ffffffff80535a45>] schedule_timeout+0x175/0x1d0
[302338.512286] [<ffffffff8020b865>] ? __bio_add_page+0x205/0x270
[302338.512291] [<ffffffff80534c59>] wait_for_common+0xf9/0x210
[302338.512297] [<ffffffff80130180>] ? default_wake_function+0x0/0x10
[302338.512301] [<ffffffff80534e08>] wait_for_completion+0x18/0x20
[302338.512306] [<ffffffff80348a89>] xfs_reclaim_inode+0x139/0x1a0
[302338.512310] [<ffffffff803499a3>] xfs_reclaim_inode_now+0x93/0xa0
[302338.512314] [<ffffffff80349910>] ? xfs_reclaim_inode_now+0x0/0xa0
[302338.512318] [<ffffffff80349502>] xfs_inode_ag_walk+0x72/0xd0
[302338.512322] [<ffffffff80349910>] ? xfs_reclaim_inode_now+0x0/0xa0
[302338.512326] [<ffffffff803495c7>] xfs_inode_ag_iterator+0x67/0xa0
[302338.512330] [<ffffffff80349614>] xfs_reclaim_inodes+0x14/0x20
[302338.512334] [<ffffffff80349679>] xfs_sync_worker+0x59/0x90
[302338.512337] [<ffffffff80348d07>] xfssyncd+0x177/0x220
[302338.512341] [<ffffffff80348b90>] ? xfssyncd+0x0/0x220
[302338.512347] [<ffffffff80154ca6>] kthread+0xa6/0xb0
[302338.512353] [<ffffffff8010a57a>] child_rip+0xa/0x20
[302338.512358] [<ffffffff80154c00>] ? kthread+0x0/0xb0
[302338.512362] [<ffffffff8010a570>] ? child_rip+0x0/0x20
[302338.512368] INFO: task beam.smp:9594 blocked for more than 120 seconds.
[302338.512372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[302338.512376] beam.smp D ffffffff801a4ab0 0 9594 1 0x00000000
[302338.512381] ffff88044389fc78 0000000000000286 0000000000000001 ffff88044389fbf8
[302338.512388] ffff88044458e790 ffff88044389fc40 000000000000a380 ffff88044389c3e8
[302338.512394] 000000000000a380 ffff88044389fc28 000000000000a380 000000000000a380
[302338.512399] Call Trace:
[302338.512409] [<ffffffffa006a616>] ? dm_get_table+0x46/0x60 [dm_mod]
[302338.512417] [<ffffffffa006a79b>] ? dm_unplug_all+0x4b/0x70 [dm_mod]
[302338.512422] [<ffffffff801a4ab0>] ? sync_page+0x0/0x50
[302338.512426] [<ffffffff805357d8>] io_schedule+0x28/0x40
[302338.512430] [<ffffffff801a4aed>] sync_page+0x3d/0x50
[302338.512433] [<ffffffff80535c97>] __wait_on_bit+0x57/0x80
[302338.512437] [<ffffffff801a4c5e>] wait_on_page_bit+0x6e/0x80
[302338.512441] [<ffffffff80155290>] ? wake_bit_function+0x0/0x40
[302338.512446] [<ffffffff801ae770>] ? pagevec_lookup_tag+0x20/0x30
[302338.512450] [<ffffffff801a5055>] wait_on_page_writeback_range+0xf5/0x190
[302338.512454] [<ffffffff801a5117>] filemap_fdatawait+0x27/0x30
[302338.512460] [<ffffffff803417cd>] xfs_wait_on_pages+0x2d/0x40
[302338.512465] [<ffffffff80339b76>] xfs_fsync+0x46/0x1b0
[302338.512468] [<ffffffff801a5173>] ? __filemap_fdatawrite_range+0x53/0x60
[302338.512473] [<ffffffff8034160c>] xfs_file_fsync+0x6c/0x80
[302338.512479] [<ffffffff80204a66>] vfs_fsync+0x86/0xf0
[302338.512482] [<ffffffff80204b09>] do_fsync+0x39/0x60
[302338.512487] [<ffffffff80204b5b>] sys_fsync+0xb/0x10
[302338.512491] [<ffffffff801099a8>] system_call_fastpath+0x16/0x1b
[302338.512496] [<ffffffff80109940>] ? system_call+0x0/0x52
[302338.512502] INFO: task memcached:9869 blocked for more than 120 seconds.
[302338.512505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[302338.512509] memcached D ffffffff80547400 0 9869 9600 0x00000000
[302338.512515] ffff88044431bc78 0000000000000282 0000000000000001 ffff88044431bbf8
[302338.512521] ffff880443f41df0 ffff88044431bc40 000000000000a380 ffff880443de07e8
[302338.512528] 000000000000a380 ffff88044431bc28 000000000000a380 000000000000a380
[302338.512534] Call Trace:
[302338.512540] [<ffffffffa006a616>] ? dm_get_table+0x46/0x60 [dm_mod]
[302338.512548] [<ffffffffa006a79b>] ? dm_unplug_all+0x4b/0x70 [dm_mod]
[302338.512552] [<ffffffff801a4ab0>] ? sync_page+0x0/0x50
[302338.512556] [<ffffffff805357d8>] io_schedule+0x28/0x40
[302338.512560] [<ffffffff801a4aed>] sync_page+0x3d/0x50
[302338.512563] [<ffffffff80535c97>] __wait_on_bit+0x57/0x80
[302338.512568] [<ffffffff801a4c5e>] wait_on_page_bit+0x6e/0x80
[302338.512573] [<ffffffff80155290>] ? wake_bit_function+0x0/0x40
[302338.512577] [<ffffffff801ae770>] ? pagevec_lookup_tag+0x20/0x30
[302338.512581] [<ffffffff801a5055>] wait_on_page_writeback_range+0xf5/0x190
[302338.512585] [<ffffffff801a5117>] filemap_fdatawait+0x27/0x30
[302338.512590] [<ffffffff803417cd>] xfs_wait_on_pages+0x2d/0x40
[302338.512594] [<ffffffff80339b76>] xfs_fsync+0x46/0x1b0
[302338.512597] [<ffffffff801a5173>] ? __filemap_fdatawrite_range+0x53/0x60
[302338.512603] [<ffffffff8034160c>] xfs_file_fsync+0x6c/0x80
[302338.512607] [<ffffffff80204a66>] vfs_fsync+0x86/0xf0
[302338.512610] [<ffffffff80204b09>] do_fsync+0x39/0x60
[302338.512614] [<ffffffff80204b3e>] sys_fdatasync+0xe/0x20
[302338.512618] [<ffffffff801099a8>] system_call_fastpath+0x16/0x1b
[302338.512622] [<ffffffff80109940>] ? system_call+0x0/0x52
This happened more than once. On a previous time, a process logger caught beam.smp at 100% CPU, AND memcached nearly 100%.
Logs will be provided in a private comment.
On the latest one, the first failing node has dmesg ouput like this:
[302338.512230] INFO: task xfssyncd:8805 blocked for more than 120 seconds.
[302338.512241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[302338.512246] xfssyncd D ffffffff80547400 0 8805 2 0x00000000
[302338.512252] ffff880444055be0 0000000000000246 ffff880443d1bb60 ffff880444055b60
[302338.512259] ffff8804423677e8 ffff880444055ba8 000000000000a380 ffff880443becb28
[302338.512265] 000000000000a380 ffff880444055b90 000000000000a380 000000000000a380
[302338.512271] Call Trace:
[302338.512280] [<ffffffff80535a45>] schedule_timeout+0x175/0x1d0
[302338.512286] [<ffffffff8020b865>] ? __bio_add_page+0x205/0x270
[302338.512291] [<ffffffff80534c59>] wait_for_common+0xf9/0x210
[302338.512297] [<ffffffff80130180>] ? default_wake_function+0x0/0x10
[302338.512301] [<ffffffff80534e08>] wait_for_completion+0x18/0x20
[302338.512306] [<ffffffff80348a89>] xfs_reclaim_inode+0x139/0x1a0
[302338.512310] [<ffffffff803499a3>] xfs_reclaim_inode_now+0x93/0xa0
[302338.512314] [<ffffffff80349910>] ? xfs_reclaim_inode_now+0x0/0xa0
[302338.512318] [<ffffffff80349502>] xfs_inode_ag_walk+0x72/0xd0
[302338.512322] [<ffffffff80349910>] ? xfs_reclaim_inode_now+0x0/0xa0
[302338.512326] [<ffffffff803495c7>] xfs_inode_ag_iterator+0x67/0xa0
[302338.512330] [<ffffffff80349614>] xfs_reclaim_inodes+0x14/0x20
[302338.512334] [<ffffffff80349679>] xfs_sync_worker+0x59/0x90
[302338.512337] [<ffffffff80348d07>] xfssyncd+0x177/0x220
[302338.512341] [<ffffffff80348b90>] ? xfssyncd+0x0/0x220
[302338.512347] [<ffffffff80154ca6>] kthread+0xa6/0xb0
[302338.512353] [<ffffffff8010a57a>] child_rip+0xa/0x20
[302338.512358] [<ffffffff80154c00>] ? kthread+0x0/0xb0
[302338.512362] [<ffffffff8010a570>] ? child_rip+0x0/0x20
[302338.512368] INFO: task beam.smp:9594 blocked for more than 120 seconds.
[302338.512372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[302338.512376] beam.smp D ffffffff801a4ab0 0 9594 1 0x00000000
[302338.512381] ffff88044389fc78 0000000000000286 0000000000000001 ffff88044389fbf8
[302338.512388] ffff88044458e790 ffff88044389fc40 000000000000a380 ffff88044389c3e8
[302338.512394] 000000000000a380 ffff88044389fc28 000000000000a380 000000000000a380
[302338.512399] Call Trace:
[302338.512409] [<ffffffffa006a616>] ? dm_get_table+0x46/0x60 [dm_mod]
[302338.512417] [<ffffffffa006a79b>] ? dm_unplug_all+0x4b/0x70 [dm_mod]
[302338.512422] [<ffffffff801a4ab0>] ? sync_page+0x0/0x50
[302338.512426] [<ffffffff805357d8>] io_schedule+0x28/0x40
[302338.512430] [<ffffffff801a4aed>] sync_page+0x3d/0x50
[302338.512433] [<ffffffff80535c97>] __wait_on_bit+0x57/0x80
[302338.512437] [<ffffffff801a4c5e>] wait_on_page_bit+0x6e/0x80
[302338.512441] [<ffffffff80155290>] ? wake_bit_function+0x0/0x40
[302338.512446] [<ffffffff801ae770>] ? pagevec_lookup_tag+0x20/0x30
[302338.512450] [<ffffffff801a5055>] wait_on_page_writeback_range+0xf5/0x190
[302338.512454] [<ffffffff801a5117>] filemap_fdatawait+0x27/0x30
[302338.512460] [<ffffffff803417cd>] xfs_wait_on_pages+0x2d/0x40
[302338.512465] [<ffffffff80339b76>] xfs_fsync+0x46/0x1b0
[302338.512468] [<ffffffff801a5173>] ? __filemap_fdatawrite_range+0x53/0x60
[302338.512473] [<ffffffff8034160c>] xfs_file_fsync+0x6c/0x80
[302338.512479] [<ffffffff80204a66>] vfs_fsync+0x86/0xf0
[302338.512482] [<ffffffff80204b09>] do_fsync+0x39/0x60
[302338.512487] [<ffffffff80204b5b>] sys_fsync+0xb/0x10
[302338.512491] [<ffffffff801099a8>] system_call_fastpath+0x16/0x1b
[302338.512496] [<ffffffff80109940>] ? system_call+0x0/0x52
[302338.512502] INFO: task memcached:9869 blocked for more than 120 seconds.
[302338.512505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[302338.512509] memcached D ffffffff80547400 0 9869 9600 0x00000000
[302338.512515] ffff88044431bc78 0000000000000282 0000000000000001 ffff88044431bbf8
[302338.512521] ffff880443f41df0 ffff88044431bc40 000000000000a380 ffff880443de07e8
[302338.512528] 000000000000a380 ffff88044431bc28 000000000000a380 000000000000a380
[302338.512534] Call Trace:
[302338.512540] [<ffffffffa006a616>] ? dm_get_table+0x46/0x60 [dm_mod]
[302338.512548] [<ffffffffa006a79b>] ? dm_unplug_all+0x4b/0x70 [dm_mod]
[302338.512552] [<ffffffff801a4ab0>] ? sync_page+0x0/0x50
[302338.512556] [<ffffffff805357d8>] io_schedule+0x28/0x40
[302338.512560] [<ffffffff801a4aed>] sync_page+0x3d/0x50
[302338.512563] [<ffffffff80535c97>] __wait_on_bit+0x57/0x80
[302338.512568] [<ffffffff801a4c5e>] wait_on_page_bit+0x6e/0x80
[302338.512573] [<ffffffff80155290>] ? wake_bit_function+0x0/0x40
[302338.512577] [<ffffffff801ae770>] ? pagevec_lookup_tag+0x20/0x30
[302338.512581] [<ffffffff801a5055>] wait_on_page_writeback_range+0xf5/0x190
[302338.512585] [<ffffffff801a5117>] filemap_fdatawait+0x27/0x30
[302338.512590] [<ffffffff803417cd>] xfs_wait_on_pages+0x2d/0x40
[302338.512594] [<ffffffff80339b76>] xfs_fsync+0x46/0x1b0
[302338.512597] [<ffffffff801a5173>] ? __filemap_fdatawrite_range+0x53/0x60
[302338.512603] [<ffffffff8034160c>] xfs_file_fsync+0x6c/0x80
[302338.512607] [<ffffffff80204a66>] vfs_fsync+0x86/0xf0
[302338.512610] [<ffffffff80204b09>] do_fsync+0x39/0x60
[302338.512614] [<ffffffff80204b3e>] sys_fdatasync+0xe/0x20
[302338.512618] [<ffffffff801099a8>] system_call_fastpath+0x16/0x1b
[302338.512622] [<ffffffff80109940>] ? system_call+0x0/0x52