When some nodes of cluster face with TCP connection fault, ocfs2 will
pick up a quorum to continue to work and other nodes will be fenced by
resetting host.
In order to decide which node should be fenced, ocfs2 leverages
o2quo_state::qs_holds. If that variable is reduced to zero, then a try
to decide if fence local node is performed. However, under a specific
scenario that local node is not disconnected from others at the same
time, above method has a problem to reduce ::qs_holds to zero.
Because, o2net 90s idle timer corresponding to different nodes is
triggered one after another.
node 2 node 3
90s idle timer elapses
clear ::qs_conn_bm
set hold
40s is passed
90 idle timer elapses
clear ::qs_conn_bm
set hold
still up timer elapses
clear hold (NOT to zero )
90s idle timer elapses AGAIN
still up timer elapses.
clear hold
still up timer elapses
To solve this issue, a node which has already be evicted from
::qs_conn_bm can't set hold again and again invoked from idle timer.
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F3F93B@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Changwei Ge <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
node, qs->qs_connected);
clear_bit(node, qs->qs_conn_bm);
+
+ if (test_bit(node, qs->qs_hb_bm))
+ o2quo_set_hold(qs, node);
}
mlog(0, "node %u, %d total\n", node, qs->qs_connected);
- if (test_bit(node, qs->qs_hb_bm))
- o2quo_set_hold(qs, node);
spin_unlock(&qs->qs_lock);
}