SOLR-18298: ZkController.OnReconnect Is Trigger Unnecessarily#4577
SOLR-18298: ZkController.OnReconnect Is Trigger Unnecessarily#4577linxiaokun528 wants to merge 2 commits into
Conversation
Curator's RECONNECTED event is different from the previous RECONNECTED event. Before Solr10, the OnReconnect is only triggered after a reconnection from a session expiration. Check the following https://github.com/apache/solr/blob/fdb5314279657f7895a90123436d834e81ea3157/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java#L165 https://github.com/apache/solr/blob/fdb5314279657f7895a90123436d834e81ea3157/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java#L199 But Curator's RECONNECTED event is triggered every time a Solr node is disconnected from a ZooKeeper instance and reconnected to another ZooKeeper instance. Therefore, currently ZkController.onReconnect is invoked every time a Solr node reconnects to a ZooKeeper instance without a session expiration, which is a huge overhead, especially when we need to rolling-restart a ZooKeeper Cluster. It can take more than 10 minutes for a small Solr cluster to level out. Similiarly, now ZkController.onDisconnect is triggered just after a disconnection from a Zookeeper instance. It should only be triggered after a session expiration.
ercsonusharma
left a comment
There was a problem hiding this comment.
Thanks for tracking this down. It's fine to keep the current direction if you want the clean removal, but flagging that the same fix is achievable with much less surface change IMO.
SolrCuratorEvent is more machinery than needed. It's an enum + abstract of() + a new EventAction type that's really just Runnable renamed. The same behavior can live directly on the existing OnReconnect/OnDisconnect callbacks and no new type.
The AtomicBoolean that makes it correct just lives in the returned closure.
FWIW, The public OnReconnect/OnDisconnect interfaces published in solrj-zookeeper, so removing them is a breaking change. The bug is purely in the detection logic which is fixable while keeping the interfaces intact.
Description
Curator's RECONNECTED event is different from the previous RECONNECTED event. Before Solr10, the OnReconnect is only triggered after a reconnection from a session expiration. Check the following
solr/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java
Line 165 in fdb5314
solr/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java
Line 199 in fdb5314
But Curator's RECONNECTED event is triggered every time a Solr node is disconnected from a ZooKeeper instance and reconnected to another ZooKeeper instance.
Therefore, currently ZkController.onReconnect is invoked every time a Solr node reconnects to a ZooKeeper instance without a session expiration, which is a huge overhead, especially when we need to rolling-restart a ZooKeeper Cluster. It can take more than 10 minutes for a small Solr cluster to level out.
Similiarly, now ZkController.onDisconnect is triggered just after a disconnection from a Zookeeper instance. It should only be triggered after a session expiration.
Solution
I created a SolrCuratorEvent class as a Bridge. Now it only has two event types: EXPIRED_RECONNECTION and SESSION_EXPIRATION.
EXPIRED_RECONNECTION: Will only be triggered when Solr reconnects Zookeeper from a session expiration
SESSION_EXPIRATION: Will be triggered when a session expiration happens (The same to Curator's LOST event.)
We should use SolrCuratorEvent instead of Curator's event and we can add more types into it whenever needed.
I also renamed some methods to make them clearer:
ZkController.onReconnect -> ZkController.onExpiredReconnection
ZkController.onDisconnect -> ZkController.onSessionExpiration
Tests
In ZkControllerTest.java,
testZkDisconnectionEventsto make sure when there is a session expiration, ZkController.onSessionExpiration is triggered.testZkReconnectionEventsto make sure when Solr reconnects to ZooKeeper from a session expiration, ZkController.onExpiredReconnection is triggered.Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.