Skip to content

SOLR-18298: ZkController.OnReconnect Is Trigger Unnecessarily#4577

Open
linxiaokun528 wants to merge 2 commits into
apache:mainfrom
linxiaokun528:SOLR-18298
Open

SOLR-18298: ZkController.OnReconnect Is Trigger Unnecessarily#4577
linxiaokun528 wants to merge 2 commits into
apache:mainfrom
linxiaokun528:SOLR-18298

Conversation

@linxiaokun528

@linxiaokun528 linxiaokun528 commented Jul 1, 2026

Copy link
Copy Markdown

Description

Curator's RECONNECTED event is different from the previous RECONNECTED event. Before Solr10, the OnReconnect is only triggered after a reconnection from a session expiration. Check the following

"Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...");

But Curator's RECONNECTED event is triggered every time a Solr node is disconnected from a ZooKeeper instance and reconnected to another ZooKeeper instance.

Therefore, currently ZkController.onReconnect is invoked every time a Solr node reconnects to a ZooKeeper instance without a session expiration, which is a huge overhead, especially when we need to rolling-restart a ZooKeeper Cluster. It can take more than 10 minutes for a small Solr cluster to level out.

Similiarly, now ZkController.onDisconnect is triggered just after a disconnection from a Zookeeper instance. It should only be triggered after a session expiration.

Solution

I created a SolrCuratorEvent class as a Bridge. Now it only has two event types: EXPIRED_RECONNECTION and SESSION_EXPIRATION.
EXPIRED_RECONNECTION: Will only be triggered when Solr reconnects Zookeeper from a session expiration
SESSION_EXPIRATION: Will be triggered when a session expiration happens (The same to Curator's LOST event.)

We should use SolrCuratorEvent instead of Curator's event and we can add more types into it whenever needed.

I also renamed some methods to make them clearer:
ZkController.onReconnect -> ZkController.onExpiredReconnection
ZkController.onDisconnect -> ZkController.onSessionExpiration

Tests

In ZkControllerTest.java,

  1. I added a method testZkDisconnectionEvents to make sure when there is a session expiration, ZkController.onSessionExpiration is triggered.
  2. I added a method testZkReconnectionEvents to make sure when Solr reconnects to ZooKeeper from a session expiration, ZkController.onExpiredReconnection is triggered.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide (This change restores the original behavior (a performance fix), and does not introduce any new user-facing features or configuration changes. Therefore, I believe no Reference Guide update is required.)
  • I have added a changelog entry for my change

Curator's RECONNECTED event is different from the previous RECONNECTED event. Before Solr10, the OnReconnect is only triggered after a reconnection from a session expiration. Check the following

https://github.com/apache/solr/blob/fdb5314279657f7895a90123436d834e81ea3157/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java#L165
https://github.com/apache/solr/blob/fdb5314279657f7895a90123436d834e81ea3157/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java#L199

But Curator's RECONNECTED event is triggered every time a Solr node is disconnected from a ZooKeeper instance and reconnected to another ZooKeeper instance.

Therefore, currently ZkController.onReconnect is invoked every time a Solr node reconnects to a ZooKeeper instance without a session expiration, which is a huge overhead, especially when we need to rolling-restart a ZooKeeper Cluster. It can take more than 10 minutes for a small Solr cluster to level out.

Similiarly, now ZkController.onDisconnect is triggered just after a disconnection from a Zookeeper instance. It should only be triggered after a session expiration.

@ercsonusharma ercsonusharma left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down. It's fine to keep the current direction if you want the clean removal, but flagging that the same fix is achievable with much less surface change IMO.

SolrCuratorEvent is more machinery than needed. It's an enum + abstract of() + a new EventAction type that's really just Runnable renamed. The same behavior can live directly on the existing OnReconnect/OnDisconnect callbacks and no new type.
The AtomicBoolean that makes it correct just lives in the returned closure.

FWIW, The public OnReconnect/OnDisconnect interfaces published in solrj-zookeeper, so removing them is a breaking change. The bug is purely in the detection logic which is fixable while keeping the interfaces intact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants