Failover when zookeeper instance is down #3

@dumityty

This is more of a general question about failover using solarium cloud connecting to a few zookeeper instances.

I have configured 3 Zookeeper servers with 3 shards and 3 replicas, all working ok and able to connect to them.

My solarium cloud config is the following:

[
  'zkhosts' => 'HOST1:2181,HOST2:2181,HOST3:2181',
  'defaultcollection' => 'COLLECTION_NAME',
]

I am able to use solarium cloud and connect ok, perform queries, etc.

After finally finishing the configuration and connecting, I decided to test what would happen if one of my instances were to actually go down - the reason for using SolrCloud in the first place.

I have tried the following scenarios: stopping the server altogether, stopping ZooKeeper on the server, and stopping Solr on the server while leaving ZooKeeper running.

I came to the following conclusions:

  1. If the server itself is completely down and solarium cloud happens to choose that host to direct the query to, then I get an "operation timeout" exception - I assume because port 2181 is not reachable at all, so the timeout limit kicks in.

  2. If I stop ZooKeeper on the server and solarium cloud sends the request to that host, then I get "connection loss" - I assume because the server is reachable but the service on port 2181 is not running, so the connection is never established?

  3. If ZooKeeper is running but I stop Solr on the server, then everything works fine - if solarium cloud sends a request to that host, ZooKeeper figures out that Solr is down and directs the query to another instance which is up.

My question is whether it's actually possible to get it to fail over correctly to the live instances in the first two scenarios. Or am I approaching this the wrong way? Or is it meant to behave that way, and I should have failover at a different step?

Would a workable solution be to put a load balancer in front of the 3 ZooKeeper instances, with a health check on port 2181, so that if one of the ZooKeepers is not answering, no requests are directed to it?
In that case my "zkhosts" would be "load_balancer_host:2181".
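
A client-side alternative to the load balancer might be to probe each host before building the config and pass on only the ones that answer. Below is a minimal sketch of that idea, assuming a plain TCP connect on the ZooKeeper port is a good enough health signal. `zkPortIsOpen`, `filterLiveZkHosts` and `buildCloudConfig` are hypothetical helpers written for illustration; they are not part of solarium cloud.

```php
<?php
declare(strict_types=1);

/**
 * Return true if a TCP connection to $host:$port succeeds within $timeout seconds.
 * This only shows that something accepts connections on the port; it does not
 * prove that ZooKeeper itself is healthy.
 */
function zkPortIsOpen(string $host, int $port, float $timeout): bool
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

/**
 * Keep only the entries of a "host:port,host:port" string that pass the probe.
 * $probe is injectable so the filtering logic can be exercised without a network.
 */
function filterLiveZkHosts(string $zkhosts, float $timeout = 1.0, ?callable $probe = null): string
{
    $probe = $probe ?? 'zkPortIsOpen';
    $live = [];
    foreach (explode(',', $zkhosts) as $entry) {
        $entry = trim($entry);
        if ($entry === '') {
            continue;
        }
        // Fall back to the port used in this issue (2181) when none is given.
        [$host, $port] = array_pad(explode(':', $entry, 2), 2, '2181');
        if ($probe($host, (int) $port, $timeout)) {
            $live[] = $host . ':' . (int) $port;
        }
    }
    if ($live === []) {
        throw new RuntimeException('No ZooKeeper host is reachable: ' . $zkhosts);
    }
    return implode(',', $live);
}

/**
 * Build the config array from this issue, with unreachable hosts removed.
 */
function buildCloudConfig(string $zkhosts, string $collection, ?callable $probe = null): array
{
    return [
        'zkhosts' => filterLiveZkHosts($zkhosts, 1.0, $probe),
        'defaultcollection' => $collection,
    ];
}
```

Calling `buildCloudConfig('HOST1:2181,HOST2:2181,HOST3:2181', 'COLLECTION_NAME')` would then give a config that lists only the hosts that accepted a connection. The probe adds up to one timeout per dead host on every request, and a host can still go down between the probe and the real connection, so this would reduce the failures in scenarios 1 and 2 rather than eliminate them.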

I'm not quite sure whether this question is suitable for this issue queue, or whether I should post it on Stack Overflow instead.

Thanks!
