Affects Version/s: 1.1.1
Pull Request URL:
SpringXD container has to recover from a connection loss to Zookeeper cluster by either:
- re-deploying modules to other available spring-xd containers in the cluster or
- by successful recovery from the CONNECTION_SUSPENDED, CONNECTION_RECONNECTED and CHILD_REMOVED events. The spring-xd admin should correctly identify the reconnect of the lost container and handle module redeployment properly.
We are running Spring XD 1.1.1 with Zookeeper 3.4.5 in our production environment. Zookeeper is running in failover mode and consists of three independent nodes set up on three separate VMs. From time to time we get a "Connection to Zookeeper Suspended" event, which causes one of the containers to be removed from the SpringXD cluster. Modules that were deployed on the removed container are not redeployed to other containers in the cluster.
- SpringXD 1.1.1
- Zookeeper 3.4.5 and 3.4.6
Cluster set up in PROD environment where error occurs:
- 4 Spring-XD dedicated servers
- 4 spring-xd containers (each running on a designated server)
- 2 spring-xd admins (each running alongside one spring-xd container)
- 3 Zookeeper nodes (3 designated servers on PAITO environment)
Cluster set up in TEST environment where error also occurred:
- 2 Spring-XD dedicated servers running one spring-xd container and one spring-xd admin each
- 3 Zookeeper nodes running on 3 dedicated servers (PAITO Test environment)
Cluster set up to reproduce error found in PROD environment:
- 1 spring-xd admin
- 3 spring-xd containers (each running on a designated VM)
- 3 zookeeper servers running on one VM
Steps to reproduce:
1) Set up a three-node Zookeeper cluster. An example zoo.cfg is attached; we are using default configuration values. In this particular test case we ran all Zookeeper nodes on a single VM, as we were not testing network-layer interruptions.
2) Set up one Spring XD admin node. Please note that we have also observed this on a two-node Spring XD admin cluster.
3) Set up three Spring XD container nodes. All of them belong to one group (SA) and two of them also belong to a second group (HA1). This is configured in $XD_HOME/config/servers.yml; however, so far the group configuration has never influenced the test outcome.
4) Create and deploy a test stream using the following XD Shell commands:
stream create --name test-zookeeper-failover --definition "syslog-udp --port=5140 | transform | file --dir='/opt/pivotal/spring-xd/xd/output'"
stream deploy --name test-zookeeper-failover --properties "module.syslog-udp.criteria=groups.contains('HA1'),module.syslog-udp.count=2,module.file.criteria=groups.contains('SA'),module.file.count=3,module.transform.criteria=groups.contains('SA')"
5) Ensure that the test stream works and handles traffic on UDP port 5140.
6) Shut down one of the Zookeeper nodes by issuing a stop command.
7) Two Spring XD containers were not affected and remained in Spring XD cluster.
8) One Spring XD container was kicked out of Spring XD cluster and was no longer visible on Spring XD admin Web UI. Modules previously deployed to this container were not redeployed to other cluster members.
9) On the failed Spring XD container we observed CONNECTION_SUSPENDED, CONNECTION_RECONNECTED and CHILD_REMOVED Zookeeper events (container-log.txt is attached). Please note that the Java process is still running and we see “ConnectionStateManager-0 server.ContainerRegistrar - Waiting for supervisor to clean up prior deployments” messages.
10) Spring XD admin failed with an exception in DepartingContainerModuleRedeployer (admin-log.txt is attached).
11) We observed that the departing container's node in Zookeeper (/sa/deployments/modules/allocated/1d3fd4cc-5a70-47ed-b4f3-22deef1f4d4f/) had no children. We checked this a few minutes later, so we are not sure at which point it was cleared.
12) Restarting the failed Spring XD container fixed the problem; modules were correctly redeployed.
Exception from point 10 is very similar to XD-1983 and this code was rewritten in XD-2004.
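
When repeating steps 1 and 6, it may help to confirm that the two remaining Zookeeper nodes still form a quorum after the third is stopped, so that a container dropout cannot be blamed on the ensemble itself being unavailable. The following is a minimal sketch, not part of our original test setup. It uses Zookeeper's four-letter `srvr` command over a plain socket. The host name and the client ports 2181, 2182 and 2183 are assumptions for three nodes on one VM and must be adjusted to match the clientPort values in each zoo.cfg.

```python
import socket


def four_letter_word(host, port, command, timeout=2.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once the reply has been sent.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Return the value of the 'Mode:' line of a 'srvr' reply, or None."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def ensemble_status(servers):
    """Map each (host, port) to its reported mode, 'unknown' or 'down'."""
    status = {}
    for host, port in servers:
        try:
            mode = parse_mode(four_letter_word(host, port, "srvr"))
            status[(host, port)] = mode or "unknown"
        except OSError:
            # Connection refused or timed out: treat the node as stopped.
            status[(host, port)] = "down"
    return status


def has_quorum(status):
    """True when a strict majority of servers report leader or follower."""
    serving = sum(1 for mode in status.values() if mode in ("leader", "follower"))
    return serving > len(status) // 2


if __name__ == "__main__":
    # Assumed client ports for three nodes on one VM; adjust to your zoo.cfg.
    servers = [("localhost", 2181), ("localhost", 2182), ("localhost", 2183)]
    status = ensemble_status(servers)
    for (host, port), mode in status.items():
        print(f"{host}:{port} -> {mode}")
    print("quorum:", has_quorum(status))
```

Running this before and after step 6 shows which node was stopped and whether a leader is still present. A node that is running but not serving requests has no `Mode:` line in its reply and is reported as `unknown`.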