
XD / Zookeeper connection lost.


    Details

    • Type: Bug
    • Status: To Do
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.2.1
    • Fix Version/s: None
    • Component/s: Runtime
    • Story Points: 5
    • Rank (Obsolete): 9223372036854775807
    • Out of Scope: If you need more information, I can provide full logs.

      Description

      An XD container loses its connection to ZooKeeper.

      I'm running a distributed environment:

      • 3 XD container nodes (1.2.1)
      • 1 XD admin
      • 3 Zookeeper
      • 3 RabbitMQ
      • 3 Redis/Sentinel

      Logs:

      zookeeper.log

      2015-11-25 06:53:07,235 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:[email protected]] - Established session 0x251250651910006 with negotiated timeout 40000 for client /172.20.1.9:58070
      2015-11-25 06:54:08,525 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:[email protected]] - caught end of stream exception
      EndOfStreamException: Unable to read additional data from client sessionid 0x251250651910006, likely client has closed socket
              at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
              at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
              at java.lang.Thread.run(Thread.java:745)
      2015-11-25 06:54:08,621 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:[email protected]] - Closed socket connection for client /172.20.1.9:58070 which had sessionid 0x251250651910006
      

      container.log

      2015-11-25T06:53:37+0100 1.2.1.RELEASE ERROR main-EventThread curator.ConnectionState - Connection timed out for connection string (172.20.1.1:2181,172.20.1.8:2181,172.20.1.9:2181) and timeout (30000) / elapsed (34187)
      org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
              at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) [curator-client-2.6.0.jar:na]
              at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) [curator-client-2.6.0.jar:na]
              at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) [curator-client-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) [curator-client-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) [curator-framework-2.6.0.jar:na]
              at org.springframework.xd.dirt.server.container.DeploymentListener$StreamModuleWatcher.process(DeploymentListener.java:596) [spring-xd-dirt-1.2.1.RELEASE.jar:1.2.1.RELEASE]
              at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) [curator-framework-2.6.0.jar:na]
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) [zookeeper-3.4.6.jar:3.4.6-1569965]
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
      2015-11-25T06:53:37+0100 1.2.1.RELEASE ERROR CuratorFramework-0 curator.ConnectionState - Connection timed out for connection string (172.20.1.1:2181,172.20.1.8:2181,172.20.1.9:2181) and timeout (30000) / elapsed (34189)
      org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
              at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) [curator-client-2.6.0.jar:na]
              at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) [curator-client-2.6.0.jar:na]
              at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) [curator-client-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_60]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
              at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
      2015-11-25T06:53:39+0100 1.2.1.RELEASE ERROR CuratorFramework-0 curator.ConnectionState - Connection timed out for connection string (172.20.1.1:2181,172.20.1.8:2181,172.20.1.9:2181) and timeout (30000) / elapsed (36191)
      org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
              at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) [curator-client-2.6.0.jar:na]
              at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) [curator-client-2.6.0.jar:na]
              at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) [curator-client-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
      [...]
      2015-11-25T06:54:34+0100 1.2.1.RELEASE INFO ConnectionStateManager-0 container.ContainerRegistrar - Waiting for supervisor to clean up prior deployments (elapsed time 26 seconds)...
      2015-11-25T06:55:05+0100 1.2.1.RELEASE INFO ConnectionStateManager-0 container.ContainerRegistrar - Waiting for supervisor to clean up prior deployments (elapsed time 57 seconds)...
      2015-11-25T06:56:05+0100 1.2.1.RELEASE INFO ConnectionStateManager-0 container.ContainerRegistrar - Waiting for supervisor to clean up prior deployments (elapsed time 117 seconds)...
      2015-11-25T06:57:05+0100 1.2.1.RELEASE INFO ConnectionStateManager-0 container.ContainerRegistrar - Waiting for supervisor to clean up prior deployments (elapsed time 177 seconds)...
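
For anyone triaging these logs, here is a minimal sketch (Python, standard library only) of how the `timeout (...) / elapsed (...)` pairs in the Curator messages above could be extracted to see how far each check overshot the 30000 ms connection timeout. The function name `timeout_overshoots` and the regular expression are illustrative assumptions, not part of Spring XD or Curator.

```python
import re

# Assumed pattern: matches the "timeout (N) / elapsed (M)" suffix seen in container.log.
TIMEOUT_RE = re.compile(r"timeout \((\d+)\) / elapsed \((\d+)\)")


def timeout_overshoots(lines):
    """Return (timeout_ms, elapsed_ms, overshoot_ms) for each matching log line."""
    results = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeout_ms = int(match.group(1))
            elapsed_ms = int(match.group(2))
            results.append((timeout_ms, elapsed_ms, elapsed_ms - timeout_ms))
    return results


# The three Curator error lines quoted in container.log above.
sample = [
    "2015-11-25T06:53:37+0100 1.2.1.RELEASE ERROR main-EventThread curator.ConnectionState - Connection timed out for connection string (172.20.1.1:2181,172.20.1.8:2181,172.20.1.9:2181) and timeout (30000) / elapsed (34187)",
    "2015-11-25T06:53:37+0100 1.2.1.RELEASE ERROR CuratorFramework-0 curator.ConnectionState - Connection timed out for connection string (172.20.1.1:2181,172.20.1.8:2181,172.20.1.9:2181) and timeout (30000) / elapsed (34189)",
    "2015-11-25T06:53:39+0100 1.2.1.RELEASE ERROR CuratorFramework-0 curator.ConnectionState - Connection timed out for connection string (172.20.1.1:2181,172.20.1.8:2181,172.20.1.9:2181) and timeout (30000) / elapsed (36191)",
]

for timeout_ms, elapsed_ms, overshoot_ms in timeout_overshoots(sample):
    print(f"timeout={timeout_ms} ms elapsed={elapsed_ms} ms overshoot={overshoot_ms} ms")
```

On the three messages quoted above this reports overshoots of 4187, 4189 and 6191 ms.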
      

      admin.log

      2015-11-25T06:54:23+0100 1.2.1.RELEASE ERROR DeploymentSupervisor-0 cache.PathChildrenCache -
      org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /xd/deployments/modules/allocated/b1de9530-1837-42c0-a6bc-840b1b15aefc/JOB_TRIGGER.source.trigger.1
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
              at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
              at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.6.0.jar:na]
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287) ~[curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.6.0.jar:na]
              at org.springframework.xd.dirt.server.admin.deployment.zk.DepartingContainerModuleRedeployer.deployModules(DepartingContainerModuleRedeployer.java:116) ~[spring-xd-dirt-1.2.1.RELEASE.jar:1.2.1.RELEASE]
              at org.springframework.xd.dirt.server.admin.deployment.zk.ContainerListener.childEvent(ContainerListener.java:140) ~[spring-xd-dirt-1.2.1.RELEASE.jar:1.2.1.RELEASE]
              at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:509) [curator-recipes-2.6.0.jar:na]
              at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:503) [curator-recipes-2.6.0.jar:na]
              at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92) [curator-framework-2.6.0.jar:na]
              at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:na]
              at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83) [curator-framework-2.6.0.jar:na]
              at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:500) [curator-recipes-2.6.0.jar:na]
              at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.6.0.jar:na]
              at org.apache.curator.framework.recipes.cache.PathChildrenCache$10.run(PathChildrenCache.java:762) [curator-recipes-2.6.0.jar:na]
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_60]
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_60]
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_60]
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_60]
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_60]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
              at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
      

      If a module is deployed on the node that has lost the connection, it is not redeployed on either of the two other nodes.

      The only difference between the nodes is that the failing node has less memory.

      When this occurs, the node no longer appears in the admin UI. The deployed streams also do not appear as incomplete, although they should when a node has disappeared and the deployment property module.*.count is set to the number of nodes.
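
The expectation in the paragraph above can be written down as a small sketch. It models only what this report expects to see, not how Spring XD actually computes stream state; the function name `expected_stream_status` and the two status strings are assumptions for illustration.

```python
def expected_stream_status(requested_count, live_instances):
    """Status the reporter expects for a module deployed with module.*.count.

    requested_count: value of module.*.count (here, the number of container nodes).
    live_instances:  module instances still running on connected containers.
    Hypothetical model: fewer live instances than requested means "incomplete".
    """
    if live_instances < requested_count:
        return "incomplete"
    return "deployed"


# Three container nodes, count set to 3, one node has lost its ZooKeeper connection.
print(expected_stream_status(requested_count=3, live_instances=2))
```

With three nodes and one of them gone, this model yields "incomplete", which is the state the admin UI is expected to show instead of leaving the stream displayed as fully deployed.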

      Thanks.

      Mickaël


            People

            Assignee:
            Unassigned
            Reporter:
            mgervais GERVAIS Mickaël