Calling session.refreshSchema() may hang #343

Open
Bouncheck opened this issue Sep 13, 2024 · 0 comments
Labels
bug Something isn't working

Bouncheck (Collaborator) commented Sep 13, 2024

It seems that in some cases refreshSchema() can hang the calling thread without ever recovering. An easy way to trigger this is to call it right after the connection to the cluster has been lost. The refresh is not expected to succeed in that case, but I would assume it should not hang the thread.
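
For context, refreshSchema() is presumably just the blocking wrapper around refreshSchemaAsync(), so until this is fixed a caller can at least bound the wait by going through the async variant directly. A minimal workaround sketch (the 30-second timeout is an arbitrary value chosen for illustration):

  // Workaround sketch (not a fix): bound the wait instead of blocking indefinitely.
  // Assumes imports of java.util.concurrent.ExecutionException, TimeUnit and TimeoutException;
  // the 30-second timeout is arbitrary.
  try {
    session.refreshSchemaAsync().toCompletableFuture().get(30, TimeUnit.SECONDS);
  } catch (TimeoutException | ExecutionException e) {
    // The refresh did not complete; metadata stays stale, but the calling thread is not stuck.
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
  }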

Steps to reproduce

Create the following test method (it's easier as a test because the reproducer uses the test-infra dependency):

  @Test
  public void reproducer() {
    DriverConfigLoader loader =
        new DefaultProgrammaticDriverConfigLoaderBuilder()
            .withBoolean(TypedDriverOption.RECONNECT_ON_INIT.getRawOption(), true)
            .withStringList(
                TypedDriverOption.CONTACT_POINTS.getRawOption(),
                Collections.singletonList("127.0.1.1:9042"))
            .build();

    CqlSessionBuilder builder = new CqlSessionBuilder().withConfigLoader(loader);
    CqlSession session = null;
    Collection<Node> nodes;
    int counter = 0;
    while (counter < 2) {
      counter++;
      try (CcmBridge ccmBridge =
               CcmBridge.builder().withNodes(3).withIpPrefix("127.0." + counter + ".").build()) {
        ccmBridge.create();
        ccmBridge.start();
        if (session == null) {
          session = builder.build();
        }
        // CLUSTER_WAIT_SECONDS is assumed to be a constant defined in the enclosing test class
        // (the number of seconds to wait for all nodes of the new cluster to come up).
        long endTime = System.currentTimeMillis() + CLUSTER_WAIT_SECONDS * 1000;
        while (System.currentTimeMillis() < endTime) {
          try {
            nodes = session.getMetadata().getNodes().values();
            int upNodes = 0;
            for (Node node : nodes) {
              if (node.getUpSinceMillis() > 0) {
                upNodes++;
              }
            }
            if (upNodes == 3) {
              break;
            }
            System.out.println("Before refreshSchema call (n==" + counter +")");
            session.refreshSchema();
            System.out.println("#   After refreshSchema call (n==" + counter +")");
            Thread.sleep(1000);
          } catch (InterruptedException e) {
            break;
          }
        }
      }
    }
  } 

Add the following logger to logback-test.xml to see the exception:

<logger name="com.datastax.oss.driver.internal.core.metadata" level="DEBUG"/>

Then run the method.

Result

The test will hang at some point on the refreshSchema() call. The following exception should be visible in the logs:

12:06:23.280 [main] INFO  c.d.o.d.api.testinfra.ccm.CcmBridge - Executing: [ccm, remove, --config-dir=/tmp/ccm447078114819940528]
12:06:23.586 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.2:9042, hostId=ef860132-71e5-4d28-b89b-77ce6003e37f, hashCode=7811d31e))
12:06:23.587 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(RECONNECTION_STARTED, Node(endPoint=/127.0.1.2:9042, hostId=ef860132-71e5-4d28-b89b-77ce6003e37f, hashCode=7811d31e))
12:06:23.588 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.2:9042, hostId=ef860132-71e5-4d28-b89b-77ce6003e37f, hashCode=7811d31e))
12:06:23.588 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Transitioning Node(endPoint=/127.0.1.2:9042, hostId=ef860132-71e5-4d28-b89b-77ce6003e37f, hashCode=7811d31e) UP=>DOWN (because it was reconnecting and lost its last connection)
12:06:23.594 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.3:9042, hostId=d6fd5558-7343-4368-9b90-9448a66e57a1, hashCode=2bd8a170))
12:06:23.595 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(RECONNECTION_STARTED, Node(endPoint=/127.0.1.3:9042, hostId=d6fd5558-7343-4368-9b90-9448a66e57a1, hashCode=2bd8a170))
12:06:23.595 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.3:9042, hostId=d6fd5558-7343-4368-9b90-9448a66e57a1, hashCode=2bd8a170))
12:06:23.596 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Transitioning Node(endPoint=/127.0.1.3:9042, hostId=d6fd5558-7343-4368-9b90-9448a66e57a1, hashCode=2bd8a170) UP=>DOWN (because it was reconnecting and lost its last connection)
12:06:23.596 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.1:9042, hostId=2e908145-3172-46e6-9206-a27583e1f867, hashCode=3a40b8e2))
12:06:23.596 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(RECONNECTION_STARTED, Node(endPoint=/127.0.1.1:9042, hostId=2e908145-3172-46e6-9206-a27583e1f867, hashCode=3a40b8e2))
12:06:23.596 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.1:9042, hostId=2e908145-3172-46e6-9206-a27583e1f867, hashCode=3a40b8e2))
12:06:23.597 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Processing ChannelEvent(CLOSED, Node(endPoint=/127.0.1.1:9042, hostId=2e908145-3172-46e6-9206-a27583e1f867, hashCode=3a40b8e2))
12:06:23.597 [s0-admin-1] DEBUG c.d.o.d.i.c.m.NodeStateManager - [s0] Transitioning Node(endPoint=/127.0.1.1:9042, hostId=2e908145-3172-46e6-9206-a27583e1f867, hashCode=3a40b8e2) UP=>DOWN (because it was reconnecting and lost its last connection)
12:06:24.649 [main] INFO  c.d.o.d.api.testinfra.ccm.CcmBridge - Executing: [ccm, create, ccm_1, -i, 127.0.2., -n, 3:0, -v, release:6.0.1, --scylla, --config-dir=/tmp/ccm7441250503244669492]
12:06:25.089 [Exec Stream Pumper] INFO  c.d.o.d.api.testinfra.ccm.CcmBridge - ccmout> Current cluster is now: ccm_1
12:06:25.154 [main] INFO  c.d.o.d.api.testinfra.ccm.CcmBridge - Executing: [ccm, updateconf, auto_snapshot:false, --config-dir=/tmp/ccm7441250503244669492]
12:06:25.568 [main] INFO  c.d.o.d.api.testinfra.ccm.CcmBridge - Executing: [ccm, start, --wait-for-binary-proto, --wait-other-notice, --config-dir=/tmp/ccm7441250503244669492]
Before refreshSchema call (n==2)
12:06:32.124 [s0-admin-0] DEBUG c.d.o.d.i.c.metadata.MetadataManager - [s0] Starting schema refresh
12:06:32.124 [s0-admin-0] DEBUG c.d.o.d.i.c.m.SchemaAgreementChecker - [s0] Checking schema agreement
12:06:32.126 [s0-io-3] DEBUG c.d.o.d.i.c.m.SchemaAgreementChecker - [s0] Error while checking schema agreement, completing now (false)
java.util.concurrent.CompletionException: io.netty.channel.StacklessClosedChannelException
  at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
  at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
  at java.util.concurrent.CompletableFuture.biApply(CompletableFuture.java:1102)
  at java.util.concurrent.CompletableFuture$BiApply.tryFire(CompletableFuture.java:1084)
  at java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1034)
  at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
  at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
  at com.datastax.oss.driver.internal.core.adminrequest.AdminRequestHandler.setFinalError(AdminRequestHandler.java:206)
  at com.datastax.oss.driver.internal.core.adminrequest.AdminRequestHandler.onWriteComplete(AdminRequestHandler.java:150)
  at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
  at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
  at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
  at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
  at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
  at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
  at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:889)
  at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:875)
  at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:984)
  at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:868)
  at io.netty.channel.DefaultChannelPipeline.write(DefaultChannelPipeline.java:1015)
  at io.netty.channel.AbstractChannel.write(AbstractChannel.java:301)
  at com.datastax.oss.driver.internal.core.channel.DefaultWriteCoalescer$Flusher.runOnEventLoop(DefaultWriteCoalescer.java:102)
  at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  at java.lang.Thread.run(Thread.java:750)
Caused by: io.netty.channel.StacklessClosedChannelException: null
  at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
Bouncheck added the bug label Sep 13, 2024