Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix class/resource leak in :nessie-quarkus:test #9441

Closed
snazy opened this issue Aug 29, 2024 · 5 comments · Fixed by #9453
Closed

Fix class/resource leak in :nessie-quarkus:test #9441

snazy opened this issue Aug 29, 2024 · 5 comments · Fixed by #9453

Comments

@snazy
Copy link
Member

snazy commented Aug 29, 2024

Added a commit to #9433 to increase the max test-worker heap size to 3g. This makes the :nessie-quarkus:test task finish.

There is a resource leak, but that's already present with earlier Quarkus versions, so I suspect something that's present in all our tests is causing the leak, #9440 is one cause, but not the only one.

This issue is to investigate, find the root cause, fix it and bring the max test-worker heap size back to 2g.

@snazy
Copy link
Member Author

snazy commented Aug 29, 2024

So, good news first: there are

  • no thread leaks
  • no class loader leaks

But there are ThreadLocals leaking, which are loaded by the Quarkus test class-loaders, which then cause "all the classes" and resources to leak. Some classes that add ThreadLocals are the following. There are more - but the ones that are initialized with a Supplier seem to be the problematic ones.

  • io.netty.util.internal.InternalThreadLocalMap (cannot modify)
  • org.projectnessie.cel.parser.Unescape
  • org.apache.avro.Schema (cannot modify)

@snazy
Copy link
Member Author

snazy commented Aug 30, 2024

Discussion on Quarkus Zulip

@snazy
Copy link
Member Author

snazy commented Aug 30, 2024

The leak's caused by ThreadLocals attached by 3rd party libraries (Netty, Avro and some more) to the shared JUnit Test worker thread.

In JUnit, tests are run from the single Test worker thread (or a poll of threads, if JUnit is configured to do so). The QuarkusTestExtensions changes the thread's context-class-loader to use the correct Quarkus class loader (QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for ...).

If a test somehow "touches" a dependency that creates a thread local, that thread local is attached to the Test worker thread. Types in such a thread local reference the Quarkus test class loader.

Even though everything else is properly cleaned up before the next test w/ a different Quarkus test class loader starts, the thread locals are still referenced by the Test worker thread, which means that the objects/suppliers in those thread locals are still referenced, in turn the class loader and with that the whole set of classes. The result is eventually an OOM.

Here's an output that (should) illustrate the issue. It's generated by some hacky code that inspects the Test worker thread's threadlocals map.

THREAD LOCALS OF Thread[#1,Test worker,5,main]
jdk.internal.loader.ClassLoaders$PlatformClassLoader@42f5e9a6
   jdk.nio.zipfs.ZipFileSystem$IndexNode /org/assertj/core/error/UnsatisfiedRequirement.class , index: -1
QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for TestMetricsDisabled [V1] (QuarkusTest)@69b07727
   org.apache.avro.NameValidator$2 org.apache.avro.NameValidator$2@c8f14da
   org.apache.avro.Schema$$Lambda/0x00007feccca70000 org.apache.avro.Schema$$Lambda/0x00007feccca70000@5082ae8
   org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccce6af98 org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccce6af98@2f992272
QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for TestMetricsTags [V1] (QuarkusTest)@5cf93ce8
   io.netty.util.internal.InternalThreadLocalMap io.netty.util.internal.InternalThreadLocalMap@45d1828
   org.apache.avro.NameValidator$2 org.apache.avro.NameValidator$2@eecf044
   org.apache.avro.Schema$$Lambda/0x00007feccd87fd10 org.apache.avro.Schema$$Lambda/0x00007feccd87fd10@69e3d304
   org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccdc9a5a0 org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccdc9a5a0@3dd96fb3
jdk.internal.loader.ClassLoaders$AppClassLoader@76ed5528
   org.jboss.logmanager.ThreadLocalNDC$Stack 
QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for TestQuarkusRestInMemoryPersist [V1] (QuarkusTest)@681e913c
   org.apache.avro.NameValidator$2 org.apache.avro.NameValidator$2@69d2fc03
   org.apache.avro.Schema$$Lambda/0x00007feccb96ec18 org.apache.avro.Schema$$Lambda/0x00007feccb96ec18@423a5c9c
   org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccbe30fa8 org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccbe30fa8@2ef77fcc

@snazy
Copy link
Member Author

snazy commented Aug 30, 2024

Just documenting the extremely hacky code used to identify the issue...

public class TempBase {

  @BeforeAll
  public static void foo() throws Exception {
    System.out.println();
    System.out.println();
    System.out.println("*** START UP");
    System.out.println();
    System.out.println();

    dump();
  }

  @AfterAll
  public static void cleanup() throws Exception {
    System.out.println();
    System.out.println();
    System.out.println("*** CLEAN UP");
    System.out.println();
    System.out.println();

    dump();
  }

  private static void dump() throws Exception {
    Map<ClassLoader, List<Thread>> threadsByClassLoader = new IdentityHashMap<>();
    Thread t1 = null;
    for (Thread thread : Thread.getAllStackTraces().keySet()) {
      if (thread.threadId() == 1) {
        t1 = thread;
      }
      threadsByClassLoader
          .computeIfAbsent(thread.getContextClassLoader(), cl -> new ArrayList<>())
          .add(thread);
    }
    threadsByClassLoader.forEach(
        (cl, t) -> {
          if (cl == null) {
            return;
          }
          System.out.println(cl);
          t.sort(Comparator.comparing(Thread::getName).thenComparing(Thread::getId));
          for (Thread thread : t) {
            System.out.println("   " + thread.threadId() + " " + thread.getName());
          }
        });

    if (t1 != null) {
      dumpThreadLocals(t1);
    }

    Thread t = Thread.currentThread();
    if (t != t1) {
      dumpThreadLocals(t);
    }
  }

  private static void dumpThreadLocals(Thread t)
      throws NoSuchFieldException, IllegalAccessException {
    Map<ClassLoader, List<Object>> threadLocalsByClassLoader = new IdentityHashMap<>();
    Field threadLocalsField = Thread.class.getDeclaredField("threadLocals");
    threadLocalsField.setAccessible(true);
    Object threadLocals = threadLocalsField.get(t);
    Field tableField = threadLocals.getClass().getDeclaredField("table");
    tableField.setAccessible(true);
    Object[] entries = (Object[]) tableField.get(threadLocals);
    for (Object entry : entries) {
      if (entry != null) {
        Field valueField = entry.getClass().getDeclaredField("value");
        valueField.setAccessible(true);
        Object value = valueField.get(entry);
        if (value != null) {
          threadLocalsByClassLoader
              .computeIfAbsent(value.getClass().getClassLoader(), cl -> new ArrayList<>())
              .add(value);

          Field referentField = Reference.class.getDeclaredField("referent");
          referentField.setAccessible(true);
          Object referent = referentField.get(entry);
          if (referent != null
              && referent
                  .getClass()
                  .getName()
                  .equals("java.lang.ThreadLocal$SuppliedThreadLocal")) {
            Field supplierField = referent.getClass().getDeclaredField("supplier");
            supplierField.setAccessible(true);
            Object supplier = supplierField.get(referent);
            if (supplier != null) {
              threadLocalsByClassLoader
                  .computeIfAbsent(supplier.getClass().getClassLoader(), cl -> new ArrayList<>())
                  .add(supplier);
            }
          }
        }
      }
    }
    System.out.println("THREAD LOCALS OF " + t);
    threadLocalsByClassLoader.forEach(
        (cl, objs) -> {
          if (cl == null) {
            return;
          }
          System.out.println(cl);
          objs.sort(
              Comparator.comparing(o -> o.getClass().getName())
                  .thenComparing(System::identityHashCode));
          for (Object obj : objs) {
            System.out.println("   " + obj.getClass().getName() + " " + obj);
          }
        });
  }
}

snazy added a commit to snazy/nessie that referenced this issue Aug 30, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

Fixes projectnessie#9441
snazy added a commit to snazy/nessie that referenced this issue Aug 30, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in projectnessie#9433).

Fixes projectnessie#9441
snazy added a commit to snazy/nessie that referenced this issue Aug 30, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in projectnessie#9433).

Fixes projectnessie#9441
snazy added a commit to snazy/nessie that referenced this issue Aug 30, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class. The logic to distinguish multi-env from non-multi-env tests via `MultiEnvExtensionRegistry.registerExtension()` via test discovery is not 100% deterministic, it can add multi-env tests to the non-multi-env tests, so an additional check is needed there.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in projectnessie#9433).

Fixes projectnessie#9441
snazy added a commit to snazy/nessie that referenced this issue Aug 30, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class. The logic to distinguish multi-env from non-multi-env tests via `MultiEnvExtensionRegistry.registerExtension()` via test discovery is not perfect (but good enough), it can add multi-env tests to the non-multi-env tests, so an additional check is needed there.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in projectnessie#9433).

Fixes projectnessie#9441
snazy added a commit to snazy/nessie that referenced this issue Aug 30, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class. The logic to distinguish multi-env from non-multi-env tests via `MultiEnvExtensionRegistry.registerExtension()` via test discovery is not perfect (but good enough), it can add multi-env tests to the non-multi-env tests, so an additional check is needed there.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in projectnessie#9433).

Fixes projectnessie#9441
@snazy
Copy link
Member Author

snazy commented Aug 30, 2024

Related JUnit issue

snazy added a commit to snazy/nessie that referenced this issue Sep 4, 2024
See projectnessie#9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class. The logic to distinguish multi-env from non-multi-env tests via `MultiEnvExtensionRegistry.registerExtension()` via test discovery is not perfect (but good enough), it can add multi-env tests to the non-multi-env tests, so an additional check is needed there.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in projectnessie#9433).

Fixes projectnessie#9441
snazy added a commit that referenced this issue Sep 6, 2024
See #9441 for a complete description of the underlying problem. TL;DR is: `ThreadLocal`s from various 3rd party libraries leak into the single `Test worker` thread that runs all the tests, resulting in TL objects/suppliers from the various Quarkus test class loaders, eventually leading to nasty OOMs.

This change updates the `MultiEnvTestEngine` by using the new `ThreadPerTestClassExecutionExecutorService` and also "assimilate" really all tests, even the non-multi-env tests, so that those also run on a thread per test-class. The logic to distinguish multi-env from non-multi-env tests via `MultiEnvExtensionRegistry.registerExtension()` via test discovery is not perfect (but good enough), it can add multi-env tests to the non-multi-env tests, so an additional check is needed there.

Since each test class runs on "its own thread", the `ThreadLocal`s are registered on that thread. Once the test class finishes, the thread is disposed and its thread locals become eligible for garbage collection, which is what is needed.

The bump of the max-heap size for test workers is also reduced back to 2g (was changed in #9433).

Fixes #9441
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant