
A (possibly mislabeled) OOM error occurs when a user has too many services and creates a new one #421

Open
phlg opened this issue Apr 29, 2024 · 0 comments

Contributor

phlg commented Apr 29, 2024

Hi,

One of our Onyxia users frequently hits an error when creating a new service in the Onyxia UI. After some rough testing, the cause seems to be that this user has a large number of running services (24), which makes Onyxia API start that many threads to handle the listing request coming from Onyxia UI.

Other users with fewer services couldn't reproduce the error. On the other hand, using kubectl exec to get a shell on the Onyxia API pod and running many helm get all ... commands in parallel produces the same error.

What's puzzling is that the error mentions a process limit (ulimit -u) on the Go side and an OutOfMemoryError on the Java side, but we don't seem to be limited on either (unlimited user processes as seen from inside the container, and 4 GB/16 GB requests/limits for RAM):

2024-04-29 10:29:29.214 runtime: may need to increase max user processes (ulimit -u)
2024-04-29 10:29:29.214 runtime: failed to create new OS thread (have 8 already; errno=11)
2024-04-29 10:29:29.214 fatal error: newosproc
2024-04-29 10:29:29.214 runtime: may need to increase max user processes (ulimit -u)
2024-04-29 10:29:29.214 runtime: failed to create new OS thread (have 7 already; errno=11)
2024-04-29 10:29:29.214 fatal error: newosproc
2024-04-29 10:29:29.214 runtime: may need to increase max user processes (ulimit -u)
2024-04-29 10:29:29.214 runtime: failed to create new OS thread (have 7 already; errno=11)
2024-04-29 10:29:29.213 fatal error: newosproc
2024-04-29 10:29:29.213 runtime: may need to increase max user processes (ulimit -u)
2024-04-29 10:29:29.213 runtime: failed to create new OS thread (have 5 already; errno=11)
2024-04-29 10:29:29.211 fatal error: newosproc
2024-04-29 10:29:29.211 runtime: may need to increase max user processes (ulimit -u)
2024-04-29 10:29:29.211 runtime: failed to create new OS thread (have 6 already; errno=11)
2024-04-24T07:09:37.528Z  WARN 7 --- [ool-worker-8074] i.g.i.h.service.HelmInstallService       : Exception occurre
org.zeroturnaround.exec.ProcessInitException: Could not execute [helm, get, all, vscode-python-273990, --namespace, REDACTED]. Error=11, Resource temporarily unavailable
        at org.zeroturnaround.exec.ProcessInitException.newInstance(ProcessInitException.java:80) ~[zt-exec-1.12.jar:na]
        at org.zeroturnaround.exec.ProcessExecutor.invokeStart(ProcessExecutor.java:1002) ~[zt-exec-1.12.jar:na]
        at org.zeroturnaround.exec.ProcessExecutor.startInternal(ProcessExecutor.java:970) ~[zt-exec-1.12.jar:na]
        at org.zeroturnaround.exec.ProcessExecutor.execute(ProcessExecutor.java:906) ~[zt-exec-1.12.jar:na]
        at io.github.inseefrlab.helmwrapper.utils.Command.execute(Command.java:73) ~[java-helm-wrapper-v2.5.0.jar:v2.5.0
        at io.github.inseefrlab.helmwrapper.service.HelmInstallService.getAll(HelmInstallService.java:139) ~[java-helm-wrapper-v2.5.0.jar:v2.5.0]
        at fr.insee.onyxia.api.services.impl.HelmAppsService.getHelmApp(HelmAppsService.java:370) ~[classes/:v2.5.0]
        at fr.insee.onyxia.api.services.impl.HelmAppsService.lambda$getUserServices$2(HelmAppsService.java:249) ~[classes/:v2.5.0]
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[na:na]
        at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source) ~[na:na]
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[na:na]
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[na:na]
        at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(Unknown Source) ~[na:na]
        at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(Unknown Source) ~[na:na]
        at java.base/java.util.stream.AbstractTask.compute(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.CountedCompleter.exec(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) ~[na:na]
Caused by: java.io.IOException: Cannot run program "helm": error=11, Resource temporarily unavailable
        at java.base/java.lang.ProcessBuilder.start(Unknown Source) ~[na:na]
        at java.base/java.lang.ProcessBuilder.start(Unknown Source) ~[na:na]
        at org.zeroturnaround.exec.ProcessExecutor.invokeStart(ProcessExecutor.java:997) ~[zt-exec-1.12.jar:na]
        ... 19 common frames omitted
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
        at java.base/java.lang.ProcessImpl.forkAndExec(Native Method) ~[na:na]
        at java.base/java.lang.ProcessImpl.<init>(Unknown Source) ~[na:na]
        at java.base/java.lang.ProcessImpl.start(Unknown Source) ~[na:na]
        ... 22 common frames omitte
[773707.539s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 4k, detached.
[773707.540s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-54750"
2024-04-24T07:09:37.905Z ERROR 7 --- [nio-8080-exec-9] o.a.c.c.C.[.[.[.[dispatcherServlet]      : Servlet.service() for servlet [dispatcherServlet] in context with path [/api] threw exception [Request processing failed: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError] with root caus
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method) ~[na:na]
        at java.base/java.lang.Thread.start(Unknown Source) ~[na:na]
        at org.zeroturnaround.exec.stream.PumpStreamHandler.start(PumpStreamHandler.java:175) ~[zt-exec-1.12.jar:na]
        at org.zeroturnaround.exec.ProcessExecutor.startInternal(ProcessExecutor.java:1050) ~[zt-exec-1.12.jar:na]
        at org.zeroturnaround.exec.ProcessExecutor.startInternal(ProcessExecutor.java:981) ~[zt-exec-1.12.jar:na]
        at org.zeroturnaround.exec.ProcessExecutor.execute(ProcessExecutor.java:906) ~[zt-exec-1.12.jar:na]
        at io.github.inseefrlab.helmwrapper.utils.Command.execute(Command.java:73) ~[java-helm-wrapper-v2.5.0.jar:v2.5.0
        at io.github.inseefrlab.helmwrapper.service.HelmInstallService.getAll(HelmInstallService.java:139) ~[java-helm-wrapper-v2.5.0.jar:v2.5.0]
        at fr.insee.onyxia.api.services.impl.HelmAppsService.getHelmApp(HelmAppsService.java:370) ~[classes/:v2.5.0]
        at fr.insee.onyxia.api.services.impl.HelmAppsService.lambda$getUserServices$2(HelmAppsService.java:249) ~[classes/:v2.5.0]
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[na:na]
        at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source) ~[na:na]
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[na:na]
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[na:na]
        at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(Unknown Source) ~[na:na]
        at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(Unknown Source) ~[na:na]
        at java.base/java.util.stream.AbstractTask.compute(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.CountedCompleter.exec(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) ~[na:na]
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) ~[na:na

After discussing with @olevitt, a possible way out of this could be to add a parameter in onyxia-api to allow limiting the maximum number of threads used for parallelization.
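
To make the idea concrete, here is a minimal sketch of what such a cap could look like. It is not the onyxia-api code: the class `BoundedParallelism`, the method `mapBounded` and the `maxParallelism` parameter are hypothetical names, and the `fetch` function only stands in for whatever ends up calling `helm get all`. It uses a fixed-size `ExecutorService` from the JDK so that at most `maxParallelism` lookups run at once, however many services the user has.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, for illustration only (not part of onyxia-api).
final class BoundedParallelism {

    /**
     * Applies fetch to every item, using at most maxParallelism worker threads.
     * Results are returned in the same order as the input list.
     */
    static <T, R> List<R> mapBounded(List<T> items, Function<T, R> fetch, int maxParallelism) {
        // The fixed pool is the cap: extra tasks wait in the queue instead of
        // each getting a thread of their own.
        ExecutorService pool = Executors.newFixedThreadPool(maxParallelism);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (T item : items) {
                tasks.add(() -> fetch.apply(item));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until every task is done and keeps input order.
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while listing services", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder release names; a real caller would pass the user's releases
        // and a function that runs the helm lookup.
        List<String> releases = List.of("release-a", "release-b", "release-c");
        int maxParallelism = 2; // would come from a configuration parameter
        List<String> details = mapBounded(releases, name -> "details of " + name, maxParallelism);
        System.out.println(details);
    }
}
```

Since the trace shows each `helm` call also forking a process and starting stream-pumping threads (`PumpStreamHandler.start`), capping the number of concurrent lookups would bound those as well. Whether a per-request pool like this or a shared one fits better is a design choice left open here.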
