
High CPU and Memory usage due to develop container #217

Open · jakubgs opened this issue Nov 23, 2023 · 27 comments
Labels: bug (Something isn't working)
jakubgs (Contributor) commented Nov 23, 2023

Summary:

Today I found the develop container using most of the CPU and Memory on the host:

[screenshot]

And spewing these logs:

RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[1077]:0x135a01
    at fM (/app/.next/server/pages/api/og.js:66:56317)
    at e.wbg.__wbindgen_string_get (/app/.next/server/pages/api/og.js:66:61187)
    at wasm://wasm/005420d6:wasm-function[30]:0x3fcde
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.start (/app/.next/server/pages/api/og.js:66:65273)
RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[27]:0x38c4e
    at wasm://wasm/005420d6:wasm-function[30]:0x3fdcc
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.start (/app/.next/server/pages/api/og.js:66:65273)

A restart helped, at least for now.

Environment:

  • Operating System: Alpine Linux 3.17
  • Project Version: sha256:221faa7c3b63afc28ddb3788800874d001ef5cc720a37539e510de7161e8331c
jakubgs added the bug label on Nov 23, 2023
jakubgs commented Feb 7, 2024

It appears this issue has not really been resolved; this is from today:

[screenshot]

[screenshot]

A very high CPU and memory usage spike. The CPU eventually settles, but the memory is not freed.

jakubgs commented Feb 24, 2024

I moved the logos-press-engine and free-technology setups to a separate host over a week ago due to these memory issues.

That was a good move, because now the persistent memory issues do not affect the other websites.

This is the node-01.do-ams3.press.misc host right now:

[screenshot]

[screenshot]

Can the devs here investigate this?

jakubgs commented Feb 24, 2024

Here's the free-technology container:

[screenshot]

And here's the logos-press-engine container over 7 days:

[screenshot]

jinhojang6 (Collaborator) commented:
Thanks for the updates. Did you find any logs related to this issue, @jakubgs?

jakubgs commented Feb 29, 2024

You can find all the logs on Kibana:
https://kibana.infra.status.im/goto/7939e460-d740-11ee-aaa4-85391103106b

But currently there are barely any logs, and the containers are behaving well so far:

[screenshot]

Here's the current trend:

[screenshot]

https://grafana.infra.status.im/d/QCTZ8-Vmk/single-host-dashboard?orgId=1&refresh=1m&var-host=node-01.do-ams3.press.misc&from=1708633580552&to=1709238380552

Ignore the drop-off at the end; we are having issues with our metrics cluster: https://github.com/status-im/infra-hq/issues/125

jinhojang6 (Collaborator) commented Mar 6, 2024

Thanks @jakubgs

Does the develop container mean there is an issue with the develop (staging) branch, while the production (live) website is okay?

I think there is an issue with vercel/og, given the logs you provided:

RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[1077]:0x135a01
    at fM (/app/.next/server/pages/api/og.js:66:56317)
    at e.wbg.__wbindgen_string_get (/app/.next/server/pages/api/og.js:66:61187)
    at wasm://wasm/005420d6:wasm-function[30]:0x3fcde
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.start (/app/.next/server/pages/api/og.js:66:65273)

But we are using it on both the production and staging websites to produce the og-image (link previews) on the server.
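
For reference, the og route looks roughly like this — a minimal sketch only; the exact JSX, runtime setting, and query parameters in our repo may differ:

```tsx
// pages/api/og.tsx — illustrative sketch, not the exact code in this repo.
import { ImageResponse } from '@vercel/og';
import type { NextRequest } from 'next/server';

// @vercel/og renders the JSX with satori and rasterizes it with resvg, both
// compiled to WebAssembly — which is where the "RuntimeError: unreachable"
// wasm frames in the stack trace above come from.
export const config = { runtime: 'edge' }; // assumption: edge runtime

export default function handler(req: NextRequest) {
  const { searchParams } = new URL(req.url);
  const title = searchParams.get('title') ?? 'Logos Press Engine';

  return new ImageResponse(
    (
      <div
        style={{
          display: 'flex',
          width: '100%',
          height: '100%',
          alignItems: 'center',
          justifyContent: 'center',
          fontSize: 64,
        }}
      >
        {title}
      </div>
    ),
    { width: 1200, height: 630 },
  );
}
```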

jakubgs commented Mar 7, 2024

I don't believe this issue is specific to develop; it was simply first observed on develop.

If you look at a 30-day graph, you see it always goes up to the 4 GB limit and releases memory only after a restart:

[screenshot]

Ignore the two big spikes at the end; that's an artifact of metrics cluster latency.

The host is constantly using pretty much all of its swap memory:

[screenshot]

siddarthkay commented:
Hey @jinhojang6: I was planning to assign this to myself and begin investigating this issue, if you don't mind.
Also, do you have any additional notes beyond what is already logged in this issue?

Thank you.

jinhojang6 (Collaborator) commented Mar 14, 2024

Thanks for the help, @siddarthkay.

I don't have further information yet. I just suspect that vercel/og is causing the problem, as shown in the error log:

RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[1077]:0x135a01
    at fM (/app/.next/server/pages/api/og.js:66:56317)
    at e.wbg.__wbindgen_string_get (/app/.next/server/pages/api/og.js:66:61187)
    at wasm://wasm/005420d6:wasm-function[30]:0x3fcde
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)

siddarthkay self-assigned this on Mar 14, 2024
jinhojang6 (Collaborator) commented Mar 19, 2024

@siddarthkay @jakubgs

It seems that many people running Next.js with Docker experience the same issue:
vercel/next.js#61047

Updating to Next v14.1.3 (published last week) could solve the problem.

siddarthkay commented:
Yeah, my money was on Next.js... although wouldn't upgrading to 14 be a disruptive change, @jinhojang6?
Since it uses an experimental router and everything?

jinhojang6 (Collaborator) commented:
@siddarthkay You are totally right. But I think we have no choice but to upgrade it to resolve this memory leak issue. What do you think?

siddarthkay commented:
I am always up for upgrading stuff :D @jinhojang6
I was trying to recreate the problem locally by profiling the app, to find out exactly where the leak comes from.
But yeah, planning for an upgrade is always good and something we should do regularly.
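
Something like this is the kind of probe I had in mind (a sketch only — memory-probe.ts is hypothetical, and the interval and threshold are guesses), loaded from a custom server entrypoint so heap snapshots can be diffed in Chrome DevTools:

```ts
// memory-probe.ts — hypothetical helper, not part of this repo.
import { writeHeapSnapshot } from 'node:v8';

const SNAPSHOT_THRESHOLD_MB = 1024; // assumption: roughly where growth looks abnormal
let snapshotTaken = false;

const toMb = (n: number) => Math.round(n / 1024 / 1024);

// Log memory stats once a minute; take a single heap snapshot when the heap
// crosses the threshold, so two snapshots can be compared to find what grows.
setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  console.log(
    `[mem] rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB ` +
      `heapTotal=${toMb(heapTotal)}MB external=${toMb(external)}MB`,
  );

  if (!snapshotTaken && toMb(heapUsed) > SNAPSHOT_THRESHOLD_MB) {
    // writeHeapSnapshot() returns the path of the generated .heapsnapshot file.
    console.log(`[mem] heap snapshot written: ${writeHeapSnapshot()}`);
    snapshotTaken = true;
  }
}, 60_000).unref();
```

Logging rss and external alongside heapUsed matters here, since the wasm frames in the trace suggest the growth may be in memory outside the JS heap.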

jinhojang6 (Collaborator) commented:
@siddarthkay I've upgraded it to 14.1.3 (05d009c) and the staging website is working well: https://lpe-staging.vercel.app/

Let me update the live website if that's okay with you.

siddarthkay commented:
That was quick, @jinhojang6!
Yes, please go ahead :)

jinhojang6 (Collaborator) commented:
Updated https://press.logos.co/. Thanks, @siddarthkay!

jakubgs commented Mar 21, 2024

Can we apply the same upgrade to the software used for https://free.technology/ as well?

jinhojang6 (Collaborator) commented:
@jakubgs I did it. Thanks for the double-check.

jakubgs commented Mar 25, 2024

Seems like we are swapping again:

[screenshot]

Memory usage has been growing for the past 4 days:

[screenshot]

Not sure if this is resolved. We could try getting a bigger host, but I really don't think you need ~4 GB to run a simple website.

yakimant (Member) commented:
What's interesting is that it's logos-press-engine this time, not free-technology:
[screenshot: 2024-03-25 12:29]

jakubgs commented Mar 26, 2024

Indeed, the spikes are probably builds, but the fact that it doesn't release resources seems like a bug to me.

Garbage software.

yakimant (Member) commented:
Looks like the Next.js upgrade doesn't solve the issue.
I will add a daily container restart for now.

siddarthkay commented:
Okay, I'll resume profiling the web app then.

yakimant (Member) commented:
Restart added: https://github.com/status-im/infra-misc/commit/dbf648695214aaf47fb08a9bc6895938e02700bc

Let's remove it once the issue is fixed.

yakimant (Member) commented:
Nice, ~100 MB per container, as it's supposed to be:
[screenshot: 2024-03-27 15:31]

jakubgs commented Apr 18, 2024

It appears the problem still persists:

[screenshot]

The source appears to be process 4111750 from the logos-press-engine-master container:

[email protected]:~ % d top logos-press-engine-master
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
dockrem+            4110682             4110616             0                   00:00               ?                   00:00:13            node /opt/yarn-v1.22.19/bin/yarn.js start
dockrem+            4111750             4110682             4                   00:00               ?                   00:26:29            next-server

Here's the graph:

[screenshot]

It seems like the daily restart is not enough. Based on today's memory usage growth, we'd need to restart every 6 hours.

jakubgs commented Apr 18, 2024

The logs contain a lot of these:

invalid Simplecast player url!

Probably unrelated, though.
