
High CPU and Memory usage due to develop container #217

Open · jakubgs opened this issue Nov 23, 2023 · 27 comments
Labels: bug (Something isn't working)
jakubgs (Contributor) commented Nov 23, 2023

Summary:

Today I found the develop container using most of the CPU and Memory on the host:

[screenshot]

And spewing these logs:

RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[1077]:0x135a01
    at fM (/app/.next/server/pages/api/og.js:66:56317)
    at e.wbg.__wbindgen_string_get (/app/.next/server/pages/api/og.js:66:61187)
    at wasm://wasm/005420d6:wasm-function[30]:0x3fcde
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.start (/app/.next/server/pages/api/og.js:66:65273)
RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[27]:0x38c4e
    at wasm://wasm/005420d6:wasm-function[30]:0x3fdcc
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.start (/app/.next/server/pages/api/og.js:66:65273)

A restart helped, at least for now.

Environment:

  • Operating System: Alpine Linux 3.17
  • Project Version: sha256:221faa7c3b63afc28ddb3788800874d001ef5cc720a37539e510de7161e8331c
jakubgs added the bug label on Nov 23, 2023
jakubgs commented Feb 7, 2024

It appears this issue has not really been resolved; this is from today:

[screenshot]

[screenshot]

A very high CPU and memory usage spike. The CPU eventually settles, but the memory is not freed.

jakubgs commented Feb 24, 2024

I moved the logos-press-engine and free-technology setups to a separate host over a week ago due to these memory issues.

That was a good move, because now the persistent memory issues do not affect the other websites.

This is the node-01.do-ams3.press.misc host right now:

[screenshot]

[screenshot]

Can the devs here investigate this?

jakubgs commented Feb 24, 2024

Here's the free-technology container:

[screenshot]

And here's the logos-press-engine container over 7 days:

[screenshot]

jinhojang6 (Collaborator) commented:
Thanks for the updates. Did you find any logs related to this issue, @jakubgs?

jakubgs commented Feb 29, 2024

You can find all the logs on Kibana:
https://kibana.infra.status.im/goto/7939e460-d740-11ee-aaa4-85391103106b

But currently there are barely any logs, and the containers are behaving well so far:

[screenshot]

Here's the current trend:

[screenshot]

https://grafana.infra.status.im/d/QCTZ8-Vmk/single-host-dashboard?orgId=1&refresh=1m&var-host=node-01.do-ams3.press.misc&from=1708633580552&to=1709238380552

Ignore the drop-off at the end; we are having issues with our metrics cluster: https://github.com/status-im/infra-hq/issues/125

jinhojang6 (Collaborator) commented Mar 6, 2024

Thanks @jakubgs

Does the develop container mean there is an issue with the develop (staging) branch, while the production (live) website is okay?

I think there is an issue with vercel/og, given the logs you provided:

RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[1077]:0x135a01
    at fM (/app/.next/server/pages/api/og.js:66:56317)
    at e.wbg.__wbindgen_string_get (/app/.next/server/pages/api/og.js:66:61187)
    at wasm://wasm/005420d6:wasm-function[30]:0x3fcde
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.start (/app/.next/server/pages/api/og.js:66:65273)

But we are using it on both the production and staging websites to produce the og-image (link previews) on the server.
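
For reference, the og route looks roughly like this — a minimal sketch only; the exact JSX, runtime setting, and query parameters in our repo may differ:

```tsx
// pages/api/og.tsx — illustrative sketch, not the exact code in this repo.
import { ImageResponse } from '@vercel/og';
import type { NextRequest } from 'next/server';

// @vercel/og renders the JSX with satori and rasterizes it with resvg, both
// compiled to WebAssembly — which is where the "RuntimeError: unreachable"
// wasm frames in the stack trace above come from.
export const config = { runtime: 'edge' }; // assumption: edge runtime

export default function handler(req: NextRequest) {
  const { searchParams } = new URL(req.url);
  const title = searchParams.get('title') ?? 'Logos Press Engine';

  return new ImageResponse(
    (
      <div
        style={{
          display: 'flex',
          width: '100%',
          height: '100%',
          alignItems: 'center',
          justifyContent: 'center',
          fontSize: 64,
        }}
      >
        {title}
      </div>
    ),
    { width: 1200, height: 630 },
  );
}
```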

jakubgs commented Mar 7, 2024

I don't believe this issue is specific to develop; it was simply first observed on develop.

If you look at a 30-day graph, you see it always goes up to the 4 GB limit and releases memory only after a restart:

[screenshot]

Ignore the two big spikes at the end; that's an artifact of metrics cluster latency.

The host is constantly using pretty much all of its swap memory:

[screenshot]

siddarthkay commented:
Hey @jinhojang6: I was planning to assign this to myself and begin investigating this issue, if you don't mind.
Also, do you have any additional notes beyond what is already logged in this issue?

Thank you.

jinhojang6 (Collaborator) commented Mar 14, 2024

Thanks for the help, @siddarthkay.

I don't have further information yet. I just suspect that vercel/og is causing the problem, as shown in the error log:

RuntimeError: unreachable
    at wasm://wasm/005420d6:wasm-function[1077]:0x135a01
    at fM (/app/.next/server/pages/api/og.js:66:56317)
    at e.wbg.__wbindgen_string_get (/app/.next/server/pages/api/og.js:66:61187)
    at wasm://wasm/005420d6:wasm-function[30]:0x3fcde
    at new fH (/app/.next/server/pages/api/og.js:66:58408)
    at new f1 (/app/.next/server/pages/api/og.js:66:62302)
    at pr (/app/.next/server/pages/api/og.js:66:64887)

siddarthkay self-assigned this on Mar 14, 2024
jinhojang6 (Collaborator) commented Mar 19, 2024

@siddarthkay @jakubgs

It seems that many people running Next.js with Docker experience the same issue:
vercel/next.js#61047

Updating to Next v14.1.3 (published last week) could solve the problem.

siddarthkay commented:
Yeah, my money was on Next.js... although wouldn't upgrading to 14 be a disruptive change, @jinhojang6?
Since it uses an experimental router and everything?

jinhojang6 (Collaborator) commented:
@siddarthkay You are totally right. But I think we have no choice but to upgrade it to resolve this memory leak issue. What do you think?

siddarthkay commented:
I am always up for upgrading stuff :D @jinhojang6
I was trying to recreate the problem locally by profiling the app, to find out exactly where the leak comes from.
But yeah, planning for an upgrade is always good and something we should do regularly.
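
Something like this is the kind of probe I had in mind (a sketch only — memory-probe.ts is hypothetical, and the interval and threshold are guesses), loaded from a custom server entrypoint so heap snapshots can be diffed in Chrome DevTools:

```ts
// memory-probe.ts — hypothetical helper, not part of this repo.
import { writeHeapSnapshot } from 'node:v8';

const SNAPSHOT_THRESHOLD_MB = 1024; // assumption: roughly where growth looks abnormal
let snapshotTaken = false;

const toMb = (n: number) => Math.round(n / 1024 / 1024);

// Log memory stats once a minute; take a single heap snapshot when the heap
// crosses the threshold, so two snapshots can be compared to find what grows.
setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  console.log(
    `[mem] rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB ` +
      `heapTotal=${toMb(heapTotal)}MB external=${toMb(external)}MB`,
  );

  if (!snapshotTaken && toMb(heapUsed) > SNAPSHOT_THRESHOLD_MB) {
    // writeHeapSnapshot() returns the path of the generated .heapsnapshot file.
    console.log(`[mem] heap snapshot written: ${writeHeapSnapshot()}`);
    snapshotTaken = true;
  }
}, 60_000).unref();
```

Logging rss and external alongside heapUsed matters here, since the wasm frames in the trace suggest the growth may be in memory outside the JS heap.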

jinhojang6 (Collaborator) commented:
@siddarthkay I've upgraded it to 14.1.3 (05d009c) and the staging website is working well: https://lpe-staging.vercel.app/

Let me update the live website if that's okay with you.

siddarthkay commented:
That was quick, @jinhojang6!
Yes, please go ahead :)

jinhojang6 (Collaborator) commented:
Updated https://press.logos.co/. Thanks, @siddarthkay!

jakubgs commented Mar 21, 2024

Can we apply the same upgrade to the software used for https://free.technology/ as well?

jinhojang6 (Collaborator) commented:
@jakubgs I did it. Thanks for the double-check.

jakubgs commented Mar 25, 2024

Seems like we are swapping again:

[screenshot]

Memory usage has been growing for the past 4 days:

[screenshot]

Not sure if this is resolved. We could try getting a bigger host, but I really don't think you need ~4 GB to run a simple website.

yakimant (Member) commented:
What's interesting is that it's logos-press-engine this time, not free-technology:
[screenshot: 2024-03-25 12:29]

jakubgs commented Mar 26, 2024

Indeed, the spikes are probably builds, but the fact that it doesn't release resources seems like a bug to me.

Garbage software.

yakimant (Member) commented:
Looks like the Next.js upgrade doesn't solve the issue.
I will add a daily container restart for now.

siddarthkay commented:
Okay, I'll resume profiling the web app then.

yakimant (Member) commented:
Restart added: https://github.com/status-im/infra-misc/commit/dbf648695214aaf47fb08a9bc6895938e02700bc

Let's remove it once the issue is fixed.

yakimant (Member) commented:
Nice, ~100 MB per container, as it's supposed to be:
[screenshot: 2024-03-27 15:31]

jakubgs commented Apr 18, 2024

It appears the problem still persists:

[screenshot]

The source appears to be process 4111750 from the logos-press-engine-master container:

[email protected]:~ % d top logos-press-engine-master
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
dockrem+            4110682             4110616             0                   00:00               ?                   00:00:13            node /opt/yarn-v1.22.19/bin/yarn.js start
dockrem+            4111750             4110682             4                   00:00               ?                   00:26:29            next-server

Here's the graph:

[screenshot]

It seems like the daily restart is not enough. Based on today's memory usage growth, we'd need to restart every 6 hours.

jakubgs commented Apr 18, 2024

The logs contain a lot of these:

invalid Simplecast player url!

Probably unrelated, though.
