lndmon shuts down after GetInfo #109

Fopstronaut · 2024-07-06T14:33:39Z

lndmon keeps crashing with this error:

Lndmon exiting with error: ChainCollector GetInfo failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Stopping Prometheus Exporter
[INF] HTLC: Stopping Htlc Monitor
ChainCollector GetInfo failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded

LND version: 0.18.0-beta
lndmon version: 02ee7ca

guggero · 2024-07-08T06:43:35Z

context deadline exceeded means something is running into a timeout. Is your lnd responding quickly? Do you experience long response times as well if you run lncli getinfo?

Fopstronaut · 2024-07-11T04:09:08Z

> context deadline exceeded means something is running into a timeout. Is your lnd responding quickly? Do you experience long response times as well if you run lncli getinfo?

The longest it took was 171 ms, according to time. Most of the time, it was under 50 ms. I also haven't had any issues with timeouts using other tools like RTL.

guggero · 2024-07-11T06:20:00Z

So you see this consistently? Not just once in a while?
Can you post your lndmon config please?

Fopstronaut · 2024-07-11T15:04:29Z

Yes, I see it consistently. I only tweaked lnd.tlspath and lnd.macaroondir; everything else is the default (lnd on localhost:10009, listening on :9092, and 30s timeout). The whole command is lndmon --lnd.tlspath %h/.lnd/tls.cert --lnd.macaroondir=%h/.lnd/data/chain/bitcoin/mainnet/.

guggero · 2024-07-15T08:21:36Z

It's very weird... Can you set an explicit value (e.g. --lnd.rpctimeout=1m) to make sure it's not a weird issue with 0 being used as the timeout?
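
To illustrate why a timeout of 0 would be a problem, here is a minimal sketch of standard library behavior in Go. It is not lndmon's code, and nothing in this thread confirms that lndmon ever passes 0; expiredImmediately is a hypothetical helper introduced only for this example. A context created with a zero timeout is already past its deadline, so any call using it fails with "context deadline exceeded" no matter how fast the server is.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// expiredImmediately (hypothetical helper) reports whether a context created
// with the given timeout is already past its deadline at creation time.
func expiredImmediately(timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// For a non-positive timeout, the context is canceled right away and
	// ctx.Err() returns context.DeadlineExceeded.
	return errors.Is(ctx.Err(), context.DeadlineExceeded)
}

func main() {
	fmt.Println("timeout=0: ", expiredImmediately(0))
	fmt.Println("timeout=1m:", expiredImmediately(time.Minute))
}
```

Setting an explicit non-zero value rules this case out, which is the point of the test suggested above.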

Fopstronaut · 2024-07-18T03:06:56Z

I first tried setting the value to 30s, but it still timed out. However, once I bumped it up to 1m, I noticed it hasn't failed in the past two days. It's making me curious why it needed more than 30 seconds, but at least it's working fine now!

Fopstronaut · 2024-07-26T04:35:45Z

Never mind, looks like it's still not working; it’s timed out a few times again in the past few days.

mrfelton · 2024-07-26T11:20:39Z

We see this pretty frequently too, and I also understand the issue to come from slow calls to GetInfo. We see it from time to time even with the RPC timeout increased to 120s.

Is it intentional that the service crashes when this happens? Seems like it could be better handled by logging an error message.

guggero · 2024-07-26T11:29:21Z

Very weird that the GetInfo call would sometimes take that long...

> Is it intentional that the service crashes when this happens?

I think the idea is to make sure lndmon doesn't just sit there when the connection to lnd fails or the certificate is expired. So shutting down means it will be restarted in a Docker/k8s/systemd environment and the connection re-attempted.
But I guess we could look at the error and not shut down on timeouts. I'll create a PR.
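
A rough sketch of that idea, using only the Go standard library: treat a deadline error as a skipped scrape and keep every other error fatal. This is an assumption-laden illustration, not the code from the PR. fakeGetInfo, collect, and errLndUnreachable are hypothetical names, and the fake call stands in for the real RPC. With a real gRPC client the error is a status error, so the check would more likely compare status.Code(err) against codes.DeadlineExceeded than use errors.Is.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errLndUnreachable is a hypothetical non-timeout failure, e.g. lnd is down.
var errLndUnreachable = errors.New("connection refused")

// fakeGetInfo is a hypothetical stand-in for the GetInfo RPC: it returns
// result after latency, or the context error if the deadline passes first.
func fakeGetInfo(ctx context.Context, latency time.Duration, result error) error {
	select {
	case <-time.After(latency):
		return result
	case <-ctx.Done():
		return ctx.Err()
	}
}

// collect runs one scrape. It returns a non-nil error only for failures that
// should shut the exporter down; timeouts are logged and swallowed.
func collect(timeout, latency time.Duration, result error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := fakeGetInfo(ctx, latency, result)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, context.DeadlineExceeded):
		fmt.Println("GetInfo timed out, skipping this scrape:", err)
		return nil
	default:
		return fmt.Errorf("GetInfo failed: %w", err)
	}
}

func main() {
	// Slow backend: the call times out, but the error is not fatal.
	fmt.Println("slow lnd:", collect(10*time.Millisecond, 200*time.Millisecond, nil))
	// Unreachable backend: still reported as fatal so a supervisor can restart.
	fmt.Println("unreachable lnd:", collect(200*time.Millisecond, time.Millisecond, errLndUnreachable))
}
```

The design choice here is that only the timeout case is downgraded; connection or certificate failures would still end the process and trigger a restart under Docker/k8s/systemd.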

mrfelton · 2024-07-26T11:37:53Z

> I think the idea is to make sure lndmon doesn't just sit there when the connection to lnd fails or the certificate is expired.

Maybe use the GetState RPC to run that level of health check?

guggero · 2024-07-26T11:45:16Z

> Maybe use the GetState RPC to run that level of health check?

I added an error check in #110 for just the GetInfo call, since that seems to sometimes take longer. It's weird that we never see that problem in our environment...

What's your lnd's database backend? bbolt? And have you turned on freelist sync?

mrfelton · 2024-07-26T11:52:27Z

> What's your lnd's database backend? bbolt? And have you turned on freelist sync?

Yes, bbolt. freelist sync is not on.

mrfelton · 2024-07-26T12:46:42Z

I think part of the cause of the slowness in at least one of our cases is running bitcoind without an SSD backing it. It performs noticeably worse than with an SSD, and we have seen other instances of bad bitcoind performance resulting in extended GetInfo calls.

guggero · 2024-07-26T13:38:15Z

Ah, right. GetInfo makes one or more calls to bitcoind, so that can actually be the slowing factor. It might also hit the RPC connection limit on bitcoind if many other queries are running at the same time, in which case it needs to wait for a connection slot to free up.
So I think the approach of only allowing timeouts on the GetInfo call in the PR above is probably right. Then we'll still shut down if lndmon can't connect to lnd anymore.
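
The connection-limit effect described above can be sketched with a toy model. This is not how lnd or bitcoind is implemented; backend, query, and queryWhileBusy are hypothetical names, and a buffered channel stands in for the pool of connection slots. The point is only that a query which is itself fast can still exceed its deadline while queued behind other requests.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backend models a server that handles at most cap(slots) requests at once.
type backend struct {
	slots chan struct{}
}

func newBackend(limit int) *backend {
	return &backend{slots: make(chan struct{}, limit)}
}

// query waits for a free slot, then holds it for the duration of the work.
// It gives up as soon as ctx expires, whether queued or already working.
func (b *backend) query(ctx context.Context, work time.Duration) error {
	select {
	case b.slots <- struct{}{}:
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-b.slots }()

	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// queryWhileBusy fills every slot for busyFor, then issues a 1ms query with
// the given timeout and returns its result.
func queryWhileBusy(limit int, busyFor, timeout time.Duration) error {
	b := newBackend(limit)
	for i := 0; i < limit; i++ {
		b.slots <- struct{}{} // simulate another in-flight query
	}
	release := time.AfterFunc(busyFor, func() {
		for i := 0; i < limit; i++ {
			<-b.slots
		}
	})
	defer release.Stop()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return b.query(ctx, time.Millisecond)
}

func main() {
	// The deadline passes while the query is still waiting for a slot.
	fmt.Println("short timeout:", queryWhileBusy(1, 50*time.Millisecond, 10*time.Millisecond))
	// A longer deadline lets the same query wait for the slot and finish.
	fmt.Println("long timeout: ", queryWhileBusy(1, 50*time.Millisecond, 500*time.Millisecond))
}
```

In this model, raising the timeout only helps when the slots free up within the new deadline, which would be consistent with timeouts still showing up occasionally at 1m or 120s.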

guggero · 2024-07-30T07:14:40Z

@mrfelton and @Fopstronaut can you please check whether #110 fixes your problem? Then I can merge that and push a new release out.

guggero mentioned this issue Jul 26, 2024

collectors: don't shut down on timeout on GetInfo RPC call #110

Merged

guggero closed this as completed in #110 Aug 7, 2024
