lndmon shuts down after GetInfo #109

Closed
Fopstronaut opened this issue Jul 6, 2024 · 15 comments · Fixed by #110

Comments

@Fopstronaut

lndmon keeps crashing with this error:

Lndmon exiting with error: ChainCollector GetInfo failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Stopping Prometheus Exporter
[INF] HTLC: Stopping Htlc Monitor
ChainCollector GetInfo failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded

LND version: 0.18.0-beta
lndmon version: 02ee7ca

@guggero
Member

guggero commented Jul 8, 2024

context deadline exceeded means something is running into a timeout. Is your lnd responding quickly? Do you experience long response times as well if you run lncli getinfo?

@Fopstronaut
Author

> context deadline exceeded means something is running into a timeout. Is your lnd responding quickly? Do you experience long response times as well if you run lncli getinfo?

The longest it took was 171 ms, according to time. Most of the time, it was under 50 ms. I also haven't had any issues with timeouts using other tools like RTL.

@guggero
Member

guggero commented Jul 11, 2024

So you see this consistently? Not just once in a while?
Can you post your lndmon config please?

@Fopstronaut
Author

Yes, I see it consistently. I only tweaked lnd.tlspath and lnd.macaroondir - everything else is the default (lnd on localhost:10009, listening on :9092, and 30s timeout). The whole command is lndmon --lnd.tlspath %h/.lnd/tls.cert --lnd.macaroondir=%h/.lnd/data/chain/bitcoin/mainnet/.

@guggero
Member

guggero commented Jul 15, 2024

It's very weird... Can you set an explicit value (e.g. --lnd.rpctimeout=1m) to make sure it's not a weird issue with 0 being used as the timeout?

@Fopstronaut
Author

I first tried setting the value to 30s, but it still timed out. However, once I bumped it up to 1m, I noticed it hasn't failed in the past two days. It's making me curious why it needed more than 30 seconds, but at least it's working fine now!

@Fopstronaut
Author

Never mind, looks like it's still not working; it’s timed out a few times again in the past few days.

@mrfelton
Contributor

We see this pretty frequently too, and I also understand the issue to come from slow calls to GetInfo. We see it from time to time even with the RPC timeout increased to 120s.

Is it intentional that the service crashes when this happens? Seems like it could be better handled by logging an error message.

@guggero
Member

guggero commented Jul 26, 2024

Very weird that the GetInfo call would sometimes take that long...

> Is it intentional that the service crashes when this happens?

I think the idea is to make sure lndmon doesn't just sit there when the connection to lnd fails or the certificate is expired. So shutting down means it will be restarted in a Docker/k8s/systemd environment and the connection re-attempted.
But I guess we could look at the error and not shut down on timeouts. I'll create a PR.

@mrfelton
Contributor

> I think the idea is to make sure lndmon doesn't just sit there when the connection to lnd fails or the certificate is expired.

Maybe use the GetState RPC to run that level of health check?

@guggero
Member

guggero commented Jul 26, 2024

> Maybe use the GetState RPC to run that level of health check?

I added an error check in #110 for just the GetInfo call, since that seems to sometimes take longer. It's weird that we never see that problem in our environment...

What's your lnd's database backend? bbolt? And have you turned on freelist sync?

@mrfelton
Contributor

> What's your lnd's database backend? bbolt? And have you turned on freelist sync?

Yes, bbolt. freelist sync is not on.

@mrfelton
Contributor

I think part of the cause of the slowness in at least one of our cases is running bitcoind without an SSD backing it. It performs noticeably worse than with an SSD, and we have seen other instances of bad bitcoind performance resulting in extended GetInfo calls.

@guggero
Member

guggero commented Jul 26, 2024

Ah, right. Because GetInfo does make one or more calls to bitcoind, that can actually be the slowing factor. And it might actually hit the RPC connection limit on bitcoind if many other queries are going on at the same time, in which case it needs to wait for a connection slot to free up.
So I think the approach of only allowing timeouts on the GetInfo call in the PR above is probably right. Then we'll still shut down if lndmon can't connect to lnd anymore.

@guggero
Member

guggero commented Jul 30, 2024

@mrfelton and @Fopstronaut can you please check whether #110 fixes your problem? Then I can merge that and push a new release out.
