Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timed out waiting for connection #2629

Open
holzeis opened this issue Jun 16, 2024 · 5 comments
Open

Timed out waiting for connection #2629

holzeis opened this issue Jun 16, 2024 · 5 comments
Assignees
Labels
bug Something isn't working critical prod-environment

Comments

@holzeis
Copy link
Contributor

holzeis commented Jun 16, 2024

The coordinator is regularly failing to acquire a connection from the diesel pool manager.

{"timestamp":"2024-06-14T15:18:30.520400301Z","level":"ERROR","fields":{"message":"Failed to store dlc message. timed out waiting for connection","peer":"02aa6fc81e341d69b36174365644a3560bfcc51ce0fb9bb67d93712bc54117cbb8"},"target":"coordinator::dlc_handler"}
          This problem seems to be happening frequently

https://itchysats.grafana.net/explore?schemaVersion=1&panes=%7B%2291e%22:%7B%22datasource%22:%22grafanacloud-logs%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22grafanacloud-logs%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22%7Bcontainer%3D%5C%22coordinator%5C%22%7D%20%7C%3D%20%60timed%20out%20waiting%20for%20connection%60%20%7C%20json%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-30d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1

Originally posted by @holzeis in #2628 (comment)

@holzeis holzeis added bug Something isn't working prod-environment labels Jun 16, 2024
@holzeis holzeis self-assigned this Jun 17, 2024
@holzeis
Copy link
Contributor Author

holzeis commented Jun 17, 2024

@holzeis
Copy link
Contributor Author

holzeis commented Jun 17, 2024

the max connections is 100

orderbook=# select setting::int max_conn from pg_settings where name=$$max_connections$$;
 max_conn 
----------
      100
(1 row)

Note, it seems the metabase takes 15 connections from that.

orderbook=# SELECT datname, numbackends FROM pg_stat_database;
    datname    | numbackends 
---------------+-------------
               |           0
 postgres      |           0
 template1     |           0
 template0     |           0
 maker         |           0
 oracle        |           0
 metabaseappdb |          15
 orderbook     |          12
 lnd           |           0
(9 rows)

@holzeis
Copy link
Contributor Author

holzeis commented Jun 17, 2024

Weirdly the max pool size of the diesel pool manager is only 10

https://github.com/sfackler/r2d2/blob/6ab3302eeca82732a5e6c9385092fdb6e011f173/src/config.rs#L83-L94

@holzeis
Copy link
Contributor Author

holzeis commented Jun 17, 2024

I tried to reproduce this issue by opening about 100 positions, with additional 3 apps open that are constantly connected. I setup the rollover to run every 50 seconds, but every was working just fine besides a minor flaw reported here

$ docker exec  db psql -U postgres -d orderbook -c "SELECT user, query FROM pg_stat_activity where query != ''";                                  
   user   |                                                                                                                                          query                                                                                                                                           
----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 postgres | INSERT INTO "last_outbound_dlc_messages" ("peer_id", "message_hash", "message") VALUES ($1, $2, $3) ON CONFLICT ("peer_id") DO UPDATE SET "peer_id" = $4, "message_hash" = $5, "message" = $6
 postgres | INSERT INTO "last_outbound_dlc_messages" ("peer_id", "message_hash", "message") VALUES ($1, $2, $3) ON CONFLICT ("peer_id") DO UPDATE SET "peer_id" = $4, "message_hash" = $5, "message" = $6
 postgres | SELECT "dlc_messages"."message_type", "last_outbound_dlc_messages"."message" FROM ("last_outbound_dlc_messages" INNER JOIN "dlc_messages" ON ("dlc_messages"."message_hash" = "last_outbound_dlc_messages"."message_hash")) WHERE ("last_outbound_dlc_messages"."peer_id" = $1) LIMIT $2
 postgres | SET CLIENT_ENCODING TO 'UTF8'
 postgres | SELECT "users"."id", "users"."pubkey", "users"."contact", "users"."timestamp", "users"."fcm_token", "users"."last_login", "users"."nickname", "users"."version", "users"."referral_code", "users"."used_referral_code", "users"."os" FROM "users" WHERE ("users"."pubkey" = ANY($1))
 postgres | SET CLIENT_ENCODING TO 'UTF8'
 postgres | SELECT "users"."id", "users"."pubkey", "users"."contact", "users"."timestamp", "users"."fcm_token", "users"."last_login", "users"."nickname", "users"."version", "users"."referral_code", "users"."used_referral_code", "users"."os" FROM "users" WHERE ("users"."pubkey" = ANY($1))
 postgres | SELECT "users"."id", "users"."pubkey", "users"."contact", "users"."timestamp", "users"."fcm_token", "users"."last_login", "users"."nickname", "users"."version", "users"."referral_code", "users"."used_referral_code", "users"."os" FROM "users" WHERE ("users"."pubkey" = ANY($1))
 postgres | SELECT user, query FROM pg_stat_activity where query != ''
 postgres | SELECT "funding_rates"."id", "funding_rates"."start_date", "funding_rates"."end_date", "funding_rates"."rate", "funding_rates"."timestamp" FROM "funding_rates" WHERE ("funding_rates"."end_date" = $1) LIMIT $2
 postgres | SELECT "__diesel_schema_migrations"."version" FROM "__diesel_schema_migrations" ORDER BY "__diesel_schema_migrations"."version" DESC 
 postgres | SELECT * FROM "event_data" WHERE "event_data"."published_date" = $1 AND "event_data"."asset_id" = $2 ORDER BY "event_data"."created_at" LIMIT 1
(12 rows)
$ docker exec  db psql -U postgres -d orderbook -c "SELECT datname, numbackends FROM pg_stat_database";         
  datname  | numbackends 
-----------+-------------
           |           0
 postgres  |           0
 template1 |           0
 template0 |           0
 orderbook |          11
 maker     |           0
 oracle    |           1
(7 rows)

Everything seems to be stable and now errors are thrown due to timed out waiting for connection.

@holzeis
Copy link
Contributor Author

holzeis commented Jun 17, 2024

One way we could deal with that issue is to make use of try_get and handle the scenario where we do not get a connection. e.g. by retrying. This would be more resilient than to simply fail whenever we do not have enough connections available.

holzeis added a commit that referenced this issue Jun 17, 2024
Since the connections are limited it's better to not block a connection unnecessarily for a long time.

As we are using a connection pool (r2d2) getting a connection is very cheap.

This might help in the context of #2629
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working critical prod-environment
Projects
None yet
Development

No branches or pull requests

1 participant