
Rework threading model #7778
Merged · 13 commits · Oct 4, 2024
Conversation

@NBKelly (Collaborator) commented Sep 27, 2024

This turned out to be 200% easier than I thought it would be.

We're supposedly not meant to use virtual threads for anything that touches atoms, because atoms already do their own funny concurrency things, so we're still using claypoole's thread pools. But now we keep a fixed set of pools (cores + 2) and assign each lobby a pool at random from among the least occupied ones. The only extra state we have to track is an occupancy counter (a single integer) per pool. Other than that, this plugs 1:1 into the other code I wrote, which is neat.
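A minimal sketch of that scheme, assuming claypoole pools of one thread each; the identifiers (`pools`, `occupancy`, `assign-pool!`) are illustrative, not the PR's actual names:

```clojure
(ns sketch.pools
  (:require [com.climate.claypoole :as cp]))

;; A fixed set of single-threaded pools, sized cores + 2.
(def pool-count (+ (cp/ncpus) 2))
(def pools (vec (repeatedly pool-count #(cp/threadpool 1))))

;; The only extra bookkeeping: one occupancy counter per pool.
(def occupancy (atom (vec (repeat pool-count 0))))

(defn assign-pool!
  "Pick a pool at random from the least-occupied ones, bump its
  counter, and return its index for the lobby to remember.
  (Read-then-swap isn't atomic, but a slightly stale pick is harmless.)"
  []
  (let [counts     @occupancy
        lowest     (apply min counts)
        candidates (keep-indexed (fn [i n] (when (= n lowest) i)) counts)
        idx        (rand-nth (vec candidates))]
    (swap! occupancy update idx inc)
    idx))

(defn release-pool!
  "Decrement a pool's counter when its lobby closes."
  [idx]
  (swap! occupancy update idx dec))
```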

I also enabled the simple-auto-threading? option on sente's socket server, which I'll try out on playtest, but that might be overkill. Probably best to trial that later, instead of doing too much at once.
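For reference, enabling it is a single option on sente's socket-server constructor; a sketch, with the http-kit adapter standing in for whatever adapter the server actually uses:

```clojure
(ns sketch.ws
  (:require [taoensso.sente :as sente]
            [taoensso.sente.server-adapters.http-kit
             :refer [get-sch-adapter]]))

(def chsk-server
  (sente/make-channel-socket-server!
    (get-sch-adapter)
    ;; opt in to sente's built-in auto-threading for incoming events
    {:simple-auto-threading? true}))
```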

I'm going to sample it on playtest over the next week or two and see if it causes any issues or is noticeable at all.


Summary of changes:

  • We use a fixed pool of threads to manage games. Each game is assigned a thread to work with from the pool, and once there are enough games you'll have two, three, or more lobbies on the same thread.
  • Everything is still queued up sequentially on its thread, so events are still processed in order (see the ordering sketch after this list)
  • The lobby itself uses its own thread, so lobby updates won't slow down games, and games will (mostly) not slow down other games or the lobby
  • We now log the throughput time of each ws interaction as it's handled, and display that in our metrics (number of invocations, average, percentiles)
  • Prune idle users a little more aggressively (1 hour of no interaction)
  • The list of lobbies gets stripped and sorted only once, instead of once per user. (Note that the sorted result could theoretically be cached back into the state, which would amortize the cost across invocations; I don't think it's worth it though.)
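The in-order guarantee in the second bullet falls out of the single-threaded pools: an executor with one thread runs its queue FIFO. A tiny demonstration, assuming claypoole:

```clojure
(require '[com.climate.claypoole :as cp])

(def game-pool (cp/threadpool 1)) ; one thread, so tasks run FIFO

;; Two events for the same lobby land on the same pool, so the second
;; can never start before the first finishes:
(cp/future game-pool (println "resolve event 1"))
(cp/future game-pool (println "resolve event 2")) ; always prints second
```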

@NoahTheDuke (Collaborator)

cool, this looks closer to what we need

@NBKelly (Collaborator, Author) commented Sep 28, 2024

I decided it would be worthwhile to trace execution time on a per-endpoint basis. The buckets get cleared out every 5 minutes when we resolve telemetry, so they shouldn't explode the state.

[screenshot: telemetry_001 — per-endpoint timing metrics]
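A sketch of what such buckets might look like (names hypothetical): an atom mapping endpoint to recorded durations, drained each time telemetry resolves:

```clojure
(defonce endpoint-timings (atom {})) ; endpoint-id -> vector of millis

(defmacro with-timing
  "Run body, recording its wall-clock duration under endpoint-id."
  [endpoint-id & body]
  `(let [start#  (System/nanoTime)
         result# (do ~@body)]
     (swap! endpoint-timings update ~endpoint-id (fnil conj [])
            (/ (- (System/nanoTime) start#) 1e6))
     result#))

(defn resolve-telemetry!
  "Summarize and clear the buckets (run every 5 minutes) so the
  atom can't grow without bound."
  []
  (let [[timings _] (reset-vals! endpoint-timings {})]
    (into {}
          (for [[endpoint ms] timings
                :let [sorted (vec (sort ms))
                      n      (count sorted)]]
            [endpoint {:count n
                       :mean  (/ (reduce + sorted) n)
                       :p95   (nth sorted (int (* 0.95 (dec n))))}]))))
```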

@NBKelly (Collaborator, Author) commented Sep 28, 2024

I cleaned up broadcast-lobby-list a little.
I noticed our workflow was:

  • Filter the list (per user)
  • Sort it (per user, on their own filtered version)
  • Apply summaries (strip info)
  • Wrap it in a vec

Which is exactly equivalent to:

  • Sort it once
  • Strip it once
  • per user, filter and wrap in a vec

So we can cut out around half of the work for that, which is nice.
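In code, the reworked shape is roughly this; the helpers (`lobby-summary`, `visible-to?`, `send!`) and the `:date` sort key are placeholders, declared only so the sketch compiles:

```clojure
(declare lobby-summary visible-to? send!)

(defn broadcast-lobby-list [lobbies users]
  (let [summaries (->> lobbies
                       (sort-by :date)       ; sort once
                       (map lobby-summary))] ; strip once
    (doseq [user users]
      ;; only the cheap per-user filter and vec wrap remain
      (send! user (vec (filter #(visible-to? user %) summaries))))))
```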

@NBKelly (Collaborator, Author) commented Sep 28, 2024

I changed the lobby sub timeout from 6 hours to 1 hour, made it so stale subs actually get pruned (I missed a case before), and set them up to be cleaned up periodically as well (currently every 5 or so minutes).
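A hypothetical loop covering both halves, pruning subs idle for over an hour on a roughly 5-minute cadence (all names are mine):

```clojure
(def sub-timeout-ms (* 60 60 1000)) ; 1 hour

(defn prune-idle-subs!
  "Drop subscribers whose last interaction is older than the timeout."
  [subs]
  (let [now (System/currentTimeMillis)]
    (swap! subs
           (fn [m]
             (into {}
                   (remove (fn [[_ {:keys [last-seen]}]]
                             (> (- now last-seen) sub-timeout-ms))
                           m))))))

(defn start-prune-loop!
  "Kick off the periodic cleanup, roughly every 5 minutes."
  [subs]
  (future
    (while true
      (Thread/sleep (* 5 60 1000))
      (prune-idle-subs! subs))))
```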

@NBKelly (Collaborator, Author) commented Sep 29, 2024

right now it all seems to be working on playtest, so I'm going to mark this as ready for review

NBKelly marked this pull request as ready for review on September 29, 2024 00:35
@NBKelly (Collaborator, Author) commented Oct 4, 2024

just realized I made a really dumb mistake. I'll commit a few more times in a minute.

NBKelly marked this pull request as draft on October 4, 2024 02:30
@NBKelly (Collaborator, Author) commented Oct 4, 2024

ok so it turns out I was passing the already-executed function bodies to the thread pools, so they weren't really doing anything at all 😂. I converted them to macros (which took an embarrassingly long time to figure out), and they appear to function correctly now, so we should actually be load balancing stuff.
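The distinction in miniature (names hypothetical): a function's arguments are evaluated before the call, so the work had already run on the caller's thread by the time the pool saw it, while a macro receives the unevaluated body and can hand it to the pool intact:

```clojure
(require '[com.climate.claypoole :as cp])

;; Broken: (run-on-pool-broken! pool (handle-event! lobby msg)) runs
;; handle-event! on the caller's thread; the pool just gets its result.
(defn run-on-pool-broken! [pool result]
  (cp/future pool result))

;; Fixed: the body stays unevaluated until claypoole's future runs it
;; on the pool's own thread.
(defmacro run-on-pool! [pool & body]
  `(cp/future ~pool ~@body))
```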

I also made it so that if for some reason we try to execute an op on a nil lobby, and the function doesn't already check for that, we execute it on the lobby thread rather than on a non-existent thread from the non-existent lobby.

@NBKelly NBKelly marked this pull request as ready for review October 4, 2024 03:18
@NoahTheDuke (Collaborator)

i say go for it. can't get much worse

NBKelly merged commit 1e5b0d1 into mtgred:master on Oct 4, 2024
3 checks passed