Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locking mechanism for index sets during datanode remote reindex migration #20222

Merged
merged 22 commits into from
Aug 29, 2024

Conversation

todvora
Copy link
Contributor

@todvora todvora commented Aug 20, 2024

This PR adds proper locking mechanism for locking of index sets during remote reindex migration. These locks skip index rotation and retention if the migration is running. In the same time they force the migration to wait for finish of any currently running index set maintenance tasks before the actual data migration can start.

Description

The newly introduced DatanodeMigrationLockService, building on top of the existing distributed LockService is creating exclusive locks for index sets. The service allows two types of work with the lock - wait for a lock and block the thread or try to run a code block if the lock is free to take. The blocking mechanism is used for the migration itself. It will wait till the background maintenance is done. The nonblocking try approach is applied in the maintenance tasks. Their execution will be skipped for an index set if this is occupied by the migration.

Motivation and Context

Fixes https://github.com/Graylog2/graylog-plugin-enterprise/issues/8281

How Has This Been Tested?

Added unit tests, existing integration tests

Screenshots

image

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@todvora todvora marked this pull request as ready for review August 20, 2024 08:51
@todvora todvora force-pushed the fix/pause_rotation_thread_migration branch from ecccadc to db02256 Compare August 21, 2024 12:04
@todvora todvora changed the title Pause index rotation during migration Locking mechanism for index sets during datanode remote reindex migration Aug 22, 2024
@todvora todvora requested a review from bernd August 22, 2024 08:55
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class DatanodeMigrationLockServiceImpl implements DatanodeMigrationLockService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not fully understand the code, so please forgive me this silly question, but how do we handle multi-node setup?
Can each GL node have its own migration lock service?
If yes, is it important that activeLocks collection contains all the locks?
If yes, how do we get locks created by another GL nodes appear in that collection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that activeLocks collection is used just for extending the locks?
So maybe me previous questions are not important at all...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each GL node will have its own migration lock service. The locks are persisted in the mongodb, so they are distributed and block parallel usage of the index set across the whole cluster. The activeLocks collection is there only to keep the locally requested locks from expiring. Otherwise they'll automatically disappear from the mongo collection, as they have defined a TTL. The local activeLocks collection doesn't need to be distributed, each node is refreshing its own locks. In the moment the node disappears, these lock disappear as well, which is a good thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if you can use the RefreshingLockService to refresh the locks instead of implementing your own refresh system. 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with RefreshingLockService is that it can hold one lock only. I'd need one instance of the service for each lock. Instead of implementing own refresh system I'd need to implement a system for managing service instances, which seems even worse :-)

Copy link
Contributor

@luk-kaminski luk-kaminski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Looks complex as well. I have no idea how to simplify it, however.
I'd only suggest considering a few javadoc comments here and there, taken from the PR description and our discussions, so that no one has to ask himself the same questions we have been asking.

Thank you for a lot of hard work you have put into this one!

@todvora todvora merged commit 6641e2b into master Aug 29, 2024
7 checks passed
@todvora todvora deleted the fix/pause_rotation_thread_migration branch August 29, 2024 11:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants