Fix memory leak of command inputs #2262

Open
wants to merge 1 commit into
base: master

Conversation

Quinn-With-Two-Ns
Contributor

Fix memory leak of command inputs. Before this change, the Java SDK was holding onto the input of various operations (activities, local activities, child workflows) after the operation had already started, for two reasons:

  • The state machine was holding a strong reference to the command even after it had been sent to the server
  • The output handle needed some of the type information from the input, but that kept a reference to the input parameters as well

This could cause an OOM error if a lot of async operations were started in a specific workflow.
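
The second cause above can be illustrated with a minimal sketch. All class names here (`OperationInput`, `LeakyResultHandle`, `LeanResultHandle`) are hypothetical and are not the SDK's classes; the sketch only assumes that a result handle needs the result type and nothing else from the input.

```java
import java.lang.reflect.Type;

// Hypothetical stand-in for an operation's input; not an SDK class.
final class OperationInput {
  final Type resultType;
  final byte[] payload; // potentially large serialized arguments

  OperationInput(Type resultType, byte[] payload) {
    this.resultType = resultType;
    this.payload = payload;
  }
}

// Leaky variant: keeps the whole input (including the payload) reachable
// for as long as the handle itself is reachable.
final class LeakyResultHandle {
  private final OperationInput input;

  LeakyResultHandle(OperationInput input) {
    this.input = input;
  }

  Type resultType() {
    return input.resultType;
  }
}

// Leaner variant: copies only the type information it needs, so the
// payload can be collected once nothing else refers to the input.
final class LeanResultHandle {
  private final Type resultType;

  LeanResultHandle(OperationInput input) {
    this.resultType = input.resultType;
  }

  Type resultType() {
    return resultType;
  }
}
```

Both handles answer the same question about the result type; only the leaky one pins the payload in memory.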

There is another memory leak in the SDK where the JVM cannot garbage collect completed state machines because they are being held by the cancellation callback in the CancellationScope. This PR does not address that issue because it is significantly more complicated. Given the fixes above, its impact is much smaller, since only the relatively small amount of data for the state machine is retained.

closes #2203

@Quinn-With-Two-Ns Quinn-With-Two-Ns requested a review from a team as a code owner October 9, 2024 17:47
// the command in memory longer than needed. This is safe because if no other state machine holds
// a reference
// to the command then cancelling it wouldn't have any effect.
private WeakReference<CancellableCommand> commandRef;
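
For readers unfamiliar with the pattern in this diff excerpt, here is a rough sketch of cancelling through a weak reference. `SketchCommand` and `SketchInitialCommandHolder` are hypothetical names used only for illustration; the SDK's actual classes and method bodies are not shown in this conversation and may differ.

```java
import java.lang.ref.WeakReference;

// Hypothetical stand-in for a cancellable command; not the SDK class.
final class SketchCommand {
  private boolean cancelled;

  void cancel() {
    cancelled = true;
  }

  boolean isCancelled() {
    return cancelled;
  }
}

// Holds the command weakly, so this holder alone never keeps it in memory.
final class SketchInitialCommandHolder {
  private final WeakReference<SketchCommand> commandRef;

  SketchInitialCommandHolder(SketchCommand command) {
    this.commandRef = new WeakReference<>(command);
  }

  // Returns true if the command was still reachable and was cancelled,
  // false if the referent had already been cleared by the garbage collector.
  boolean cancelInitialCommand() {
    SketchCommand command = commandRef.get();
    if (command == null) {
      return false; // nothing else holds the command, so nobody could observe the cancel
    }
    command.cancel();
    return true;
  }
}
```

As long as some other object holds a strong reference to the command, `commandRef.get()` returns it and the cancel is visible to that holder.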
Member

@cretz cretz Oct 9, 2024

Hrmm. Can you clarify "This is safe because if no other state machine holds a reference to the command then cancelling it wouldn't have any effect" a bit? For example, I am reading TimerStateMachine and not seeing where another state machine would hold this command.

In general, why is the reference to this overall EntityStateMachineInitialCommand being held on to longer than it's cancellable? And if the lifetime of EntityStateMachineInitialCommand is longer than its lifetime of being cancellable, is there a more explicit way to mark this as no longer cancellable instead of relying on Java references?

Contributor Author

> Can you clarify "This is safe because if no other state machine holds a reference to the command then cancelling it wouldn't have any effect"

If nothing else is holding a reference to the command, then nothing can observe that it was cancelled.

> Why is the reference to this overall EntityStateMachineInitialCommand

The state machine needs to be held as long as whatever operation is running.

> is there a more explicit way to mark this as no longer cancellable instead of relying on Java references

No, there is nothing generic.

Member

@cretz cretz Oct 9, 2024

> If nothing else is holding a reference to the command then nothing can observe that it was cancelled

So does this turn cancel into a no-op? Meaning is there a possibility that nothing holds a reference but we still need to issue the cancel command for determinism correctness? Can you demonstrate a case from a workflow author POV where the TimerStateMachine is still held but it cannot be canceled? Does this only apply for detached cancellation scope because otherwise all other cancellation scopes are parented to the root and must issue cancels when the root is canceled? (I am not completely familiar with cancellation scope hierarchy in Java SDK)

Or am I confusing this cancellable command that affects how the user sees cancel, with the ability to send a cancel command on task complete?

> The state machine needs to be held as long as whatever operation is running

I guess I am a bit confused where a state machine is running but cannot be cancelled.

I wonder if there's a better way to split off cancellation from the command. It seems that the issue is that the ability to cancel is holding more information than it needs. Can we make cancel only hold the information it needs instead of the entire command with input/output? Then that even helps people that do still need to cancel things I assume.
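
The alternative raised in this comment, splitting cancellation from the large part of the command, might look roughly like the sketch below. This illustrates the suggestion only; it is not what the PR implements, and `CancelToken` and `PayloadCommand` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Small object shared by the command and by whoever may cancel it.
final class CancelToken {
  private final AtomicBoolean cancelled = new AtomicBoolean();

  void cancel() {
    cancelled.set(true);
  }

  boolean isCancelled() {
    return cancelled.get();
  }
}

// Hypothetical command that carries the large input plus the shared token.
final class PayloadCommand {
  final byte[] input; // potentially large
  final CancelToken token;

  PayloadCommand(byte[] input, CancelToken token) {
    this.input = input;
    this.token = token;
  }

  // The sender checks the token instead of a flag stored on the command.
  boolean shouldSend() {
    return !token.isCancelled();
  }
}
```

A long-lived owner would keep a strong reference to the `CancelToken` only, so cancellation stays explicit while the input becomes collectable as soon as the command itself is dropped.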

Contributor Author

Cancelling was a no-op before and after this change. This change just means we don't hold the command, which may potentially be large, in memory.

> I guess I am a bit confused where a state machine is running but cannot be cancelled.

This is the command being canceled, not the state machine.

This change has no effect on the correctness of the SDK; it is purely about releasing memory when it is not needed.

Member

@cretz cretz Oct 9, 2024

Can you demonstrate a case from a workflow author POV where the TimerStateMachine is still held but this commandRef reference may be gone? And if commandRef is gone, does that mean it cannot be canceled?

> This change just means we don't hold the command, which may potentially be large, in memory

The question becomes why does the command have to be all or none? Can the cancellation part/ability of a command be separated from the large-memory part of the command?

> This change has no effect on the correctness of the SDK; it is purely about releasing memory when it is not needed.

Right, I am just a bit concerned relying on Java reference counts now for cancellation just because some input/output state is being held along with it and we want to save memory (instead of separating cancellation from the other memory it holds).

I may just need to see a workflow where this weak reference can be removed.

Member

@cretz cretz Oct 10, 2024

> we have a workflow test like that for activity as well

Hrmm, not sure that test covers whether a cancel command is sent (or even if the activity gets scheduled or maybe the signal is sent in the same task). I may write a test just to check. So question, in that test, is the weak reference for the activity cancellable command present by the time this line is reached:

I'll try to set some time aside to check. If you need to make haste on this, I can mark approved in the meantime though.

Contributor Author

I think you are confused; cancelling a command does not send a different command. Again, this PR does not change any behaviour, and we have state machine tests for all of these transitions.

> I'll try to set some time aside to check. If you need to make haste on this, I can mark approved in the meantime though.

Yes please

Member

@cretz cretz Oct 10, 2024

> cancelling a command does not send a different command

I was thinking a COMMAND_TYPE_CANCEL_TIMER command may come through cancelCommand somehow. But I suppose not. I have not followed the logic for WorkflowStateMachine.cancellableCommands enough to see when the data is no longer referenced.

> Again, this PR does not change any behaviour, and we have state machine tests for all of these transitions

I am not sure we have tests that confirm the behavior across versions for replay safety. My concern was more about version change rather than whether the current code works within itself.

> Yes please

Ok, marking approved. I am still a bit worried about taking different code paths in workflow code based on when the GC chooses to run.

Contributor Author

> I am not sure we have tests that confirm the behavior across versions for replay safety. My concern was more about version change rather than whether the current code works within itself.

So, for your peace of mind, would a replay test with a workflow that cancels a timer in the same WFT it is scheduled in help?

Member

@cretz cretz Oct 10, 2024

Yes, specifically a test that does a replay of older-version workflow history that now would call this cancelCommand with the weak reference unset would give peace of mind (you may have to invoke System.gc() manually to ensure the weak reference is gone, but I wouldn't expect you to be able to assert that in a unit test, but this kinda shows how hard it is to properly test the code path introduced). But by no means is my peace of mind required :-) If you're confident, you don't have to spend extra time on this. Most of this is my lack of time to devote to understanding the different code paths based on GC status.
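
To make the testing difficulty concrete, here is a small sketch of probing a weak reference around `System.gc()`. `WeakRefGcProbe` is a hypothetical helper, not part of the SDK or its tests. `System.gc()` is only a request to the JVM, so a cleared referent can be observed but not guaranteed; the one thing that is guaranteed is that a referent which is still strongly reachable is never cleared.

```java
import java.lang.ref.WeakReference;

// Hypothetical helper for probing weak references in a test.
final class WeakRefGcProbe {
  // Requests garbage collection up to maxAttempts times and reports whether
  // the referent has been cleared. A false result does not prove the object
  // is still needed, because System.gc() is only a hint to the JVM.
  static boolean clearedAfterGc(WeakReference<?> ref, int maxAttempts) {
    for (int i = 0; i < maxAttempts && ref.get() != null; i++) {
      System.gc();
    }
    return ref.get() == null;
  }
}
```

A test built on this could reliably assert the "still referenced" path, while the "already collected" path would remain best-effort, which is the asymmetry described in the comment above.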

Member

@cretz cretz left a comment

Approving apprehensively just because I am unable to devote enough time to obtaining higher confidence in using weak references in workflow state machines.

Development

Successfully merging this pull request may close these issues.

Async activity inputs potential memory leak
2 participants