Fix memory leak of command inputs #2262
// the command in memory longer than needed. This is safe because if no other state machine holds
// a reference to the command then cancelling it wouldn't have any effect.
private WeakReference<CancellableCommand> commandRef;
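The weak-reference pattern under review can be illustrated with a minimal sketch. It is not the SDK's code: `DemoCommand`, `WeakCommandHolder`, `cancelIfPresent`, and `simulateCollected` are hypothetical names, and `WeakReference.clear()` is used only to imitate garbage collection deterministically.

```java
import java.lang.ref.WeakReference;

// Hypothetical stand-in for the SDK's CancellableCommand; not the real class.
final class DemoCommand {
  final byte[] input; // potentially large payload
  boolean canceled;

  DemoCommand(byte[] input) {
    this.input = input;
  }

  void cancel() {
    canceled = true;
  }
}

// Hypothetical holder that keeps only a weak reference to the command,
// so the holder alone never keeps the payload alive.
final class WeakCommandHolder {
  private final WeakReference<DemoCommand> commandRef;

  WeakCommandHolder(DemoCommand command) {
    this.commandRef = new WeakReference<>(command);
  }

  // Returns true if the command was still reachable and was canceled.
  boolean cancelIfPresent() {
    DemoCommand command = commandRef.get();
    if (command == null) {
      // Nobody else holds the command, so nobody could observe a cancel.
      return false;
    }
    command.cancel();
    return true;
  }

  // Illustration-only hook that imitates the referent being collected.
  void simulateCollected() {
    commandRef.clear();
  }
}

final class WeakCommandDemo {
  public static void main(String[] args) {
    DemoCommand command = new DemoCommand(new byte[1024]);
    WeakCommandHolder holder = new WeakCommandHolder(command);

    System.out.println(holder.cancelIfPresent()); // true
    System.out.println(command.canceled);         // true

    holder.simulateCollected();
    System.out.println(holder.cancelIfPresent()); // false
  }
}
```

In this sketch the second call is a no-op because the referent is gone, which matches the claim in the code comment that a cancel nobody can observe has no effect.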
Hrmm. Can you clarify "This is safe because if no other state machine holds a reference to the command then cancelling it wouldn't have any effect" a bit? For example, I am reading TimerStateMachine and not seeing where another state machine would hold this command.
In general, why is the reference to this overall EntityStateMachineInitialCommand being held on to longer than it's cancellable? And if the lifetime of EntityStateMachineInitialCommand is longer than its lifetime of being cancellable, is there a more explicit way to mark this as no longer cancellable instead of relying on Java references?
Can you clarify "This is safe because if no other state machine holds a reference to the command then cancelling it wouldn't have any effect"
If nothing else is holding a reference to the command then nothing can observe that it was cancelled.
Why is the reference to this overall EntityStateMachineInitialCommand
The state machine needs to be held as long as whatever operation is running
is there a more explicit way to mark this as no longer cancellable instead of relying on Java references
No, there is nothing generic
If nothing else is holding a reference to the command then nothing can observe that it was cancelled
So does this turn cancel into a no-op? Meaning, is there a possibility that nothing holds a reference but we still need to issue the cancel command for determinism correctness? Can you demonstrate a case from a workflow author POV where the TimerStateMachine is still held but it cannot be canceled? Does this only apply to detached cancellation scopes, because otherwise all other cancellation scopes are parented to the root and must issue cancels when the root is canceled? (I am not completely familiar with the cancellation scope hierarchy in the Java SDK.)
Or am I confusing this cancellable command that affects how the user sees cancel, with the ability to send a cancel command on task complete?
The state machine needs to be held as long as whatever operation is running
I guess I am a bit confused where a state machine is running but cannot be cancelled.
I wonder if there's a better way to split off cancellation from the command. It seems that the issue is that the ability to cancel is holding more information than it needs. Can we make cancel only hold the information it needs instead of the entire command with input/output? Then that even helps people that do still need to cancel things I assume.
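The split suggested here could look roughly like the sketch below. It only illustrates the reviewer's idea and is not how the SDK is structured: `LargeCommand`, `CancelHandle`, and `SplitCommandDemo.schedule` are hypothetical names, and the sketch assumes a cancel needs nothing beyond a command id and a flag.

```java
// Hypothetical command that carries the potentially large input payload.
final class LargeCommand {
  final long commandId;
  final byte[] input;

  LargeCommand(long commandId, byte[] input) {
    this.commandId = commandId;
    this.input = input;
  }
}

// Hypothetical handle that holds only what a cancel would need:
// the command id and a flag, with no reference to the payload.
final class CancelHandle {
  private final long commandId;
  private boolean canceled;

  CancelHandle(long commandId) {
    this.commandId = commandId;
  }

  long commandId() {
    return commandId;
  }

  void cancel() {
    canceled = true;
  }

  boolean isCanceled() {
    return canceled;
  }
}

final class SplitCommandDemo {
  // Builds the command, then returns a handle that does not reference it.
  // A real implementation would hand the command to an outbound queue here.
  static CancelHandle schedule(long commandId, byte[] input) {
    LargeCommand command = new LargeCommand(commandId, input);
    return new CancelHandle(command.commandId);
  }

  public static void main(String[] args) {
    CancelHandle handle = schedule(7L, new byte[1024 * 1024]);
    handle.cancel();
    System.out.println(handle.commandId() + " canceled=" + handle.isCanceled());
  }
}
```

With this shape the handle can be retained for as long as the state machine lives without pinning the input in memory. Whether the SDK's command types allow such a split is not established in this thread.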
Cancelling was a no-op before and after this change. This change just means we don't hold the command, which may potentially be large, in memory.
I guess I am a bit confused where a state machine is running but cannot be cancelled.
This is the command being canceled, not the state machine.
This change has no effect on the correctness of the SDK; it is purely about releasing memory when it is not needed.
Can you demonstrate a case from a workflow author POV where the TimerStateMachine is still held but this commandRef reference may be gone? And if commandRef is gone, does that mean it cannot be canceled?
This change just means we don't hold the command, which may potentially be large, in memory
The question becomes: why does the command have to be all or none? Can the cancellation part/ability of a command be separated from the large-memory part of the command?
This change has no effect on the correctness of the SDK; it is purely about releasing memory when it is not needed.
Right, I am just a bit concerned about relying on Java reference counts now for cancellation just because some input/output state is being held along with it and we want to save memory (instead of separating cancellation from the other memory it holds).
I may just need to see a workflow where this weak reference can be removed.
we have a workflow test like that for activity as well
Hrmm, not sure that test covers whether a cancel command is sent (or even if the activity gets scheduled or maybe the signal is sent in the same task). I may write a test just to check. So question, in that test, is the weak reference for the activity cancellable command present by the time this line is reached:
Line 95 in d1dc2e1
cancellationScope.cancel();
I'll try to set some time aside to check. If you need to make haste on this, I can mark approved in the meantime though.
I think you are confused; cancelling a command does not send a different command. Again, this PR does not change any behaviour, and we have state machine tests for all of these transitions.
I'll try to set some time aside to check. If you need to make haste on this, I can mark approved in the meantime though.
Yes please
cancelling a command does not send a different command
I was thinking a COMMAND_TYPE_CANCEL_TIMER command may come through cancelCommand somehow. But I suppose not. I have not followed the logic for WorkflowStateMachine.cancellableCommands enough to see when the data is no longer referenced.
Again, this PR does not change any behaviour, and we have state machine tests for all of these transitions.
I am not sure we have tests that confirm the behavior across versions for replay safety. My concern was more about version change rather than whether the current code works within itself.
Yes please
Ok, marking approved. I am still a bit worried about taking different code paths in workflow code based on when the GC chooses to run.
I am not sure we have tests that confirm the behavior across versions for replay safety. My concern was more about version change rather than whether the current code works within itself.
So for your peace of mind, would a replay test with a workflow that cancels a timer in the same WFT it is scheduled in be enough?
Yes, specifically a test that replays an older-version workflow history that would now call this cancelCommand with the weak reference unset would give peace of mind. You may have to invoke System.gc() manually to ensure the weak reference is gone, and I wouldn't expect you to be able to assert that in a unit test, which kinda shows how hard it is to properly test the code path introduced. But by no means is my peace of mind required :-) If you're confident, you don't have to spend extra time on this. Most of this is my lack of time to devote to understanding the different code paths based on GC status.
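The testing difficulty mentioned above can be made concrete with a small best-effort helper. This is a hedged sketch, not part of the SDK or its test suite: `GcProbe` and `awaitCleared` are hypothetical names. `System.gc()` is only a request to the JVM, so the helper reports whether the reference was observed cleared and never guarantees that it will be.

```java
import java.lang.ref.WeakReference;

final class GcProbe {
  // Requests GC up to `attempts` times and reports whether the weak
  // reference was observed cleared. System.gc() is only a hint, so a
  // false result does not prove the referent is still strongly reachable.
  static boolean awaitCleared(WeakReference<?> ref, int attempts) {
    for (int i = 0; i < attempts; i++) {
      if (ref.get() == null) {
        return true;
      }
      System.gc();
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        break;
      }
    }
    return ref.get() == null;
  }

  public static void main(String[] args) {
    Object strong = new Object();
    WeakReference<Object> held = new WeakReference<>(strong);
    // The referent is used after the call, so it stays strongly reachable.
    System.out.println(awaitCleared(held, 2)); // false
    System.out.println(held.get() == strong);  // true
  }
}
```

A test built on such a helper can only skip or branch on the result, which is why a deterministic assertion about the cleared-reference path is hard to write.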
Approving apprehensively just because I am unable to devote enough time to obtaining higher confidence in using weak references in workflow state machines.
Fix memory leak of command inputs. Before this change the Java SDK was holding onto the input of various operations (activities, local activities, child workflow) after the operation had already started for two reasons:
This could cause an OOM error if a lot of async operations were started in a specific workflow.
There is another memory leak in the SDK where the JVM cannot garbage collect completed state machines because they are being held by the cancellation callback in the CancellationScope. This PR does not address that issue because it is significantly more complicated. Given the fixes above, its impact is much smaller, since only the relatively small amount of data for the state machine is retained.
closes #2203