Fix memory leak of command inputs #2262

Quinn-With-Two-Ns · 2024-10-09T17:47:44Z

Fix memory leak of command inputs. Before this change, the Java SDK was holding onto the input of various operations (activities, local activities, child workflows) after the operation had already started, for two reasons:

The state machine was holding a strong reference to the command even after it had been sent to the server.
The output handle needed some of the type information from the input, but that kept a reference to the input parameters as well.

This could cause an OOM error if a lot of async operations were started in a specific workflow.
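The second cause can be illustrated with a minimal sketch. All class names here are hypothetical and do not come from the SDK; the point is only that a handle which needs the result type, but stores the whole input to get it, keeps the payload reachable, whereas copying the type out does not.

```java
import java.lang.reflect.Type;

// Hypothetical names throughout; none of these classes exist in the SDK.
final class InputRetentionSketch {

  /** Stand-in for an operation's input: a possibly large payload plus the expected result type. */
  static final class OperationInput {
    final byte[] payload;
    final Type resultType;

    OperationInput(byte[] payload, Type resultType) {
      this.payload = payload;
      this.resultType = resultType;
    }
  }

  /** Leaky shape: only the type is needed, but the whole input stays reachable. */
  static final class LeakyHandle {
    private final OperationInput input;

    LeakyHandle(OperationInput input) {
      this.input = input;
    }

    Type resultType() {
      return input.resultType;
    }

    boolean retainsPayload() {
      return input.payload != null;
    }
  }

  /** Lean shape: copies the type out, so the payload is collectable once nothing else holds it. */
  static final class LeanHandle {
    private final Type resultType;

    LeanHandle(OperationInput input) {
      this.resultType = input.resultType;
    }

    Type resultType() {
      return resultType;
    }
  }

  public static void main(String[] args) {
    OperationInput input = new OperationInput(new byte[1024 * 1024], String.class);
    LeakyHandle leaky = new LeakyHandle(input);
    LeanHandle lean = new LeanHandle(input);
    System.out.println("same result type: " + (leaky.resultType() == lean.resultType()));
    System.out.println("leaky handle retains payload: " + leaky.retainsPayload());
  }
}
```

With many async operations in flight, every `LeakyHandle` pins its payload for the lifetime of the handle, which is the shape of the leak described above.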

There is another memory leak in the SDK where the JVM cannot garbage collect completed state machines because they are held by the cancellation callback in the CancellationScope. This PR does not address that issue because it is significantly more complicated. Given the above fixes, its impact is much smaller, since only the relatively small amount of data for the state machine is retained.

closes #2203

cretz · 2024-10-09T19:27:47Z

...l-sdk/src/main/java/io/temporal/internal/statemachines/EntityStateMachineInitialCommand.java

+  // the command in memory longer than needed. This is safe because if no other state machine holds
+  // a reference
+  // to the command then cancelling it wouldn't have any effect.
+  private WeakReference<CancellableCommand> commandRef;


Hrmm. Can you clarify "This is safe because if no other state machine holds a reference to the command then cancelling it wouldn't have any effect" a bit? For example, I am reading TimerStateMachine and not seeing where another state machine would hold this command.

In general, why is the reference to this overall EntityStateMachineInitialCommand being held on to longer than it's cancellable? And if the lifetime of EntityStateMachineInitialCommand is longer than its lifetime of being cancellable, is there a more explicit way to mark this as no longer cancellable instead of relying on Java references?

Can you clarify "This is safe because if no other state machine holds a reference to the command then cancelling it wouldn't have any effect"

If nothing else is holding a reference to the command then nothing can observe that it was cancelled.
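A rough sketch of that argument, using hypothetical stand-in classes rather than the SDK's `EntityStateMachineInitialCommand` and `CancellableCommand`: the holder keeps only a `WeakReference`, and `cancel()` becomes a no-op once the referent is gone. `simulateCollected()` is an illustration-only helper that clears the reference deterministically instead of waiting for the garbage collector.

```java
import java.lang.ref.WeakReference;

// Hypothetical stand-ins; the real SDK classes have different shapes.
final class WeakCommandSketch {

  /** Stand-in for a command carrying a possibly large input. */
  static final class Command {
    final byte[] input;
    boolean canceled;

    Command(byte[] input) {
      this.input = input;
    }
  }

  /** Holds the command weakly so it does not keep the input alive on its own. */
  static final class InitialCommandHolder {
    private final WeakReference<Command> commandRef;

    InitialCommandHolder(Command command) {
      this.commandRef = new WeakReference<>(command);
    }

    /** Returns true if the command was still reachable and has been marked canceled. */
    boolean cancel() {
      Command command = commandRef.get();
      if (command == null) {
        // Nothing else references the command, so nothing could observe the cancellation.
        return false;
      }
      command.canceled = true;
      return true;
    }

    /** Illustration only: clears the reference the way a GC cycle eventually might. */
    void simulateCollected() {
      commandRef.clear();
    }
  }

  public static void main(String[] args) {
    Command command = new Command(new byte[64]);
    InitialCommandHolder holder = new InitialCommandHolder(command);
    System.out.println("cancel while referenced: " + holder.cancel());
    holder.simulateCollected();
    System.out.println("cancel after clear: " + holder.cancel());
    System.out.println("command saw cancel: " + command.canceled);
  }
}
```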

Why is the reference to this overall EntityStateMachineInitialCommand

The state machine needs to be held as long as whatever operation is running

is there a more explicit way to mark this as no longer cancellable instead of relying on Java references

No, there is nothing generic

If nothing else is holding a reference to the command then nothing can observe that it was cancelled

So does this turn cancel into a no-op? Meaning is there a possibility that nothing holds a reference but we still need to issue the cancel command for determinism correctness? Can you demonstrate a case from a workflow author POV where the TimerStateMachine is still held but it cannot be canceled? Does this only apply for detached cancellation scope because otherwise all other cancellation scopes are parented to the root and must issue cancels when the root is canceled? (I am not completely familiar with cancellation scope hierarchy in Java SDK)

Or am I confusing this cancellable command that affects how the user sees cancel, with the ability to send a cancel command on task complete?

The state machine needs to be held as long as whatever operation is running

I guess I am a bit confused where a state machine is running but cannot be cancelled.

I wonder if there's a better way to split off cancellation from the command. It seems that the issue is that the ability to cancel is holding more information than it needs. Can we make cancel only hold the information it needs instead of the entire command with input/output? Then that even helps people that do still need to cancel things I assume.
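To make the suggestion concrete, here is a minimal sketch of what splitting cancellation from the payload could look like. This is not what the PR does, and every name in it is hypothetical: the state machine keeps a strong reference to a small `CancelToken` only, while the command (and its input) is owned by whoever still needs to send it.

```java
// Hypothetical design sketch; not the approach taken in this PR.
final class SplitCancelSketch {

  /** Small, payload-free cancellation state shared by the command and its state machine. */
  static final class CancelToken {
    private boolean canceled;

    void cancel() {
      canceled = true;
    }

    boolean isCanceled() {
      return canceled;
    }
  }

  /** The command owns the large input and exposes its token. */
  static final class Command {
    final byte[] input;
    final CancelToken token = new CancelToken();

    Command(byte[] input) {
      this.input = input;
    }
  }

  /** The state machine keeps only the token, never the command itself. */
  static final class StateMachine {
    private final CancelToken token;

    StateMachine(Command command) {
      this.token = command.token;
    }

    void cancelInitialCommand() {
      token.cancel();
    }
  }

  public static void main(String[] args) {
    Command command = new Command(new byte[1024]);
    StateMachine machine = new StateMachine(command);
    machine.cancelInitialCommand();
    // A command queue would check the token before sending the command.
    System.out.println("command canceled: " + command.token.isCanceled());
  }
}
```

In this shape the cancel path never depends on whether the garbage collector has run, at the cost of one extra small object per command.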

Cancelling was a no-op before and after this change. This change just means we don't hold onto the command, which may potentially be large, in memory.

I guess I am a bit confused where a state machine is running but cannot be cancelled.

This is the command being canceled, not the state machine.

This change has no effect on the correctness of the SDK; it is purely about releasing memory when it is not needed.

Can you demonstrate a case from a workflow author POV where the TimerStateMachine is still held but this commandRef reference may be gone? And if commandRef is gone, does that mean it cannot be canceled?

This change just means we don't hold onto the command, which may potentially be large, in memory

The question becomes why does the command have to be all or none? Can cancellation part/ability of a command be separated from the large-memory part of the command?

This change has no effect on the correctness of the SDK; it is purely about releasing memory when it is not needed.

Right, I am just a bit concerned about relying on Java reference counts now for cancellation just because some input/output state is being held along with it and we want to save memory (instead of separating cancellation from the other memory it holds).

I may just need to see a workflow where this weak reference can be removed.

We have a workflow test like that for activities as well.

Hrmm, not sure that test covers whether a cancel command is sent (or even if the activity gets scheduled or maybe the signal is sent in the same task). I may write a test just to check. So question, in that test, is the weak reference for the activity cancellable command present by the time this line is reached:

sdk-java/temporal-sdk/src/test/java/io/temporal/workflow/activityTests/cancellation/CancellingScheduledActivityTest.java

Line 95 in d1dc2e1

cancellationScope.cancel();

I'll try to set some time aside to check. If you need to make haste on this, I can mark approved in the meantime though.

I think you are confused; cancelling a command does not send a different command. Again, this PR does not change any behaviour, and we have state machine tests for all of these transitions.

I'll try to set some time aside to check. If you need to make haste on this, I can mark approved in the meantime though.

Yes please

cancelling a command does not send a different command

I was thinking COMMAND_TYPE_CANCEL_TIMER command may come through cancelCommand somehow. But I suppose not. I have not followed the logic for WorkflowStateMachine.cancellableCommands enough to see when the data is no longer referenced.

Again, this PR does not change any behaviour, and we have state machine tests for all of these transitions

I am not sure we have tests that confirm the behavior across versions for replay safety. My concern was more about version change rather than whether the current code works within itself.

Yes please

Ok, marking approved. I am still a bit worried about taking different code paths in workflow code based on when the GC chooses to run.

I am not sure we have tests that confirm the behavior across versions for replay safety. My concern was more about version change rather than whether the current code works within itself.

So for your peace of mind, would a replay test with a workflow that cancels a timer in the same WFT it is scheduled in be enough?

Yes, specifically a test that does a replay of older-version workflow history that now would call this cancelCommand with the weak reference unset would give peace of mind (you may have to invoke System.gc() manually to ensure the weak reference is gone, but I wouldn't expect you to be able to assert that in a unit test, but this kinda shows how hard it is to properly test the code path introduced). But by no means is my peace of mind required :-) If you're confident, you don't have to spend extra time on this. Most of this is my lack of time to devote to understanding the different code paths based on GC status.
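As a rough illustration of why this is hard to assert on, here is a best-effort probe a test could use. The class and method names are hypothetical. `System.gc()` is only a request, so the helper reports whether the reference happened to clear rather than guaranteeing it; the one thing that is reliable is that a referent which is still strongly reachable will not be cleared.

```java
import java.lang.ref.WeakReference;

// Hypothetical test helper; not part of the SDK or its test suite.
final class GcProbeSketch {

  /**
   * Requests garbage collection up to {@code attempts} times and reports whether the weak
   * reference was cleared. A false result does not prove the referent is strongly held,
   * because the JVM is free to ignore System.gc().
   */
  static boolean tryClear(WeakReference<?> ref, int attempts) {
    for (int i = 0; i < attempts && ref.get() != null; i++) {
      System.gc();
    }
    return ref.get() == null;
  }

  public static void main(String[] args) {
    byte[] held = new byte[16];
    WeakReference<byte[]> heldRef = new WeakReference<>(held);
    System.out.println("cleared while strongly held: " + tryClear(heldRef, 3));

    WeakReference<byte[]> looseRef = new WeakReference<>(new byte[16]);
    // Outcome depends on the JVM and GC settings, so it is only printed, not asserted.
    System.out.println("cleared with no strong reference: " + tryClear(looseRef, 3));

    // Use the strong reference afterwards so the first referent stays reachable above.
    System.out.println("held length: " + held.length);
  }
}
```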

cretz

Approving apprehensively just because I am unable to devote enough time to obtaining higher confidence in using weak references in workflow state machines.

Quinn-With-Two-Ns requested a review from a team as a code owner October 9, 2024 17:47

cretz reviewed Oct 9, 2024

cretz approved these changes Oct 10, 2024

Fix memory leak of command inputs

d6a67e0

Quinn-With-Two-Ns force-pushed the issue-2203 branch from 01511e0 to d6a67e0 Compare October 11, 2024 05:05
