
Pro-active dataflow termination #306

Open
frankmcsherry opened this issue Nov 16, 2019 · 0 comments · May be fixed by #519
@frankmcsherry (Member)

At the moment, dataflows shut down only through clean completion of the work within them: one releases all capabilities at the inputs and awaits the draining of the remaining work. One can imagine ill-conceived dataflows that, once started, create more work than the system can hope to handle, and we should have a way to terminate such dataflows without awaiting their successful completion.

It seems that most dataflows are instantiated as trees of owned data, with connection points back to the communication infrastructure. In principle we could simply drop these dataflows, and a fair amount of resource reclamation would happen automatically.
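The pull request linked above takes roughly this approach, adding a `drop_dataflow` method to the worker. A minimal sketch of how that could look from user code, assuming `drop_dataflow` and an `installed_dataflows` method listing the identifiers of live dataflows (both treated here as assumptions rather than settled API):

```rust
use timely::dataflow::operators::{Inspect, ToStream};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Build a dataflow that we may later decide to abandon.
        worker.dataflow::<u64, _, _>(|scope| {
            (0..10).to_stream(scope)
                .inspect(|x| println!("seen: {:?}", x));
        });

        // Pro-actively terminate it without draining its work; dropping
        // the owned operator tree should reclaim most resources.
        // (assumed API: `installed_dataflows` and `drop_dataflow`)
        if let Some(&id) = worker.installed_dataflows().first() {
            worker.drop_dataflow(id);
        }
    }).unwrap();
}
```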

This is probably an idealized view of shutdown, and there are several issues one can already anticipate.

For example, if the channels associated with a dataflow go away, then on receipt of messages for those channels the system cannot currently distinguish an old dataflow that has been shut down from a new dataflow that has not yet started up. The system will reconstruct the channels, stash the messages, and wait for the dataflow construction that would collect them. That construction will never happen, and the effect is a memory leak.

We have also, so far, lived in the pleasant world where dataflows are shut down only once the system is certain there will be no additional messages, so there may be cases where post-shutdown message receipt causes more egregious errors, up to and including panics. At the moment we just don't know.

This issue also relates to the use of capability multisets, whose main use case is messages with empty capability sets. Such messages have the same problem of outliving a dataflow: one could learn that the dataflow can be terminated (all capabilities are gone) and yet still receive messages without capabilities, again currently resulting in memory leaks or worse. A fix here might remove that blocker for multiset capabilities.


One candidate solution is to inform the communication plane of shut-down channels, and then to discard messages destined for those channels. The current channel allocation mechanism uses sequentially assigned identifiers, so the system should be able to distinguish "yet to be created" from "created and terminated" channels just by comparing an identifier against the next-to-be-assigned identifier. Messages for old identifiers without an associated channel should simply be discarded.
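A minimal sketch of that classification, with hypothetical names (`ChannelRegistry`, `Disposition`, and `classify` are illustrative, not timely's actual channel allocation types):

```rust
use std::collections::HashSet;

/// What to do with a message arriving for a given channel identifier.
enum Disposition {
    Deliver, // channel exists: hand the message over
    Stash,   // identifier not yet allocated: dataflow construction is pending
    Discard, // identifier allocated and since dropped: no one will read this
}

/// Hypothetical registry: identifiers are handed out sequentially, so an
/// identifier at or beyond `next_channel_id` is "yet to be created", while
/// a smaller identifier with no live channel is "created and terminated".
struct ChannelRegistry {
    next_channel_id: usize,
    live: HashSet<usize>,
}

impl ChannelRegistry {
    fn classify(&self, channel_id: usize) -> Disposition {
        if self.live.contains(&channel_id) {
            Disposition::Deliver
        } else if channel_id >= self.next_channel_id {
            Disposition::Stash
        } else {
            Disposition::Discard
        }
    }
}
```

With shutdown reported to such a registry, messages for dropped channels would fall through to the discard case instead of accumulating in stashes.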

Dataflow destruction should also happen on all workers, but there seems to be no need for it to happen in exactly the same order on each worker. A dataflow shut down on only a subset of workers should (ideally) just block that dataflow's execution on the other workers, rather than result in unrecoverable errors.

@frankmcsherry frankmcsherry self-assigned this Nov 16, 2019
benesch added a commit to benesch/timely-dataflow that referenced this issue Mar 31, 2023
Between 2673e8c and ff516c8, drop_dataflow appears to be working
well, based on testing in Materialize
(MaterializeInc/materialize#18442). So remove the "public beta" warning
from the docstring.

Fix TimelyDataflow#306.
@benesch benesch linked a pull request Mar 31, 2023 that will close this issue