Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new interface for output #3793

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

A new interface for output #3793

wants to merge 1 commit into from

Conversation

glwagner
Copy link
Member

@glwagner glwagner commented Sep 26, 2024

I've long been unsatisfied with how we build output. It requires a lot of typing --- that is, boilerplate. I often feel a sense of dread when I have to go beyond "visualizing the final iteration" of a prototype to defining an output writer. So much typing.

This PR is an attempt to make output easier and more fun. I use JLD2 as an example but if there is some consensus then I think this PR should extend the same to NetCDF.

The main thrust of this PR is a new function called output!. It works like this:

output!(simulation, outputs; schedule=TimeInterval(1), filename="low_hanging_fruit")

The default is JLD2Format(). For NetCDF users would write

output!(simulation, outputs, NetCDFFormat(); kw...)

The function adds an output writer to simulation, choosing a "generic name" for the simulation.output_writers dictionary. Does this enable one line output writing?

I'd love to hear feedback about this design. I implemented it in the two_dimensional_turbulence.jl example for illustration:

output!(simulation, (; ω, s); schedule=TimeInterval(0.6), filename)

There are two more things. First, we need to default with_halos=true for JLD2OutputWriter. The time has come because FieldTimeSeries is mature. We do, in fact, want halos.

The other conundrum is overwrite_existing which is discussed on #3543. In short I am wondering whether the best course of action is simply to make default overwrite_existing=true and solve so much boilerplate. Simulations are cheap, but life is short!

PS I also want to change add_callback! to just callback!.

@glwagner
Copy link
Member Author

There's an intriguing side benefit of this "unified" interface for output. It means that it is possible (though we don't have it now) for users to specify an "output preference" in a Preferences.toml, which would then determine the default behavior of output!. I think it also legitimately makes it easier to switching between formats.

@glwagner
Copy link
Member Author

Another abstraction I think would be useful is a utility for building multiple outputs. Imagine this:

indices = (xy=(:, :, k), xz=(:, 1, :), yz=(1, :, :))
sliced_outputs!(simulation, outputs, indices; schedule=TimeInterval(1), filename="sliced")

this would append filename with the keys of indices, eg we would get 3 outputs titled "sliced_xy", "sliced_xz", and "sliced_yz".

I personally find myself using indices the most for this kind of thing but others might have additional input.

@ali-ramadhan
Copy link
Member

ali-ramadhan commented Sep 26, 2024

I'd love to hear feedback about this design.

Maybe I'm personally quite picky about specifying outputs and output file names, so I might always end up with verbose boilerplate for output writing (and I'm personally fine with that). But I'd support reducing boilerplate and maybe just a bit of flexibility would work even for picky people! I remember having this conversation in #1171 actually!

One-line flexible output writing would be especially great for examples, new user friendliness, and quick iteration.

Some thoughts:

  1. I think the name output! is a bit vague in what it does. Does it just output the current state of the simulation? Would add_output_writers! be clearer and align more closely with existing Oceananigans nomenclature? For the same reason, I'd suggest keeping add_callback! over renaming to callback!.
  2. I frequently output both JLD2 and NetCDF versions of the exact same data. So ideally subsequent calls to add_output_writers! with JLD2Format() and NetCDFFormat() would add both.
  3. Would definitely support extending this to NetCDF as well!

First, we need to default with_halos=true for JLD2OutputWriter. The time has come because FieldTimeSeries is mature. We do, in fact, want halos.

Sounds good! Will be nice for derivatives to work by default. Although FieldTimeSeries could do with a bit more maturing, e.g. issues #3144 and #3750.

The other conundrum is overwrite_existing which is discussed on #3543. In short I am wondering whether the best course of action is simply to make default overwrite_existing=true and solve so much boilerplate.

Is the "so much boilerplate" just the extra one line overwrite_existing = true for each output writer? Maybe I'm too conservative here but I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

But I really like the suggestions in #3543 of having the option to save output in unique directories be easily specifiable.

If we want an easy default, then maybe it could do some version of the unique directories? Or maybe have overwrite_existing = false (the default) just rename existing files?

I'm not sure of the best approach but as someone who's conservative about overwriting by default I'm tempted to err on the side of caution.

Another abstraction I think would be useful is a utility for building multiple outputs. Imagine this:

indices = (xy=(:, :, k), xz=(:, 1, :), yz=(1, :, :))
sliced_outputs!(simulation, outputs, indices; schedule=TimeInterval(1), filename="sliced")

this would append filename with the keys of indices, eg we would get 3 outputs titled "sliced_xy", "sliced_xz", and "sliced_yz".

Love this idea! Hoping that you can also pass e.g. (surface=(:, :, k), zonal=(:, 1, :), meridional=(1, :, :)) to get sliced_surface, sliced_zonal, and sliced_meridional.

@glwagner
Copy link
Member Author

Maybe I'm personally quite picky about specifying outputs and output file names, so I might always end up with verbose boilerplate for output writing

Totally and to be clear, when we think about the economy of an interface, we are thinking about prototyping, illustrating, testing, not necessarily "production". I think "production" places fewer demands on the user interface and what we have now is ok for production. This PR mainly improves the small stuff. Also arguably it's more helpful for experienced than new users.

I think the name output! is a bit vague in what it does. Does it just output the current state of the simulation? Would add_output_writers! be clearer and align more closely with existing Oceananigans nomenclature? For the same reason, I'd suggest keeping add_callback! over renaming to callback!.

I agree that with "add" and "writer" the meaning is cemented. I think it's important to recognize trade-offs though, because there is a limit to the benefit of being explicit (when things become hard to read or understand). I think in this case I accept that output_writer! is probably better than ouput!. I think prepending add_ has a more marginal benefit (and is a little ugly) and that context is really what drives understanding of callback! / add_callback! (eg a schedule, etc). But this is certainly open for discussion.

Love this idea! Hoping that you can also pass e.g. (surface=(:, :, k), zonal=(:, 1, :), meridional=(1, :, :)) to get sliced_surface, sliced_zonal, and sliced_meridional.

Yes for sure! In that example the keys "xy", "xz", etc would be names appended to the filename prefix.

I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

Do you run with this option? Curious because I never use it. I think the cost of losing data is actually usually very small, it's only in a small 1% of cases that the data is valuable. I think that's actually the key insight behind the default, that expensive simulations are rare so it doesn't make sense to default it.

@glwagner
Copy link
Member Author

Is the "so much boilerplate" just the extra one line overwrite_existing = true for each output writer?

It's one line --- but it's in every script, sometimes many times! Add all that up and you get to a huge amount...

@glwagner
Copy link
Member Author

glwagner commented Oct 1, 2024

@navidcy any thoughts?

I think the main feedback is to keep add_callback! and use add_output_writer! instead. I think it's not very pretty but I am ok with it. It's utilitarian 🤖 .

@josuemtzmo
Copy link
Collaborator

I agree with @ali-ramadhan, keeping a overwrite_existing seems quite important to me too, potentially with a default value of overwrite_existing = false.

Another issue to add into this discussion is the fact of how to handle output (e.g. file.nc) with checkpoints and file_splitting. Currently, when a simulation is pickup from a checkpoint having the flag overwrite_existing = true rewrites to empty all the pre-existing files (i.e. file_part1.nc, file_partN.nc). With overwrite_existing = false, the simulation crashes because it doesn't find the file file.nc. I think it will be useful to handle automatic concatenation to splitted files, to allow a more flexible output particularly in the context of HPC computing with manageable file sizes, chunks, and wall times.

@glwagner
Copy link
Member Author

glwagner commented Oct 3, 2024

potentially with a default value of overwrite_existing = false.

Do you run with overwrite_existing=false? (Outside the context of restoring from a checkpoint.)

@glwagner
Copy link
Member Author

glwagner commented Oct 3, 2024

Another issue to add into this discussion is the fact of how to handle output (e.g. file.nc) with checkpoints and file_splitting.

I propose handling this by initializing output files within run! rather than during instantiation of the output writer. This is separate from the interface discussion here though.

@josuemtzmo
Copy link
Collaborator

I propose handling this by initializing output files within run! rather than during instantiation of the output writer. This is separate from the interface discussion here though.

Ok, I see. I agree that that discussion can be left to the other PR.

@josuemtzmo
Copy link
Collaborator

josuemtzmo commented Oct 3, 2024

potentially with a default value of overwrite_existing = false.

Do you run with overwrite_existing=false? (Outside the context of restoring from a checkpoint.)

I agree, that is not a common use case scenario. I have only used overwrite_existing=false without a checkpoint for short tests within the same Julia instance to extend the model output.

@glwagner
Copy link
Member Author

glwagner commented Oct 3, 2024

I have only used overwrite_existing=false without a checkpoint for short tests within the same Julia instance to extend the model output.

Thank you for pointing out this use case. I think this is another situation that could be solved by waiting until run! for initialization. We can analyze an existing file and determine whether or not any data within the file will be overwritten based on the simulation parameters (current time, stop time).

Another idea by the way would be to move the concept of "overwriting" to run!, as well. Then the single keyword can apply to all output, or not, which presumably more aligned with what a user would want (rather than toggling overwrite_existing for each writer individually).

@glwagner
Copy link
Member Author

glwagner commented Oct 4, 2024

Another idea after talking with @josuemtzmo: add another function called checkpoint!(simulation, schedule) that adds a checkpointer with an automatically generated name. (More to come on this idea)

@glwagner
Copy link
Member Author

glwagner commented Oct 5, 2024

Another utility that I believe is needed is a function that displays the information in an output file. For example something like

julia> outputinfo(filename)

which displays things like

  • the names of the fields
  • information about the fields? location, indices, size
  • the grid
  • the number of output snapshots
  • the output times

anything else?

@ali-ramadhan
Copy link
Member

I'm not too picky about add_output_writer! vs. output_writer! although output! is a bit too vague. I might lean too utilitarian though 🤖 Curious what other people think.

I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

Do you run with this option? Curious because I never use it. I think the cost of losing data is actually usually very small, it's only in a small 1% of cases that the data is valuable. I think that's actually the key insight behind the default, that expensive simulations are rare so it doesn't make sense to default it.

I do actually haha. But what I also do is set different output directories for each run so I can always go back and compare different runs as I'm iteratively modifying stuff. So I guess setting different directories is guaranteeing that I never overwrite existing files, but then overwrite_existing = false is only a guardrail for me.

Another utility that I believe is needed is a function that displays the information in an output file.

We might almost already have this with the show function of FieldDataset if it's called with backend=OnDisk(). Maybe it could be shared/re-purposed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants