A new interface for output #3793

glwagner · 2024-09-26T02:13:31Z

I've long been unsatisfied with how we build output. It requires a lot of typing --- that is, boilerplate. I often feel a sense of dread when I have to go beyond "visualizing the final iteration" of a prototype to defining an output writer. So much typing.

This PR is an attempt to make output easier and more fun. I use JLD2 as an example but if there is some consensus then I think this PR should extend the same to NetCDF.

The main thrust of this PR is a new function called output!. It works like this:

output!(simulation, outputs; schedule=TimeInterval(1), filename="low_hanging_fruit")

The default is JLD2Format(). For NetCDF users would write

output!(simulation, outputs, NetCDFFormat(); kw...)

The function adds an output writer to simulation, choosing a "generic name" for the simulation.output_writers dictionary. Does this enable one line output writing?

I'd love to hear feedback about this design. I implemented it in the two_dimensional_turbulence.jl example for illustration:

Oceananigans.jl/examples/two_dimensional_turbulence.jl

Line 109 in 6a6853d

output!(simulation, (; ω, s); schedule=TimeInterval(0.6), filename)

There are two more things. First, we need to default with_halos=true for JLD2OutputWriter. The time has come because FieldTimeSeries is mature. We do, in fact, want halos.

The other conundrum is overwrite_existing which is discussed on #3543. In short I am wondering whether the best course of action is simply to make default overwrite_existing=true and solve so much boilerplate. Simulations are cheap, but life is short!

PS I also want to change add_callback! to just callback!.

glwagner · 2024-09-26T02:24:54Z

There's an intriguing side benefit of this "unified" interface for output. It means that it is possible (though we don't have it now) for users to specify an "output preference" in a Preferences.toml, which would then determine the default behavior of output!. I think it also legitimately makes it easier to switching between formats.

glwagner · 2024-09-26T03:21:06Z

Another abstraction I think would be useful is a utility for building multiple outputs. Imagine this:

indices = (xy=(:, :, k), xz=(:, 1, :), yz=(1, :, :))
sliced_outputs!(simulation, outputs, indices; schedule=TimeInterval(1), filename="sliced")

this would append filename with the keys of indices, eg we would get 3 outputs titled "sliced_xy", "sliced_xz", and "sliced_yz".

I personally find myself using indices the most for this kind of thing but others might have additional input.

ali-ramadhan · 2024-09-26T04:50:17Z

I'd love to hear feedback about this design.

Maybe I'm personally quite picky about specifying outputs and output file names, so I might always end up with verbose boilerplate for output writing (and I'm personally fine with that). But I'd support reducing boilerplate and maybe just a bit of flexibility would work even for picky people! I remember having this conversation in #1171 actually!

One-line flexible output writing would be especially great for examples, new user friendliness, and quick iteration.

Some thoughts:

I think the name output! is a bit vague in what it does. Does it just output the current state of the simulation? Would add_output_writers! be clearer and align more closely with existing Oceananigans nomenclature? For the same reason, I'd suggest keeping add_callback! over renaming to callback!.
I frequently output both JLD2 and NetCDF versions of the exact same data. So ideally subsequent calls to add_output_writers! with JLD2Format() and NetCDFFormat() would add both.
Would definitely support extending this to NetCDF as well!

First, we need to default with_halos=true for JLD2OutputWriter. The time has come because FieldTimeSeries is mature. We do, in fact, want halos.

Sounds good! Will be nice for derivatives to work by default. Although FieldTimeSeries could do with a bit more maturing, e.g. issues #3144 and #3750.

The other conundrum is overwrite_existing which is discussed on #3543. In short I am wondering whether the best course of action is simply to make default overwrite_existing=true and solve so much boilerplate.

Is the "so much boilerplate" just the extra one line overwrite_existing = true for each output writer? Maybe I'm too conservative here but I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

But I really like the suggestions in #3543 of having the option to save output in unique directories be easily specifiable.

If we want an easy default, then maybe it could do some version of the unique directories? Or maybe have overwrite_existing = false (the default) just rename existing files?

I'm not sure of the best approach but as someone who's conservative about overwriting by default I'm tempted to err on the side of caution.

Another abstraction I think would be useful is a utility for building multiple outputs. Imagine this:
indices = (xy=(:, :, k), xz=(:, 1, :), yz=(1, :, :))
sliced_outputs!(simulation, outputs, indices; schedule=TimeInterval(1), filename="sliced")
this would append filename with the keys of indices, eg we would get 3 outputs titled "sliced_xy", "sliced_xz", and "sliced_yz".

Love this idea! Hoping that you can also pass e.g. (surface=(:, :, k), zonal=(:, 1, :), meridional=(1, :, :)) to get sliced_surface, sliced_zonal, and sliced_meridional.

glwagner · 2024-09-26T13:07:16Z

Maybe I'm personally quite picky about specifying outputs and output file names, so I might always end up with verbose boilerplate for output writing

Totally and to be clear, when we think about the economy of an interface, we are thinking about prototyping, illustrating, testing, not necessarily "production". I think "production" places fewer demands on the user interface and what we have now is ok for production. This PR mainly improves the small stuff. Also arguably it's more helpful for experienced than new users.

I think the name output! is a bit vague in what it does. Does it just output the current state of the simulation? Would add_output_writers! be clearer and align more closely with existing Oceananigans nomenclature? For the same reason, I'd suggest keeping add_callback! over renaming to callback!.

I agree that with "add" and "writer" the meaning is cemented. I think it's important to recognize trade-offs though, because there is a limit to the benefit of being explicit (when things become hard to read or understand). I think in this case I accept that output_writer! is probably better than ouput!. I think prepending add_ has a more marginal benefit (and is a little ugly) and that context is really what drives understanding of callback! / add_callback! (eg a schedule, etc). But this is certainly open for discussion.

Love this idea! Hoping that you can also pass e.g. (surface=(:, :, k), zonal=(:, 1, :), meridional=(1, :, :)) to get sliced_surface, sliced_zonal, and sliced_meridional.

Yes for sure! In that example the keys "xy", "xz", etc would be names appended to the filename prefix.

I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

Do you run with this option? Curious because I never use it. I think the cost of losing data is actually usually very small, it's only in a small 1% of cases that the data is valuable. I think that's actually the key insight behind the default, that expensive simulations are rare so it doesn't make sense to default it.

glwagner · 2024-09-26T13:08:25Z

Is the "so much boilerplate" just the extra one line overwrite_existing = true for each output writer?

It's one line --- but it's in every script, sometimes many times! Add all that up and you get to a huge amount...

glwagner · 2024-10-01T12:04:37Z

@navidcy any thoughts?

I think the main feedback is to keep add_callback! and use add_output_writer! instead. I think it's not very pretty but I am ok with it. It's utilitarian 🤖 .

josuemtzmo · 2024-10-03T15:27:33Z

I agree with @ali-ramadhan, keeping a overwrite_existing seems quite important to me too, potentially with a default value of overwrite_existing = false.

Another issue to add into this discussion is the fact of how to handle output (e.g. file.nc) with checkpoints and file_splitting. Currently, when a simulation is pickup from a checkpoint having the flag overwrite_existing = true rewrites to empty all the pre-existing files (i.e. file_part1.nc, file_partN.nc). With overwrite_existing = false, the simulation crashes because it doesn't find the file file.nc. I think it will be useful to handle automatic concatenation to splitted files, to allow a more flexible output particularly in the context of HPC computing with manageable file sizes, chunks, and wall times.

glwagner · 2024-10-03T15:30:11Z

potentially with a default value of overwrite_existing = false.

Do you run with overwrite_existing=false? (Outside the context of restoring from a checkpoint.)

glwagner · 2024-10-03T15:31:08Z

Another issue to add into this discussion is the fact of how to handle output (e.g. file.nc) with checkpoints and file_splitting.

I propose handling this by initializing output files within run! rather than during instantiation of the output writer. This is separate from the interface discussion here though.

josuemtzmo · 2024-10-03T16:53:40Z

I propose handling this by initializing output files within run! rather than during instantiation of the output writer. This is separate from the interface discussion here though.

Ok, I see. I agree that that discussion can be left to the other PR.

josuemtzmo · 2024-10-03T16:56:10Z

potentially with a default value of overwrite_existing = false.

Do you run with overwrite_existing=false? (Outside the context of restoring from a checkpoint.)

I agree, that is not a common use case scenario. I have only used overwrite_existing=false without a checkpoint for short tests within the same Julia instance to extend the model output.

glwagner · 2024-10-03T19:28:01Z

I have only used overwrite_existing=false without a checkpoint for short tests within the same Julia instance to extend the model output.

Thank you for pointing out this use case. I think this is another situation that could be solved by waiting until run! for initialization. We can analyze an existing file and determine whether or not any data within the file will be overwritten based on the simulation parameters (current time, stop time).

Another idea by the way would be to move the concept of "overwriting" to run!, as well. Then the single keyword can apply to all output, or not, which presumably more aligned with what a user would want (rather than toggling overwrite_existing for each writer individually).

src/Simulations/callback.jl

glwagner · 2024-10-04T15:07:37Z

Another idea after talking with @josuemtzmo: add another function called checkpoint!(simulation, schedule) that adds a checkpointer with an automatically generated name. (More to come on this idea)

glwagner · 2024-10-05T15:11:47Z

Another utility that I believe is needed is a function that displays the information in an output file. For example something like

julia> outputinfo(filename)

which displays things like

the names of the fields
information about the fields? location, indices, size
the grid
the number of output snapshots
the output times

anything else?

ali-ramadhan · 2024-10-07T14:15:59Z

I'm not too picky about add_output_writer! vs. output_writer! although output! is a bit too vague. I might lean too utilitarian though 🤖 Curious what other people think.

I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

Do you run with this option? Curious because I never use it. I think the cost of losing data is actually usually very small, it's only in a small 1% of cases that the data is valuable. I think that's actually the key insight behind the default, that expensive simulations are rare so it doesn't make sense to default it.

I do actually haha. But what I also do is set different output directories for each run so I can always go back and compare different runs as I'm iteratively modifying stuff. So I guess setting different directories is guaranteeing that I never overwrite existing files, but then overwrite_existing = false is only a guardrail for me.

Another utility that I believe is needed is a function that displays the information in an output file.

We might almost already have this with the show function of FieldDataset if it's called with backend=OnDisk(). Maybe it could be shared/re-purposed?

Introducing output helpers

6a6853d

navidcy added the output 💾 label Sep 30, 2024

glwagner mentioned this pull request Oct 3, 2024

Improve concatenation of netCDF files with simulation pickup #3818

Open

josuemtzmo reviewed Oct 4, 2024

View reviewed changes

src/Simulations/callback.jl Show resolved Hide resolved

glwagner commented Oct 4, 2024

View reviewed changes

src/Simulations/callback.jl Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A new interface for output #3793

A new interface for output #3793

glwagner commented Sep 26, 2024 •

edited

Loading

glwagner commented Sep 26, 2024

glwagner commented Sep 26, 2024

ali-ramadhan commented Sep 26, 2024 •

edited

Loading

glwagner commented Sep 26, 2024

glwagner commented Sep 26, 2024

glwagner commented Oct 1, 2024 •

edited

Loading

josuemtzmo commented Oct 3, 2024

glwagner commented Oct 3, 2024

glwagner commented Oct 3, 2024

josuemtzmo commented Oct 3, 2024

josuemtzmo commented Oct 3, 2024 •

edited

Loading

glwagner commented Oct 3, 2024 •

edited

Loading

glwagner commented Oct 4, 2024 •

edited

Loading

glwagner commented Oct 5, 2024

ali-ramadhan commented Oct 7, 2024

A new interface for output #3793

Are you sure you want to change the base?

A new interface for output #3793

Conversation

glwagner commented Sep 26, 2024 • edited Loading

glwagner commented Sep 26, 2024

glwagner commented Sep 26, 2024

ali-ramadhan commented Sep 26, 2024 • edited Loading

glwagner commented Sep 26, 2024

glwagner commented Sep 26, 2024

glwagner commented Oct 1, 2024 • edited Loading

josuemtzmo commented Oct 3, 2024

glwagner commented Oct 3, 2024

glwagner commented Oct 3, 2024

josuemtzmo commented Oct 3, 2024

josuemtzmo commented Oct 3, 2024 • edited Loading

glwagner commented Oct 3, 2024 • edited Loading

glwagner commented Oct 4, 2024 • edited Loading

glwagner commented Oct 5, 2024

ali-ramadhan commented Oct 7, 2024

glwagner commented Sep 26, 2024 •

edited

Loading

ali-ramadhan commented Sep 26, 2024 •

edited

Loading

glwagner commented Oct 1, 2024 •

edited

Loading

josuemtzmo commented Oct 3, 2024 •

edited

Loading

glwagner commented Oct 3, 2024 •

edited

Loading

glwagner commented Oct 4, 2024 •

edited

Loading