Walk file tree on directory publish #3933

Open
wants to merge 5 commits into
base: master

bentsherman
Member

Close #3372

I just changed the publish dir to explicitly publish each file in a directory. I think it means that if the publish mode is symlink, each file in the directory will be symlinked instead of just the directory.

@robsyme do we just need to add a checksum comparison to S3 copies and uploads? I think I can just rip the code from this PR #3802


netlify bot commented Feb 8, 2024

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit ad3c4cb
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/664bbe302c86c2000895f798

Signed-off-by: Ben Sherman <[email protected]>
@bentsherman bentsherman marked this pull request as ready for review February 8, 2024 23:12
@bentsherman
Member Author

This PR is ready for review. The publish dir now publishes each file in a directory instead of just the directory, which should improve detection of Reports in Seqera Platform and fix some issues with Fusion symlinks (#4725).

It does not yet solve #3372; for that, we also need to verify the checksum of published files when deciding whether a file needs to be re-published. I can implement that here or in a separate PR, up to you @pditommaso.
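
To make the checksum idea concrete, here is a minimal sketch of the kind of check being described. It is not the Nextflow implementation and not the code from #3802; `ChecksumSketch`, `sha256` and `needsPublish` are hypothetical names, and SHA-256 is only an assumed choice of digest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper (not part of Nextflow): decides whether a file must be re-published.
final class ChecksumSketch {

    // Stream the file through SHA-256 so large outputs are never loaded fully into memory.
    static byte[] sha256(Path file) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) != -1) {
                    // Reading is what feeds the digest; there is nothing else to do here.
                }
            }
            return digest.digest();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Re-publish when the target is missing, differs in size, or differs in content.
    static boolean needsPublish(Path source, Path target) {
        if (!Files.isRegularFile(target)) {
            return true;
        }
        try {
            if (Files.size(source) != Files.size(target)) {
                return true; // cheap check before hashing both files
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return !Arrays.equals(sha256(source), sha256(target));
    }
}
```

With something like this, a resumed run would skip a file only when `needsPublish(source, target)` returns `false`. For object storage such as S3, the remote checksum would presumably come from object metadata rather than from re-reading the file, which this sketch does not cover.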

@robsyme
Collaborator

robsyme commented Feb 9, 2024

Looks good, thanks Ben!

When the publish mode is symlink and we are emitting/publishing a directory, what does the published output look like?

Is it a symlinked directory full of symlinks, or a regular directory full of symlinks, or something else entirely?

@bentsherman
Member Author

It will be a regular directory of symlinks.
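
For illustration, the resulting layout can be sketched with plain Java NIO. This is not the code in this PR; `SymlinkTreeSketch` and `publishAsSymlinks` are hypothetical names, and the sketch assumes a local filesystem that supports symbolic links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical sketch (not the PR's code): publish a directory as real directories
// containing one symlink per file, instead of a single symlink to the directory.
final class SymlinkTreeSketch {

    static void publishAsSymlinks(Path sourceDir, Path targetDir) {
        try {
            Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                        throws IOException {
                    // Mirror each source directory as a regular directory in the target.
                    Files.createDirectories(targetDir.resolve(sourceDir.relativize(dir)));
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    // Each file becomes its own symlink pointing back at the source file.
                    Path link = targetDir.resolve(sourceDir.relativize(file));
                    Files.deleteIfExists(link); // replace a stale link from an earlier run
                    Files.createSymbolicLink(link, file.toAbsolutePath());
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `publishAsSymlinks(workDir.resolve("results"), publishDir.resolve("results"))` (paths are placeholders) would leave `results` and its subdirectories as ordinary directories, with only the leaf files being symlinks.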

Member

@pditommaso pditommaso left a comment

Not sure we should do this. On some storage systems (e.g. Lustre, NFS), traversing a directory tree is extremely expensive.

@bentsherman
Member Author

I share your concern, although copying a directory also involves walking the directory tree at some level, so maybe it's not an issue. Either way, there are other reasons we should do this.

In general, I think it is a bad practice to define a directory output. When I see a process with a directory output, I know nothing about the file structure within that directory.

So I think we should either not support directory outputs or support them correctly. Currently we just support it incorrectly:

The performance implications you mention only arise if the user decides to use directory outputs. I think we could simply document this point and encourage users to use flat files as much as possible.

Consider also that we already walk the work directory to collect the output files:

FileHelper.visitFiles(opts, workDir, namePattern) { Path it -> files.add(it) }

which has similar performance issues when there are nested directories.

@bentsherman
Member Author

Indeed, when we publish a directory we are already traversing it:

CopyMoveHelper.copyDirectory(source, target, options)

This PR is just doing the traversal earlier

@bentsherman
Member Author

@robsyme what do you think about what I said about publishing a symlink directory? It is certainly a change in behavior, but I'm wondering if it's more desirable to symlink all of the files rather than only symlink the directory.

If not, I can modify the logic to only traverse the directory when the publish mode isn't symlink, but I feel like that would make the behavior inconsistent based on the publish mode.

@robsyme
Collaborator

robsyme commented Feb 19, 2024

Thanks for moving this forward Ben, and apologies for my slow response times! My feeling is that we're stuck between a rock and a hard place, but inconsistency between publishDir modes is the worse outcome.

As for

bad practice to define a directory output

I think it's inevitable that people will want to publish results that contain directories (many tools have complex/nested outputs).

@bentsherman
Member Author

Indeed, even though a directory output is harder to reason about, it seems to be convenient for the first iteration of a Nextflow process for certain tools.

@pditommaso
Member

I think this is solved by #5005

@bentsherman
Member Author

The fusion symlinks are no longer an issue, but there are other reasons to merge this:

  • when the workflow output definition uses the deep overwrite mode, which compares checksums to decide whether to overwrite a file, comparing each individual file instead of the entire directory would allow for finer-grained caching

  • walking the directory emits a publish event for each individual file, which makes them easier to capture for reports
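
As a rough sketch of the second point, walking the directory lets a listener see one event per published file rather than a single event for the whole directory. This is only an illustration under assumptions: `PublishEventSketch`, `publishEachFile` and the `onFilePublish` callback are hypothetical names, not Nextflow's observer API, and copy mode is assumed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch (not Nextflow's observer API): copy a directory file by file
// and notify a listener once per published file.
final class PublishEventSketch {

    // Collect every regular file under dir, sorted for a deterministic order.
    static List<Path> listFiles(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void publishEachFile(Path sourceDir, Path targetDir, Consumer<Path> onFilePublish) {
        for (Path file : listFiles(sourceDir)) {
            Path target = targetDir.resolve(sourceDir.relativize(file));
            try {
                Files.createDirectories(target.getParent());
                Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // One event per file; a per-file overwrite check could also be placed here.
            onFilePublish.accept(target);
        }
    }
}
```

A report collector could then pass something like `published::add` as the callback and receive every individual file path, including files nested in subdirectories.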
