Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: Add --exclude-from-upload and --exclude-from-download options #370

Merged
merged 1 commit into from
May 29, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented May 24, 2024

Excluding files from upload is very handy for ad-hoc skipping of large ancillary files or previous output files in the build directory that the user wants to ignore for the remote build on AWS Batch (i.e. to start it "fresh"). Existing workarounds for the lack of an exclusion mechanism include git worktrees to obtain a clean state and moving files or directories temporarily out of the build directory.

A future improvement would be adding support to also specify these patterns via a file, which can then be checked in to a build repo and shared by all users.

As a broader improvement, I'd like to design away this issue by separating out the workflow-as-program (rules, code, etc.) from the workflow-as-state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this "excluded files" need too. For example, instead of relying on the implicit state combined with the code like now in a single build directory, we'd instead explicitly pass an empty starting state separate from the workflow program.

This change includes a small bug fix for --download to allow only negated patterns.

Resolves: #219

Checklist

  • Checks pass

@tsibley tsibley requested review from a team, jameshadfield and trvrb May 24, 2024 21:09
@tsibley tsibley force-pushed the trs/build/exclude-from-upload branch from 13064be to 8bc1632 Compare May 24, 2024 21:12
@tsibley
Copy link
Member Author

tsibley commented May 24, 2024

Ah, this has a bug around absolute vs. relative path handling and pattern specification. I'll address it and repush. Otherwise, though, still ready for 👀 (just not testing).

Excluding files from upload is very handy for ad-hoc skipping of large
ancillary files or previous output files in the build directory that the
user wants to ignore for the remote build on AWS Batch (i.e. to start it
"fresh").  Existing workarounds for the lack of an exclusion mechanism
include git worktrees to obtain a clean state and moving files or
directories temporarily out of the build directory.

A future improvement would be adding support to also specify these
patterns via a file, which can then be checked in to a build repo and
shared by all users.

As a broader improvement, I'd like to design away this issue by
separating out the workflow-as-program (rules, code, etc.) from the
workflow-as-state (config, inputs, outputs, etc.).  This comes out of
several other goals, but would apply to this "excluded files" need too.
For example, instead of relying on the implicit state combined with the
code like now in a single build directory, we'd instead explicitly pass
an empty starting state separate from the workflow program.

This change includes a small bug fix for --download to allow _only_
negated patterns.

Resolves: <#219>
@tsibley tsibley force-pushed the trs/build/exclude-from-upload branch from 8bc1632 to 6d465f0 Compare May 24, 2024 22:20
@tsibley
Copy link
Member Author

tsibley commented May 24, 2024

Bug squashed in repush.

doc/commands/build.rst Show resolved Hide resolved
@tsibley
Copy link
Member Author

tsibley commented May 24, 2024

@lmoncla @jameshadfield @trvrb If you want to test this out before merge/release to make sure it works for you, you can temporarily install the CLI built from this PR with:

curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/mac \
  | DESTINATION=/tmp/cli bash -s pr-build/370

Ignore the shell setup instructions printed at the end. To test, run /tmp/cli/nextstrain build … instead of nextstrain build ….

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good by inspection. Nice to see the implementation wasn't too far from how I thought.

@@ -97,13 +97,67 @@ options
wildcards (``**``), extended globbing (``@(…)``, ``+(…)``, etc.),
and negation (``!…``).

Patterns should be relative to the build directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially tangential to this PR

Reading this code my understanding is that only the build directory is uploaded, and so patterns are relative to that. How does this work when the build directory is (e.g.) ingest/ but within that the Snakefile refers back to a parent directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a great point. The directory here that matters—the one that's uploaded and one that patterns must be relative to—is influenced by the nextstrain-pathogen.yaml file. For example, running nextstrain build ingest/ in a directory with nextstrain-pathogen.yaml will require patterns be relative to the directory containing the marker file, not the ingest/ directory, i.e. to exclude something in ingest/ would require the pattern to be ingest/x/y/z not just x/y/z.

I'll update the docs here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I want to leave the option docs/help as-is for now for this PR. I'm realizing there's larger documentation reworking to do to clarify terminology here as a result of #355 (something I'd postponed in that PR). Writing a clear help snippet for these pattern options will be much easier and less inconsistent/confusing once the larger terminology shift and doc updates happen. I should prioritize that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jameshadfield jameshadfield May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 totally forgot about this subtlety. Is this behaviour mirrored in what directories are bind-mounted into the docker container (when running locally)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

Before #355 the directory given to nextstrain build ended up as /nextstrain/build in the container and /nextstrain/build was the fixed working directory for the container process.

After #355, the /nextstrain/build mount is either the directory given to nextstrain build or, if present, the closest ancestor of the given directory which contains a nextstrain-pathogen.yaml. The working directory for the container process is the /nextstrain/build path corresponding to the directory given to nextstrain build.

@tsibley tsibley merged commit 7de3895 into master May 29, 2024
43 checks passed
@tsibley tsibley deleted the trs/build/exclude-from-upload branch May 29, 2024 23:06
@tsibley
Copy link
Member Author

tsibley commented May 29, 2024

This feature will be in the being-released-as-we-speak 8.4.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[batch] exclude directories for upload
3 participants