Input files packaging #510

Merged

Conversation

andrii-i
Collaborator

@andrii-i andrii-i commented Apr 24, 2024

When working with notebooks, users frequently import and use additional files such as datasets, images, and scripts inside their notebook's cells. Supporting the packaging of such files ensures that notebooks have all essential resources available when executed as a job. This makes Jupyter Scheduler more flexible and able to accommodate more types of workflows, which gives better value to the people who use it.

This PR adds an option to package the input folder (the folder where the input notebook is located), along with all nested files and sub-folders within it, when a job or job definition is created.
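As a rough illustration of what packaging an input folder involves, the sketch below copies a notebook's parent folder into a separate directory while preserving the file tree. It is not the code from this PR: `package_input_folder` and the `staging` directory are hypothetical names chosen for the example, and only the Python standard library is used.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def package_input_folder(notebook_path: Path, staging_dir: Path) -> List[str]:
    """Copy the notebook's folder into staging_dir (hypothetical helper).

    Returns the relative paths of all files found in staging_dir afterwards.
    Assumes staging_dir is not located inside the input folder.
    """
    input_dir = notebook_path.parent
    # copytree keeps the sub-folder structure, so relative paths stay valid.
    shutil.copytree(input_dir, staging_dir, dirs_exist_ok=True)
    return sorted(
        p.relative_to(staging_dir).as_posix()
        for p in staging_dir.rglob("*")
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    input_dir = root / "project"
    (input_dir / "data").mkdir(parents=True)
    (input_dir / "analysis.ipynb").write_text("{}")
    (input_dir / "data" / "values.csv").write_text("a,b\n1,2\n")

    packaged = package_input_folder(input_dir / "analysis.ipynb", root / "staging")
    print(packaged)  # ['analysis.ipynb', 'data/values.csv']
```

Because the tree is copied as-is, a notebook cell that reads `data/values.csv` by relative path would find the file in the same place relative to the copied notebook.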

In terms of features, this is a subset of PR #500. This PR does not automatically download output files to the output folder when a job runs, so it has no need to schedule downloads from multiple processes or to add the components that would manage them (DownloadRunner and DownloadManager from #500). Besides making this PR more focused in terms of functionality, this makes the changes it introduces non-breaking.

When package input folder option is active:

  • All files and sub-folders within the input folder are copied to the staging area together with the snapshot of the input notebook.
  • The original organization of files and folders (the file tree) is maintained, so the input notebook can access them via relative file paths when it is run as a job.
  • After downloading job files, users can open the output folder by clicking the "Files" shortcut under "Output files" in the list view (it opens the folder in the JupyterLab file browser).
  • For jobs, packaged input files and side effects are tracked in Job.packaged_files. If any of them is deleted from the output folder, the user gets an option to re-download them via the UI (this matches the existing behavior for the snapshot of the input notebook and the output files).
  • For job definitions, packaged input files and side effects are tracked in JobDefinition.packaged_files.
  • The notebook / job execution logic is modified so that the kernel executing the notebook always operates in the same directory as the notebook being executed (cwd parameter of the ExecutePreprocessor). Additionally, the intended path of the execution context is passed to the preprocessor via the metadata {"metadata": {"path": notebook_dir}} argument of the preprocessor call.
  • Side effects of running an input notebook are added to Job.packaged_files and copied to the output folder together with other files.
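One way to picture the side-effect tracking and the re-download check described in the list is to compare file listings before and after a run. The sketch below is an assumption-laden illustration, not the PR's implementation: `list_files`, `find_side_effects`, and `missing_files` are hypothetical helpers, the notebook run is simulated by writing a file, and `tracked_files` merely stands in for something like Job.packaged_files.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Set


def list_files(folder: Path) -> Set[str]:
    """Relative POSIX paths of every file under folder."""
    return {p.relative_to(folder).as_posix() for p in folder.rglob("*") if p.is_file()}


def find_side_effects(before: Set[str], after: Set[str]) -> Set[str]:
    """Files that exist after the run but did not exist before it."""
    return after - before


def missing_files(tracked: Set[str], output_dir: Path) -> Set[str]:
    """Tracked files that are no longer present in output_dir."""
    return {rel for rel in tracked if not (output_dir / rel).is_file()}


with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging"
    staging.mkdir()
    (staging / "analysis.ipynb").write_text("{}")
    (staging / "input.csv").write_text("x\n1\n")

    before = list_files(staging)
    # Stand-in for executing the notebook: it writes a new file next to itself.
    (staging / "result.txt").write_text("done")
    side_effects = find_side_effects(before, list_files(staging))

    # Hypothetical analogue of the tracked file list for a job.
    tracked_files = before | side_effects

    # Copy everything to an output folder, then simulate a user deleting a file.
    output = Path(tmp) / "output"
    shutil.copytree(staging, output)
    (output / "input.csv").unlink()
    to_redownload = missing_files(tracked_files, output)

    print(sorted(side_effects))   # ['result.txt']
    print(sorted(to_redownload))  # ['input.csv']
```

A non-empty result from the last check is the kind of condition under which a UI could offer a re-download option.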

Fixes #407

Before:
image
image

After:
Screenshot 2024-04-25 at 11 31 38 AM

Screenshot 2024-04-25 at 11 31 54 AM

Screenshot 2024-04-02 at 1 46 20 PM

Re-download option when any of the files is deleted from the output folder:
Screenshot 2024-04-02 at 1 47 36 PM

image

@andrii-i andrii-i added the enhancement New feature or request label Apr 24, 2024
@andrii-i
Collaborator Author

Kicking CI

@andrii-i andrii-i closed this Apr 24, 2024
@andrii-i andrii-i reopened this Apr 24, 2024
@andrii-i andrii-i marked this pull request as ready for review April 25, 2024 00:10
@andrii-i
Collaborator Author

Kicking CI

@andrii-i andrii-i closed this Apr 25, 2024
@andrii-i
Collaborator Author

Kicking CI

@andrii-i andrii-i closed this Apr 26, 2024
@andrii-i andrii-i reopened this Apr 26, 2024
@andrii-i
Collaborator Author

> Could you please update the screenshots and text in the docs to reflect this new option? I can help with the copy editing.

@JasonWeill I updated the "Submit the Create Job form" section of the Users page with a screenshot and text mentioning the "Run job with input folder" option (code, readthedocs preview).

I tried different options, such as adding an example ("Use this to, for example, access data files or images from notebook's cells."), but ultimately what I have in the PR now works well and matches the level of detail in the paragraphs around it. I would be interested to hear your opinion on this.

Review thread on docs/users/index.md (outdated, resolved)
@andrii-i
Collaborator Author

Kicking CI

@andrii-i andrii-i closed this Apr 26, 2024
@andrii-i andrii-i reopened this Apr 26, 2024
Collaborator

@JasonWeill JasonWeill left a comment

Change looks good! Thanks for the great work on this.

@andrii-i andrii-i merged commit 4d7de94 into jupyter-server:main Apr 26, 2024
6 checks passed
@andrii-i andrii-i deleted the package_input_files_no_autodownload branch April 26, 2024 19:24
andrii-i added a commit to andrii-i/jupyter-scheduler that referenced this pull request Apr 29, 2024
…ger) (jupyter-server#510)

* package input files and folders (backend)

* package input files and folders (frontend)

* remove "input_dir" from staging_paths dict

* ensure execution context matches the notebook directory

* update snapshots

* copy staging folder to output folder after job runs (SUCESS or FAILURE)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* copy staging folder and side effects to output after job runs, track and redownload files

* remove staging to output copying logic from executor

* refactor output files creation logic into a separate function for clarity

* Fix job definition data model

* add packaged_files to JobDefinition and DescribeJobDefinition model

* fix existing pytests

* clarify FilesDirectoryLink title

* Dynamically display input folder in the checkbox text

* display packageInputFolder parameter as 'Files included'

* use helper text with input directory for 'include files' checkbox

* Update Playwright Snapshots

* add test side effects accountability test for execution manager

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use "Run job with input folder" for packageInputFolder checkbox text

* Update Playwright Snapshots

* Use "Ran with input folder" in detail page

* Update src/components/input-folder-checkbox.tsx

Co-authored-by: Jason Weill <[email protected]>

* fix lint error

* Update Playwright Snapshots

* Update existing screenshots

* Update "Submit the Create Job" section mentioning “Run job with input folder” option

* Update docs/users/index.md

Co-authored-by: Jason Weill <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update src/components/input-folder-checkbox.tsx

Co-authored-by: Jason Weill <[email protected]>

* Update Playwright Snapshots

* Describe side effects behavior better

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jason Weill <[email protected]>
andrii-i added a commit that referenced this pull request Apr 30, 2024
…adManager) (#510)  (#512)

* Package input files (no autodownload, no multiprocessing DownloadManager) (#510)

* Update Playwright snapshots

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jason Weill <[email protected]>
@andrii-i andrii-i changed the title Package input files (no autodownload, no multiprocessing DownloadManager) Package input files Aug 21, 2024
@andrii-i andrii-i changed the title Package input files Input files packaging Aug 21, 2024
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for packaging dependency files
2 participants