
Functions to manage drake cache location #66

Open
diazrenata opened this issue Feb 12, 2019 · 6 comments
Labels: enhancement (New feature or request), HPC (related to deployment on high-performance computing), infrastructure (how MATSS runs)

Comments

@diazrenata (Member)
Just something to think about?

If it's in the cloud, I think this allows us to pass the cache around without re-running targets. However, the cache is sensitive to drake version changes (which is how this came to my attention). Right now it's in the .gitignore, but a version of it lives on GitHub, so I'm not sure what our consensus on it is.

@ha0ye (Member) commented Feb 12, 2019

It ballooned to nearly half a gigabyte with the BBS analysis included, so I think we're going to exhaust piggyback very quickly. I think this needs to be set up as a larger technical issue covering:

  • a designated location to download the cache file
  • authentication to upload/update the cache file (not sure about the logistics here; UFL servers will probably restrict access to the UF network, even just to download the cache file)
  • maybe also check what's causing the BBS output to be so large (probably a separate issue to address the pre-processing there?)
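The first two bullets could be sketched as a small fetch helper. Everything here is hypothetical (the URL, the function names; nothing in MATSS works this way yet), and the transfer mechanism is injected so it can follow whatever host gets chosen:

```python
import os

# Hypothetical location for the shared cache archive; the real URL would
# point at whichever host the project settles on.
CACHE_URL = "https://example.org/matss/drake-cache.tar.gz"

def fetch_cache(dest, downloader, url=CACHE_URL):
    """Download the cache archive to `dest` unless it is already present.

    `downloader` is an injected callable (e.g. a wrapper around curl or
    urllib) so the transfer mechanism can change with the chosen host.
    Returns True if a download happened, False if the local copy was kept.
    """
    if os.path.exists(dest):
        return False  # keep the existing local copy
    downloader(url, dest)
    return True
```

Authentication for uploads would live inside the injected callable, keeping the cache-management logic independent of the host.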

@ethanwhite (Member)

I like the idea of a designated location to download the cache file, with only a limited number of users able to upload an updated version.

I did look into connecting directly to an online PostgreSQL database, but since drake doesn't support a "read only" approach to the cache, that won't work.

Seems like there are three reasonable choices for storing the online cache at the moment:

  1. Put it on Serenity. Lab members already have permissions (or can get them) to update files on Serenity. While the server is currently only available from UF's network, that can be changed. It creates a little more security exposure, but not much.
  2. We have a DigitalOcean droplet that we are using to serve the Portal weather station data. It already has a web server running for this purpose, and at the current price we get 25 GB of storage and 1000 GB/month of transfer, which should be plenty for current usage. No extra security exposure and no need to set up a web server, but one more place to keep track of.
  3. Use Zenodo. The size limit is currently 50 GB, which ought to be sufficient. Simple web interface for updates, so no need for ssh-style file transfers.
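Whichever host is chosen, verifying the downloaded archive against a published checksum would guard against partial or stale transfers. A minimal sketch, assuming an MD5 checksum is available alongside the file (Zenodo publishes per-file checksums; the other options would need to serve one next to the archive):

```python
import hashlib

def verify_cache(path, expected_md5):
    """Check a downloaded cache archive against a published MD5 checksum.

    Reads in chunks so a multi-hundred-MB cache doesn't need to fit in
    memory. Returns True if the file matches the expected digest.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == expected_md5
```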

I'm happy to set up whatever you all think is best and set up permissions for anyone who needs them.

One complexity that we'll need to deal with in implementation is what to do after an initial download of the cache file. The user will then run the pipeline, potentially with changes, which will update the cache file. On subsequent runs do we download an updated cache file or keep the user's version? Do we provide a function that lets the user optionally update to the newest remote cache?
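One way to frame that decision: record the hash of the cache as originally downloaded, so a subsequent run can tell whether the user's local copy has diverged from it. A hypothetical sketch (none of these names exist in MATSS):

```python
def resolve_cache(local_hash, remote_hash, base_hash, prefer_remote=False):
    """Decide what to do with the cache on a subsequent run.

    `base_hash` is the hash of the cache as originally downloaded;
    comparing against it tells us whether the user changed anything.
    `prefer_remote` is the opt-in "update to the newest remote cache".
    """
    if local_hash == remote_hash:
        return "up-to-date"
    if local_hash == base_hash:
        return "pull-remote"  # user never changed it; safe to update
    return "pull-remote" if prefer_remote else "keep-local"
```

By default a diverged local cache is kept, so a user's in-progress runs are never silently overwritten.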

Finally, apologies for being negative about this today. I was overthinking things, which I wouldn't have done if I'd been paying attention to the issues in this repo, since this one laid out what you were looking for clearly.

@ha0ye (Member) commented Feb 15, 2019

Serenity would likely be suitable for our needs, and I'm not sure we need to enable off-UF access if we figure out the right workflow. Still, I lean towards Zenodo, even though it's a bit of a hacky use of their service. It would make the MATSS package more easily portable for other users, since setting up a Zenodo account and linking up a cache there is a lower barrier for individuals.

There's even an R package, though the last commit was 4 years ago. 🙀

As for immediate needs: we've done some refactoring of how retriever data gets loaded, which is keeping the current cache size around 100 MB.

Agree on discussing workflow issues -- I need to think more about goals here, as well as how to handle testing.

@ha0ye ha0ye changed the title Keep drake cache in GitHub or local? Drake cache management (location, interface, etc.) Feb 21, 2019
@diazrenata (Member, Author)

  • Store the cache on either hipergator or serenity.
    Workflow for a user:
  1. install the most recent package
  2. access/download the cache from storage
  3. inspect results; if new runs are needed:
     3a. start a hipergator run
     3b. hipergator installs the most recent MATSS
     3c. hipergator gets the cache from storage
     3d. hipergator runs
     3e. hipergator sends the new cache back to storage
  4. user can re-access the cache from storage
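The hipergator cycle in steps 3a–3e is a fixed ordering over four stages, which a sketch can make explicit by injecting each stage as a callable (the ssh/scp wrappers each stage would use are left abstract; all names here are hypothetical):

```python
def run_remote_cycle(install, pull_cache, run_pipeline, push_cache):
    """Run one remote (hipergator) cycle: install, pull, run, push.

    Each argument is an injected callable so the same ordering can be
    reused whether the stages shell out over ssh or run locally.
    """
    install()       # 3b: install the most recent MATSS
    pull_cache()    # 3c: get the cache from storage
    run_pipeline()  # 3d: run
    push_cache()    # 3e: send the new cache back to storage
```

Keeping the ordering in one place means step 3e can never be skipped after a successful run, which is what keeps the shared cache authoritative.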

@ha0ye ha0ye added this to the Spring Semester Goals milestone Mar 21, 2019
@ha0ye ha0ye added the infrastructure how MATSS runs label Mar 21, 2019
@diazrenata (Member, Author)

See weecology/MATSS-LDATS#23.

Currently we have a working solution using the hipergator. We will want a long-term solution for external users.

@ha0ye ha0ye added the HPC related to deployment on high-performance computing label Apr 23, 2019
@ha0ye ha0ye changed the title Drake cache management (location, interface, etc.) Functions to manage drake cache location Apr 23, 2019
@ha0ye (Member) commented Apr 23, 2019

Long-term, I think we want functions to push and pull a drake cache from a remotely deployed MATSS instance, but this is not a high-priority feature.
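A push/pull pair could be as thin as building the right rsync invocation over drake's `.drake/` cache directory. A sketch, with a hypothetical remote path:

```python
def cache_sync_cmd(direction, local_dir=".drake", remote="user@host:/caches/matss"):
    """Build an rsync command to push or pull the drake cache directory.

    drake keeps its cache in a hidden `.drake/` folder by default; the
    `remote` spec here is a placeholder, not a real host.
    """
    if direction == "pull":
        src, dst = remote, local_dir
    elif direction == "push":
        src, dst = local_dir, remote
    else:
        raise ValueError("direction must be 'pull' or 'push'")
    # trailing slash on src: sync directory contents, not the dir itself
    return ["rsync", "-az", "--delete", src + "/", dst]
```

In practice these would be wrapped as exported functions (e.g. a pull before running and an authenticated push after), with the remote path taken from package configuration.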

@ha0ye ha0ye added the enhancement New feature or request label Apr 23, 2019
@ha0ye ha0ye mentioned this issue Apr 23, 2019