
Functions to manage drake cache location #66

Open
diazrenata opened this issue Feb 12, 2019 · 6 comments
Labels: enhancement (New feature or request), HPC (related to deployment on high-performance computing), infrastructure (how MATSS runs)

Comments

@diazrenata (Member)
Just something to think about?

If it's in the cloud, I think this allows us to pass the cache around without re-running targets. However, the cache is sensitive to drake version changes (which is how this came to my attention). Right now it's in the .gitignore, but a version of it lives on GitHub, so I'm not sure what our consensus on it is.

@ha0ye (Member) commented Feb 12, 2019

It ballooned to nearly half a gigabyte with the BBS analysis included, so I think we're going to exhaust piggyback very quickly. I think this needs to be set up as a larger technical issue covering:

  • a designated location to download the cache file
  • authentication to upload/update the cache file (not sure about the logistics here; UFL servers will probably restrict access to the UF network, even just to download the cache file)
  • maybe also check what's causing the BBS output to be so large (probably a separate issue to address the pre-processing there?)
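The first two bullets could be sketched as a small fetch helper. Everything here is hypothetical (the URL, the function names; nothing in MATSS works this way yet), and the transfer mechanism is injected so it can follow whatever host gets chosen:

```python
import os

# Hypothetical location for the shared cache archive; the real URL would
# point at whichever host the project settles on.
CACHE_URL = "https://example.org/matss/drake-cache.tar.gz"

def fetch_cache(dest, downloader, url=CACHE_URL):
    """Download the cache archive to `dest` unless it is already present.

    `downloader` is an injected callable (e.g. a wrapper around curl or
    urllib) so the transfer mechanism can change with the chosen host.
    Returns True if a download happened, False if the local copy was kept.
    """
    if os.path.exists(dest):
        return False  # keep the existing local copy
    downloader(url, dest)
    return True
```

Authentication for uploads would live inside the injected callable, keeping the cache-management logic independent of the host.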

@ethanwhite (Member)

I like the idea of a designated location to download the cache file, with only a limited number of users able to upload an updated version.

I did look into connecting directly to an online PostgreSQL database, but since drake doesn't support a "read only" approach to the cache, that won't work.

Seems like there are three reasonable choices for storing the online cache at the moment:

  1. Put it on Serenity. Lab members already have permissions (or can get them) to update files on Serenity. While the server is currently only available from UF's network, that can be changed. It creates a little more security exposure, but not much.
  2. We have a DigitalOcean droplet that we are using to serve the Portal weather station data. It already has a web server running for this purpose, and at the current price we get 25 GB of storage and 1000 GB/month of transfer, which should be plenty for current usage. No extra security exposure and no need to set up a web server, but one more place to keep track of.
  3. Use Zenodo. The size limit is currently 50 GB, which ought to be sufficient. Simple web interface for updates, so no need for ssh-style file transfers.
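Whichever host is chosen, verifying the downloaded archive against a published checksum would guard against partial or stale transfers. A minimal sketch, assuming an MD5 checksum is available alongside the file (Zenodo publishes per-file checksums; the other options would need to serve one next to the archive):

```python
import hashlib

def verify_cache(path, expected_md5):
    """Check a downloaded cache archive against a published MD5 checksum.

    Reads in chunks so a multi-hundred-MB cache doesn't need to fit in
    memory. Returns True if the file matches the expected digest.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == expected_md5
```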

I'm happy to set up whatever you all think is best and set up permissions for anyone who needs them.

One complexity that we'll need to deal with in implementation is what to do after an initial download of the cache file. The user will then run the pipeline, potentially with changes, which will update the cache file. On subsequent runs do we download an updated cache file or keep the user's version? Do we provide a function that lets the user optionally update to the newest remote cache?
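One way to frame that decision: record the hash of the cache as originally downloaded, so a subsequent run can tell whether the user's local copy has diverged from it. A hypothetical sketch (none of these names exist in MATSS):

```python
def resolve_cache(local_hash, remote_hash, base_hash, prefer_remote=False):
    """Decide what to do with the cache on a subsequent run.

    `base_hash` is the hash of the cache as originally downloaded;
    comparing against it tells us whether the user changed anything.
    `prefer_remote` is the opt-in "update to the newest remote cache".
    """
    if local_hash == remote_hash:
        return "up-to-date"
    if local_hash == base_hash:
        return "pull-remote"  # user never changed it; safe to update
    return "pull-remote" if prefer_remote else "keep-local"
```

By default a diverged local cache is kept, so a user's in-progress runs are never silently overwritten.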

Finally, apologies for being negative about this today. I was overthinking things, which I wouldn't have done if I'd been paying attention to the issues in this repo, since this one laid out what you were looking for clearly.

@ha0ye (Member) commented Feb 15, 2019

Serenity would likely be suitable for our needs, and I'm not sure we need to enable off-UF access if we figure out the right workflow. Still, I lean towards Zenodo, even though it's a bit of a hacky use of their service. It would make the MATSS package more easily portable for other users, since setting up a Zenodo account and linking up a cache there is a lower barrier for individuals.

There's even an R package, though the last commit was 4 years ago. 🙀

As for immediate needs: we've done some refactoring of how retriever data gets loaded, which is keeping the current cache size around 100 MB.

Agree on discussing workflow issues -- I need to think more about goals here, as well as how to handle testing.

@ha0ye ha0ye changed the title Keep drake cache in GitHub or local? Drake cache management (location, interface, etc.) Feb 21, 2019
@diazrenata (Member, Author)

  • Store the cache on either hipergator or serenity.
    Workflow for a user:
  1. install the most recent package
  2. access/download the cache from storage
  3. inspect results; if new runs are needed:
     3a. start a hipergator run
     3b. hipergator installs the most recent MATSS
     3c. hipergator gets the cache from storage
     3d. hipergator runs
     3e. hipergator sends the new cache back to storage
  4. user can re-access the cache from storage
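The hipergator cycle in steps 3a–3e is a fixed ordering over four stages, which a sketch can make explicit by injecting each stage as a callable (the ssh/scp wrappers each stage would use are left abstract; all names here are hypothetical):

```python
def run_remote_cycle(install, pull_cache, run_pipeline, push_cache):
    """Run one remote (hipergator) cycle: install, pull, run, push.

    Each argument is an injected callable so the same ordering can be
    reused whether the stages shell out over ssh or run locally.
    """
    install()       # 3b: install the most recent MATSS
    pull_cache()    # 3c: get the cache from storage
    run_pipeline()  # 3d: run
    push_cache()    # 3e: send the new cache back to storage
```

Keeping the ordering in one place means step 3e can never be skipped after a successful run, which is what keeps the shared cache authoritative.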

@ha0ye ha0ye added this to the Spring Semester Goals milestone Mar 21, 2019
@ha0ye ha0ye added the infrastructure how MATSS runs label Mar 21, 2019
@diazrenata (Member, Author)

See weecology/MATSS-LDATS#23.

Currently we have a working solution using the hipergator. We will want a long-term solution for external users.

@ha0ye ha0ye added the HPC related to deployment on high-performance computing label Apr 23, 2019
@ha0ye ha0ye changed the title Drake cache management (location, interface, etc.) Functions to manage drake cache location Apr 23, 2019
@ha0ye (Member) commented Apr 23, 2019

Long-term, I think we want functions to push and pull a drake cache from a remotely deployed MATSS instance, but this is not a high-priority feature.
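A push/pull pair could be as thin as building the right rsync invocation over drake's `.drake/` cache directory. A sketch, with a hypothetical remote path:

```python
def cache_sync_cmd(direction, local_dir=".drake", remote="user@host:/caches/matss"):
    """Build an rsync command to push or pull the drake cache directory.

    drake keeps its cache in a hidden `.drake/` folder by default; the
    `remote` spec here is a placeholder, not a real host.
    """
    if direction == "pull":
        src, dst = remote, local_dir
    elif direction == "push":
        src, dst = local_dir, remote
    else:
        raise ValueError("direction must be 'pull' or 'push'")
    # trailing slash on src: sync directory contents, not the dir itself
    return ["rsync", "-az", "--delete", src + "/", dst]
```

In practice these would be wrapped as exported functions (e.g. a pull before running and an authenticated push after), with the remote path taken from package configuration.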

@ha0ye ha0ye added the enhancement New feature or request label Apr 23, 2019
@ha0ye ha0ye mentioned this issue Apr 23, 2019