
Centralize osf data links #304

Open
jerlich opened this issue Jul 14, 2020 · 8 comments
Labels
Infrastructure: Things related to the workflows (e.g., book generation, process notebooks)

Comments

@jerlich
Contributor

jerlich commented Jul 14, 2020

It seems like poor design to have the links to datasets peppered throughout the tutorials. One solution would be a CSV with columns "dataset" and "url".

Then if we change the URL for a dataset, it only needs updating in one spot. It also means that if the China team wants to host our data in China, we can just change one file.
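For concreteness, the file might look something like this (the dataset names and URLs here are made up):

```csv
dataset,url
steinmetz,https://osf.io/xxxxx/download
stringer,https://osf.io/yyyyy/download
```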

Would you be OK with this PR? If so, I can work on the changes.

@mwaskom
Contributor

mwaskom commented Jul 14, 2020

Seems like an OK idea, but we have a rule that the tutorials can't use pandas, so downloading and parsing the centralized CSV file is going to be a pain.
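To be concrete, the stdlib-only version would be something like this sketch (the CSV URL is a hypothetical placeholder), which works but is clunkier than a one-line pd.read_csv:

```python
# Stdlib-only sketch of fetching and parsing the proposed CSV;
# the URL below is a hypothetical placeholder.
import csv
import io
from urllib.request import urlopen

CSV_URL = "https://example.org/data_links.csv"  # hypothetical location

with urlopen(CSV_URL) as resp:
    text = resp.read().decode("utf-8")

# Build a {dataset: url} lookup from the two-column CSV.
data_urls = {row["dataset"]: row["url"] for row in csv.DictReader(io.StringIO(text))}
```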

I'm +0 on it. Good in principle, but work on the tutorials is fairly distributed right now, and it has the potential to cause conflicts. For our workflow, I don't think it saves much effort, though I appreciate how it might make your life easier.

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

Why was pandas banned?

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

Would a utils.datafetch module be allowed, with a function that takes a dataset name and returns an ndarray (or whatever type is specified in the CSV)?
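Something along these lines, say (the registry URL, dataset storage format, and function name are all hypothetical):

```python
# Hypothetical sketch of utils/datafetch.py; the registry URL and the
# assumption that datasets are stored as .npy files are made up.
import csv
import io
from urllib.request import urlopen

import numpy as np

REGISTRY_URL = "https://example.org/data_links.csv"  # hypothetical

def fetch_dataset(name):
    """Look up `name` in the central CSV and return its contents as an ndarray."""
    with urlopen(REGISTRY_URL) as resp:
        reader = csv.DictReader(io.StringIO(resp.read().decode("utf-8")))
        urls = {row["dataset"]: row["url"] for row in reader}
    with urlopen(urls[name]) as resp:
        return np.load(io.BytesIO(resp.read()))
```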

@txperl

txperl commented Jul 14, 2020

I agree. I think this would save a lot of time for localization.

There are only about 10-20 places in the tutorials that load datasets, so making the change wouldn't require much time or complex logic.

Also, some datasets are downloaded and used directly, while others are loaded from a local copy first; that inconsistency isn't great.

@mwaskom
Contributor

mwaskom commented Jul 14, 2020

> Why was pandas banned?

Not my decision, but the scope of libraries introduced to the students was intentionally limited to numpy/scipy/matplotlib/scikit-learn to avoid overwhelming them. This is most easily enforced by not having pandas in the environment file that the tests run against.

> Would a utils.datafetch module be allowed, with a function that takes a dataset name and returns an ndarray?

We've discussed implementing a centralized NMA helper-function library a few times, but those conversations never converged. With the Colab workflow, you'd need to pip install it in each notebook.

So either we need to (A) copy-paste a little helper function everywhere we want to use this pattern, or (B) stand up a centralized utility library while the course is live. I'm -1 on (B), and concerned about the potential for subtle bugs in (A).
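For concreteness, the copy-pasted helper in (A) would be something like this sketch (the URL and filename are hypothetical):

```python
# Sketch of the little per-notebook helper that option (A) would
# copy-paste around; the URL and filename below are hypothetical.
import os
from urllib.request import urlretrieve

def download_if_missing(url, fname):
    """Fetch `url` to `fname` unless a local copy already exists."""
    if not os.path.exists(fname):
        urlretrieve(url, fname)
    return fname

download_if_missing("https://osf.io/xxxxx/download", "steinmetz.npz")
```

Any bug fixed in one copy of that cell would have to be re-fixed in every notebook that pasted it, which is where the subtle-bug worry comes from.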

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

Got it. I would have voted for a central NMA helper module on PyPI.

OK, we'll stick with our find/replace scripts.

@mwaskom
Contributor

mwaskom commented Jul 14, 2020

> Got it. I would have voted for a central NMA helper module on PyPI.

Agreed; I was a proponent. (Although we'd be iterating on that just as fast as on the tutorials themselves, so pushing releases to PyPI would have been a limiting factor, and versioning could have been a nightmare. You can point pip at GitHub, but we'd have needed a whole separate CI pipeline, etc.) Still strongly in favor of bringing that online for version 2.0, once we have a better sense of common patterns across the tutorials. Robust/abstract data loading is an obvious one.

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

There is a solution that doesn't (I think) have any of the problems you mentioned, @mwaskom. Instead of putting the code in a module that needs to run in each notebook, we just stand up a public service like https://dataurl.neuromatchacademy.org?dataset=steinmetz&location=us and https://dataurl.neuromatchacademy.org?dataset=steinmetz&location=china that returns the URL of where to fetch the data. The service could even point back to a CSV file in course-content.

It's a bad solution in that the service is a new dependency, but it's so simple that it could just be an nginx configuration.
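To show how little logic the service needs, here is a stdlib-Python sketch of the whole thing (the dataset table, mirror hostname, and port are all made up):

```python
# Stdlib sketch of the redirect service; the dataset table,
# mirror hostname, and port below are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# This table could itself be generated from the CSV in course-content.
DATA_URLS = {
    ("steinmetz", "us"): "https://osf.io/xxxxx/download",
    ("steinmetz", "china"): "https://mirror.example.cn/steinmetz.npz",
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        q = parse_qs(urlparse(self.path).query)
        key = (q.get("dataset", [""])[0], q.get("location", ["us"])[0])
        url = DATA_URLS.get(key)
        if url is None:
            self.send_error(404, "unknown dataset/location")
        else:
            self.send_response(302)  # redirect the client to the real data URL
            self.send_header("Location", url)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()
```

The nginx version would just map the two query args to a URL and `return 302`.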

@spirosChv added the Infrastructure label on Jan 26, 2022