
Centralize osf data links #304

Open
jerlich opened this issue Jul 14, 2020 · 8 comments
Labels
Infrastructure: Things related to the workflows (e.g., book generation, process notebooks)

Comments

@jerlich
Contributor

jerlich commented Jul 14, 2020

It seems like poor design to have the links to datasets peppered throughout the tutorials. One solution would be a CSV with columns "dataset" and "url".

Then if we change the URL for a dataset, it only needs updating in one spot. It also means that if the China team wants to host our data in China, we can just change one file.
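For concreteness, the file might look something like this (the dataset names and URLs here are made up):

```csv
dataset,url
steinmetz,https://osf.io/xxxxx/download
stringer,https://osf.io/yyyyy/download
```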

Would you be OK with this PR? If so, I can work on the changes.

@mwaskom
Contributor

mwaskom commented Jul 14, 2020

Seems like an OK idea, but we have a rule that the tutorials can't use pandas, so downloading and parsing the centralized CSV file is going to be a pain.
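To be concrete, the stdlib-only version would be something like this sketch (the CSV URL is a hypothetical placeholder), which works but is clunkier than a one-line pd.read_csv:

```python
# Stdlib-only sketch of fetching and parsing the proposed CSV;
# the URL below is a hypothetical placeholder.
import csv
import io
from urllib.request import urlopen

CSV_URL = "https://example.org/data_links.csv"  # hypothetical location

with urlopen(CSV_URL) as resp:
    text = resp.read().decode("utf-8")

# Build a {dataset: url} lookup from the two-column CSV.
data_urls = {row["dataset"]: row["url"] for row in csv.DictReader(io.StringIO(text))}
```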

I'm +0 on it. Good in principle, but work on the tutorials is fairly distributed right now, and it has the potential to cause conflicts. For our workflow, I don't think it saves much effort, though I appreciate how it might make your life easier.

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

Why was pandas banned?

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

Would a utils.datafetch module be allowed, with a function that takes a dataset name and returns an ndarray (or whatever type is specified in the CSV)?
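Something along these lines, say (the registry URL, dataset storage format, and function name are all hypothetical):

```python
# Hypothetical sketch of utils/datafetch.py; the registry URL and the
# assumption that datasets are stored as .npy files are made up.
import csv
import io
from urllib.request import urlopen

import numpy as np

REGISTRY_URL = "https://example.org/data_links.csv"  # hypothetical

def fetch_dataset(name):
    """Look up `name` in the central CSV and return its contents as an ndarray."""
    with urlopen(REGISTRY_URL) as resp:
        reader = csv.DictReader(io.StringIO(resp.read().decode("utf-8")))
        urls = {row["dataset"]: row["url"] for row in reader}
    with urlopen(urls[name]) as resp:
        return np.load(io.BytesIO(resp.read()))
```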

@txperl

txperl commented Jul 14, 2020

I agree. I think this would save a lot of time for localization.

There are only about 10-20 places in the tutorials that load datasets, so making the change wouldn't require much time or complex logic.

Also, some datasets are downloaded and used directly, while others are loaded from a local copy first; that inconsistency isn't great.

@mwaskom
Contributor

mwaskom commented Jul 14, 2020

> Why was pandas banned?

Not my decision, but the scope of libraries introduced to the students was intentionally limited to numpy/scipy/matplotlib/scikit-learn to avoid overwhelming them. This is most easily enforced by not having pandas in the environment file that the tests run against.

> Would a utils.datafetch module be allowed, with a function that takes a dataset name and returns an ndarray?

We've discussed implementing a centralized NMA helper-function library a few times, but those conversations never converged. With the Colab workflow, you'd need to pip install it in each notebook.

So either we need to (A) copy-paste a little helper function everywhere we want to use this pattern, or (B) stand up a centralized utility library while the course is live. I'm -1 on (B), and concerned about the potential for subtle bugs in (A).
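For concreteness, the copy-pasted helper in (A) would be something like this sketch (the URL and filename are hypothetical):

```python
# Sketch of the little per-notebook helper that option (A) would
# copy-paste around; the URL and filename below are hypothetical.
import os
from urllib.request import urlretrieve

def download_if_missing(url, fname):
    """Fetch `url` to `fname` unless a local copy already exists."""
    if not os.path.exists(fname):
        urlretrieve(url, fname)
    return fname

download_if_missing("https://osf.io/xxxxx/download", "steinmetz.npz")
```

Any bug fixed in one copy of that cell would have to be re-fixed in every notebook that pasted it, which is where the subtle-bug worry comes from.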

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

Got it. I would have voted for a central NMA helper module on PyPI.

OK, we'll stick with our find/replace scripts.

@mwaskom
Contributor

mwaskom commented Jul 14, 2020

> Got it. I would have voted for a central NMA helper module on PyPI.

Agreed; I was a proponent. (Although we'd be iterating on that just as fast as on the tutorials themselves, so pushing releases to PyPI would have been a limiting factor, and versioning could have been a nightmare. You can point pip at GitHub, but we'd have needed a whole separate CI pipeline, etc.) Still strongly in favor of bringing that online for version 2.0, once we have a better sense of common patterns across the tutorials. Robust/abstract data loading is an obvious one.

@jerlich
Contributor Author

jerlich commented Jul 14, 2020

There is a solution that doesn't (I think) have any of the problems you mentioned, @mwaskom. Instead of putting the code in a module that needs to run in each notebook, we just stand up a public service like https://dataurl.neuromatchacademy.org?dataset=steinmetz&location=us and https://dataurl.neuromatchacademy.org?dataset=steinmetz&location=china that returns the URL of where to fetch the data. The service could even point back to a CSV file in course-content.

It's a bad solution in that the service is a new dependency, but it's so simple that it could just be an nginx configuration.
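To show how little logic the service needs, here is a stdlib-Python sketch of the whole thing (the dataset table, mirror hostname, and port are all made up):

```python
# Stdlib sketch of the redirect service; the dataset table,
# mirror hostname, and port below are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# This table could itself be generated from the CSV in course-content.
DATA_URLS = {
    ("steinmetz", "us"): "https://osf.io/xxxxx/download",
    ("steinmetz", "china"): "https://mirror.example.cn/steinmetz.npz",
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        q = parse_qs(urlparse(self.path).query)
        key = (q.get("dataset", [""])[0], q.get("location", ["us"])[0])
        url = DATA_URLS.get(key)
        if url is None:
            self.send_error(404, "unknown dataset/location")
        else:
            self.send_response(302)  # redirect the client to the real data URL
            self.send_header("Location", url)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()
```

The nginx version would just map the two query args to a URL and `return 302`.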

@spirosChv added the Infrastructure label on Jan 26, 2022