
Caching Prototype #382

Open
wants to merge 29 commits into master

Conversation

@pfistfl pfistfl commented Apr 1, 2020

mb706 commented Apr 2, 2020

My thoughts about things to consider, in random order:

  • With some operations it may make more sense to save just the $state and not the result. During $train() the caching mechanism can then set the state from the cache and call $.predict(). (A rough sketch of this idea follows after this list.)
  • PipeOps should contain metadata about whether they are deterministic, and about whether their .train() and .predict() results are the same whenever the input to both is the same (use a common vs. a separate cache).
  • Caching in mlrCPO was a wrapper-PipeOp; we could also have that here. Pro: for multiple operations only the last output needs to be saved, and it makes configuring different caching mechanisms easier. Con: we get the drawbacks of wrapping: the graph structure gets obscured, and when wrapping multiple operations where just one of them is nondeterministic, everything falls apart. We may want a ppl() function that wraps a Graph optimally, so that linear deterministic segments are cached together and only the output of the last PipeOp is kept (this also works for arbitrary Graphs).
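
A minimal sketch of the first point, assuming an in-memory cache, the digest package for hashing, and a hypothetical helper cached_train(); this is not the PR's actual implementation:

    library(mlr3)
    library(mlr3pipelines)
    library(digest)

    cache_env = new.env()  # toy in-memory store; a real version would persist to disk

    cached_train = function(op, inputs) {
      # naive cache key: PipeOp class, its hyperparameter values, and the inputs
      key = digest(list(class(op)[1L], op$param_set$values, inputs))
      if (!is.null(cache_env[[key]])) {
        op$state = cache_env[[key]]   # cache hit: restore the trained state ...
        return(op$predict(inputs))    # ... and only run the (cheap) predict step
      }
      out = op$train(inputs)          # cache miss: train as usual
      cache_env[[key]] = op$state     # store the $state, not the (possibly large) result
      out
    }

    task = tsk("iris")
    op = po("scale")
    cached_train(op, list(task))  # trains and fills the cache
    cached_train(op, list(task))  # restores the state and replays the predict step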

pfistfl commented Apr 2, 2020

I added your comments and some responses.

@pfistfl pfistfl added the Status: Needs Discussion and Status: Review Needed labels Apr 2, 2020
@pfistfl pfistfl removed the Status: Needs Discussion label Apr 3, 2020
@mllg mllg self-assigned this Apr 4, 2020
@pfistfl pfistfl mentioned this pull request Apr 19, 2020
pat-s commented Apr 19, 2020

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

pfistfl commented Apr 19, 2020

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

This would actually be broader than doing it in mlr3.
As every step in the pipeline is potentially cached, this includes:

  • learners
  • filters
  • data transform pipeops

The only drawback would then be that, in order to benefit from caching, those would need to be part of a pipeline, which they should be anyway in most cases.

benchmark() would then be cached by wrapping each learner inside a GraphLearner,
i.e.
lrns = map(lrns, function(lrn) GraphLearner$new(po(lrn)))
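
A runnable sketch of that wrapping (not part of the PR; base-R lapply() replaces map(), and the two learners are only examples):

    library(mlr3)
    library(mlr3pipelines)

    lrns = list(lrn("classif.rpart"), lrn("classif.featureless"))
    # turn every Learner into a single-PipeOp GraphLearner, so that any
    # pipeline-level caching also benefits plain benchmark() calls
    lrns = lapply(lrns, function(lrn) GraphLearner$new(as_graph(po(lrn))))

    design = benchmark_grid(tsk("iris"), lrns, rsmp("cv", folds = 3))
    bmr = benchmark(design)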

pat-s commented Apr 19, 2020

OK. I again want to raise awareness that people who want to use mlr3 without pipelines should also profit from caching. For example, when filtering, users should be able to profit from caching, but that would then require a per-package caching implementation (and there might be more extension packages besides filters). This one could potentially conflict with the pipelines caching.

Also, did you look at how drake does this? In the end it is also a workflow package that aims to cache steps within the "Graph" and to detect which steps in the cache do not need to be rerun.
I'm just trying to save you time on tasks that might reinvent the wheel, even though mlr3pipelines might handle this completely differently from how drake does it.

pfistfl commented Apr 19, 2020

This one could potentially conflict with the pipelines caching.

In general, different levels of caching should not interfere; the worst case I can imagine is that things get cached twice, i.e. both the PipeOp and the Filter cache their results. This would just mean that PipeOps that operate on something that itself knows how to cache things would be adjusted to deactivate lower-level caching.

people who want to use mlr3 without pipelines should also profit from caching.

I cannot really judge this, but I am not sure I agree. I do agree that we should provide enough convenience functions to enable people to work without learning all the ins and outs of mlr3pipelines.

Currently it is no more complicated than

flt("variance") %>>% lrn("classif.rpart")

to write a filtered learner.
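
For completeness, a runnable version of that snippet wrapped into a GraphLearner (a sketch; the explicit po("filter", ...) form, the filter.frac value, and the sonar task are my additions, and mlr3filters is assumed to be installed):

    library(mlr3)
    library(mlr3filters)
    library(mlr3pipelines)

    graph = po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
      lrn("classif.rpart")
    glrn = GraphLearner$new(graph)

    glrn$train(tsk("sonar"))
    glrn$predict(tsk("sonar"))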

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines?
That would only be the case when you do train/test split and filtering manually?

I have not looked at drake in depth, but the current caching implementation that covers everything mlr3pipelines needs is < 40 lines.

pat-s commented Apr 19, 2020

This would just mean that PipeOps that operate on something that itself knows how to cache things would be adjusted to deactivate lower-level caching.

Yeah, or maybe rely on the lower-level caching implementation if it exists.

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines?
That would only be the case when you do train/test split and filtering manually?

Yeah, maybe it does not make sense and mlr3pipelines is a "must-use" in the whole game. For now I've only done toy benchmarks without any wrapped learners; all of these are still written in the old mlr.
I can probably also combine drake and mlr3pipelines, maybe even both caching approaches.

Maybe you'll find mlr3-learnerdrake interesting. We/I should extend it with a pipelines example.

pfistfl commented Apr 30, 2020

Review Bernd:

  • Should PipeOp IDs be cached?
  • Release this in a separate release
  • Add a blog post showcasing the results
  • Thoroughly test-drive it

},
stochastic = function(val) {
  if (!missing(val)) {
    private$.stochastic = assert_subset(val, c("train", "predict"))

shouldn't this be read-only and set during initialization?
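
A sketch of what this suggestion could look like; the class name CachedPipeOp and the surrounding R6 layout are made up for illustration, and only the active binding mirrors the diff above:

    library(R6)
    library(checkmate)

    CachedPipeOp = R6Class("CachedPipeOp",
      public = list(
        initialize = function(stochastic = character(0)) {
          # validate once, at construction time
          private$.stochastic = assert_subset(stochastic, c("train", "predict"))
        }
      ),
      active = list(
        # read-only: assignment after construction is an error
        stochastic = function(val) {
          if (!missing(val)) stop("stochastic is read-only")
          private$.stochastic
        }
      ),
      private = list(
        .stochastic = NULL
      )
    )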

@mb706 mb706 mentioned this pull request Oct 19, 2020
Timo-Ko commented May 16, 2024

Hi folks, we have a use case in mlr3 where caching inside mlr3pipelines would be super useful:
We have a graph learner that first imputes missing values and then an auto-tuned RF:

lrn_rf_po = po("imputeoor") %>>% rf_tuned

Now we would like to access the imputed values from our pipe.
Do you happen to have any updates / working code for the caching?
Thanks a lot in advance!
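
Not an update on the caching itself, but a sketch of one way to look at the imputed data with current mlr3pipelines, assuming lrn_rf_po is the Graph built above and task is a placeholder for the training task:

    lrn_rf_po$keep_results = TRUE           # keep intermediate results while training
    lrn_rf_po$train(task)
    imputed_task = lrn_rf_po$pipeops$imputeoor$.result[[1]]
    imputed_task$data()                     # the task data after imputation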
