
Caching Prototype #382

Open
wants to merge 29 commits into master

Conversation

@pfistfl pfistfl commented Apr 1, 2020

mb706 commented Apr 2, 2020

My thoughts about things to consider, in random order:

  • With some operations it may make more sense to save just the $state and not the result. During $train() the caching mechanism can then set the state from the cache and call $.predict(). (A rough sketch of this idea follows after this list.)
  • PipeOps should contain metadata about whether they are deterministic, and about whether their .train() and .predict() results are the same whenever the input to both is the same (use a common vs. a separate cache).
  • Caching in mlrCPO was a wrapper-PipeOp; we could also have that here. Pro: for multiple operations only the last output needs to be saved, and it makes configuring different caching mechanisms easier. Con: we get the drawbacks of wrapping: the graph structure gets obscured, and when wrapping multiple operations where just one of them is nondeterministic, everything falls apart. We may want a ppl() function that wraps a Graph optimally, so that linear deterministic segments are cached together and only the output of the last PipeOp is kept (this also works for arbitrary Graphs).
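
A minimal sketch of the first point, assuming an in-memory cache, the digest package for hashing, and a hypothetical helper cached_train(); this is not the PR's actual implementation:

    library(mlr3)
    library(mlr3pipelines)
    library(digest)

    cache_env = new.env()  # toy in-memory store; a real version would persist to disk

    cached_train = function(op, inputs) {
      # naive cache key: PipeOp class, its hyperparameter values, and the inputs
      key = digest(list(class(op)[1L], op$param_set$values, inputs))
      if (!is.null(cache_env[[key]])) {
        op$state = cache_env[[key]]   # cache hit: restore the trained state ...
        return(op$predict(inputs))    # ... and only run the (cheap) predict step
      }
      out = op$train(inputs)          # cache miss: train as usual
      cache_env[[key]] = op$state     # store the $state, not the (possibly large) result
      out
    }

    task = tsk("iris")
    op = po("scale")
    cached_train(op, list(task))  # trains and fills the cache
    cached_train(op, list(task))  # restores the state and replays the predict step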

pfistfl commented Apr 2, 2020

I added your comments and some responses.

@pfistfl pfistfl added the Status: Needs Discussion and Status: Review Needed labels Apr 2, 2020
@pfistfl pfistfl removed the Status: Needs Discussion label Apr 3, 2020
@mllg mllg self-assigned this Apr 4, 2020
@pfistfl pfistfl mentioned this pull request Apr 19, 2020
pat-s commented Apr 19, 2020

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

pfistfl commented Apr 19, 2020

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

This would actually be broader than doing it in mlr3.
As every step in the pipeline is potentially cached, this includes:

  • learners
  • filters
  • data transform pipeops

The only drawback would then be that, in order to benefit from caching, those would need to be part of a pipeline, which they should be anyway in most cases.

benchmark() would then be cached by wrapping each learner inside a GraphLearner,
i.e.
lrns = map(lrns, function(lrn) GraphLearner$new(po(lrn)))
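
A runnable sketch of that wrapping (not part of the PR; base-R lapply() replaces map(), and the two learners are only examples):

    library(mlr3)
    library(mlr3pipelines)

    lrns = list(lrn("classif.rpart"), lrn("classif.featureless"))
    # turn every Learner into a single-PipeOp GraphLearner, so that any
    # pipeline-level caching also benefits plain benchmark() calls
    lrns = lapply(lrns, function(lrn) GraphLearner$new(as_graph(po(lrn))))

    design = benchmark_grid(tsk("iris"), lrns, rsmp("cv", folds = 3))
    bmr = benchmark(design)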

pat-s commented Apr 19, 2020

OK. I again want to raise awareness that people who want to use mlr3 without pipelines should also profit from caching. For example, when filtering, users should be able to profit from caching, but that would then require a per-package caching implementation (and there might be more extension packages besides filters). This one could potentially conflict with the pipelines caching.

Also, did you look at how drake does this? In the end it is also a workflow package that aims to cache steps within the "Graph" and to detect which steps in the cache do not need to be rerun.
I'm just trying to save you time on tasks that might reinvent the wheel, even though mlr3pipelines might handle this completely differently from how drake does it.

pfistfl commented Apr 19, 2020

This one could potentially conflict with the pipelines caching.

In general, different levels of caching should not interfere; the worst case I can imagine is that things get cached twice, i.e. both the PipeOp and the Filter cache their results. This would just mean that PipeOps that operate on something that itself knows how to cache things would be adjusted to deactivate lower-level caching.

people who want to use mlr3 without pipelines should also profit from caching.

I cannot really judge this, but I am not sure I agree. I do agree that we should provide enough convenience functions to enable people to work without learning all the ins and outs of mlr3pipelines.

Currently it is no more complicated than

flt("variance") %>>% lrn("classif.rpart")

to write a filtered learner.
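
For completeness, a runnable version of that snippet wrapped into a GraphLearner (a sketch; the explicit po("filter", ...) form, the filter.frac value, and the sonar task are my additions, and mlr3filters is assumed to be installed):

    library(mlr3)
    library(mlr3filters)
    library(mlr3pipelines)

    graph = po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
      lrn("classif.rpart")
    glrn = GraphLearner$new(graph)

    glrn$train(tsk("sonar"))
    glrn$predict(tsk("sonar"))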

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines?
That would only be the case when you do train/test split and filtering manually?

I have not looked at drake in depth, but the current caching implementation that covers everything mlr3pipelines needs is < 40 lines.

pat-s commented Apr 19, 2020

This would just mean that PipeOps that operate on something that itself knows how to cache things would be adjusted to deactivate lower-level caching.

Yeah, or maybe rely on the lower-level caching implementation if it exists.

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines?
That would only be the case when you do train/test split and filtering manually?

Yeah, maybe it does not make sense and mlr3pipelines is a "must-use" in the whole game. For now I've only done toy benchmarks without any wrapped learners; all of these are still written in the old mlr.
I can probably also combine drake and mlr3pipelines, maybe even both caching approaches.

Maybe you'll find mlr3-learnerdrake interesting. We/I should extend it with a pipelines example.

pfistfl commented Apr 30, 2020

Review Bernd:

  • Should PipeOp IDs be cached?
  • Release this in a separate release
  • Add a blog post showcasing the results
  • Thoroughly test-drive it

},
stochastic = function(val) {
  if (!missing(val)) {
    private$.stochastic = assert_subset(val, c("train", "predict"))

shouldn't this be read-only and set during initialization?
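
A sketch of what this suggestion could look like; the class name CachedPipeOp and the surrounding R6 layout are made up for illustration, and only the active binding mirrors the diff above:

    library(R6)
    library(checkmate)

    CachedPipeOp = R6Class("CachedPipeOp",
      public = list(
        initialize = function(stochastic = character(0)) {
          # validate once, at construction time
          private$.stochastic = assert_subset(stochastic, c("train", "predict"))
        }
      ),
      active = list(
        # read-only: assignment after construction is an error
        stochastic = function(val) {
          if (!missing(val)) stop("stochastic is read-only")
          private$.stochastic
        }
      ),
      private = list(
        .stochastic = NULL
      )
    )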

@mb706 mb706 mentioned this pull request Oct 19, 2020
Timo-Ko commented May 16, 2024

Hi folks, we have a use case in mlr3 where caching inside mlr3pipelines would be super useful:
We have a graph learner that first imputes missing values and then an auto-tuned RF:

lrn_rf_po = po("imputeoor") %>>% rf_tuned

Now we would like to access the imputed values from our pipe.
Do you happen to have any updates / working code for the caching?
Thanks a lot in advance!
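
Not an update on the caching itself, but a sketch of one way to look at the imputed data with current mlr3pipelines, assuming lrn_rf_po is the Graph built above and task is a placeholder for the training task:

    lrn_rf_po$keep_results = TRUE           # keep intermediate results while training
    lrn_rf_po$train(task)
    imputed_task = lrn_rf_po$pipeops$imputeoor$.result[[1]]
    imputed_task$data()                     # the task data after imputation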
