fix: remove srcref after leanification #89

Merged
merged 4 commits into from
Sep 19, 2023

Conversation

sebffischer
Sponsor Member

@sebffischer sebffischer commented Sep 14, 2023

When packages are installed with --with-keep.source, serialized objects become gigantic (and their size depends on the loaded packages); see this issue: #88

I tested that the problem disappears when mlr3 is installed with --with-keep.source and this version of mlr3misc is used.
This also caused the failed workflows in the mlr3book.

@berndbischl @mllg @mb706

@mb706
Contributor

mb706 commented Sep 14, 2023

Good find. I hadn't considered that srcref can take up lots of memory.

@sebffischer
Sponsor Member Author

sebffischer commented Sep 14, 2023

I still don't understand why the object size of the measures depends on which packages are loaded. Do you have an idea why?

For example, when saving a learner state (with mlr3 installed using --with-keep.source), the result returned by pryr::object_size() depends on whether mlr3tuning (or other mlr3 packages) is loaded.

@sebffischer
Sponsor Member Author

Consider:

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")

saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 19.49 MB

vs

library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")

saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 4.00 MB

@mb706
Contributor

mb706 commented Sep 14, 2023

It gets worse:

library("mlr3")
task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")

saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 4.00 MB
x$train_task$help
#> function() {
#>       open_help(self$man)
#>     }
#> <environment: 0x563addc97780>
pryr::object_size(x)
#> 1.08 MB

Probably some kind of promise being evaluated.

@mb706
Copy link
Contributor

mb706 commented Sep 14, 2023

The srcfile attribute of the srcref attribute is an environment that contains the field lines, which is a promise:

substitute(lines, attr(attr(x$train_task$help, "srcref"), "srcfile")$original)
#> lazyLoadDBfetch(c(344L, 114431L), datafile, compressed, envhook)
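
The layout described above can be reproduced in base R without mlr3. The following is a minimal sketch, assuming a function parsed from text with keep.source = TRUE; the function f and its body are made up for illustration. Note that for text-parsed code the lines field is an ordinary character vector, not a lazy-load promise as in an installed package.

```r
# Parse a small function from text and keep its source references.
# (f and its body are hypothetical, for illustration only.)
f <- eval(parse(text = "function(x) {\n  x + 1  # add one\n}", keep.source = TRUE))

sr <- attr(f, "srcref")    # integer vector of positions with class "srcref"
sf <- attr(sr, "srcfile")  # environment describing the source file

class(sr)
is.environment(sf)
ls(sf)      # the fields of the srcfile environment, including "lines"
sf$lines    # the source text that backs the srcref
```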

@sebffischer
Sponsor Member Author

Thanks! So the promise keeps some object (whose size depends on the loaded packages) in the rds file, and once the promise is evaluated, this data is freed and the object size changes. Is that correct?

@mb706
Copy link
Contributor

mb706 commented Sep 14, 2023

The srcref itself is a short vector of line numbers / file positions (or some other kind of index), together with an attribute that contains a representation of the source file's content. This representation is an environment with a $lines member, which is a promise: it is only evaluated once someone accesses it. This happens, for example, when printing a function, which uses the srcref together with the source file content to print the text of the function. Before printing, the $lines field is an unevaluated promise that holds the expression and the (large) environment in which it would be evaluated. After printing, $lines contains the actual result (I guess the individual lines of the source file), and its evaluation environment is set to NULL.
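
The effect of forcing such a promise can be sketched in base R with delayedAssign(). This is an illustration under assumptions, not what the lazy-load mechanism does internally: big_env, holder, payload, and small are hypothetical names, and the serialized length stands in for pryr::object_size().

```r
# Environment the promise will be evaluated in; it also holds a large,
# unrelated object (hypothetical stand-in for the big lazy-load environment).
big_env <- new.env(parent = emptyenv())
big_env$payload <- runif(1e5)
big_env$small <- c("line 1", "line 2")

# Environment that owns the promise, playing the role of the srcfile.
holder <- new.env(parent = emptyenv())
delayedAssign("lines", small, eval.env = big_env, assign.env = holder)

# While the promise is unevaluated, serializing holder drags big_env along.
size_before <- length(serialize(holder, NULL))

invisible(holder$lines)  # accessing the binding forces the promise

# After forcing, only the small result remains attached to the binding.
size_after <- length(serialize(holder, NULL))

c(before = size_before, after = size_after)
```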

The offender here is the envenv entry of the environment of the $lines promise:

(I don't know how to inspect the promise's environment with base R, and even pryr seems a bit convoluted, since it can only inspect symbols, not members of environments?)

prominfo <- evalq(pi(lines), list(pi = pryr::promise_info), attr(attr(x$train_task$help, "srcref"), "srcfile")$original)

prom_env <- prominfo$env

names(prom_env$envenv)
#>  [1] "env::150" "env::151" "env::152" "env::10"  "env::157" "env::13"
#> .......

It appears to contain lots of environments. Maybe they are all the environments that can be accessed from within mlr3, or maybe they are all the environments loaded in total? In the latter case, the influence of loading other packages would be obvious; in the former, the influence would be indirect, since loading other packages makes the dictionaries, like mlr_learners, bigger.

Interestingly, printing a single function from within x makes the whole object smaller, since the "srcfile" attribute is an environment that is shared between all functions in the R6 object (and all other functions loaded from the package, or loaded from the same RDS-file). There is only a single $lines promise that can be triggered.

x = readRDS(pth)
y = readRDS(pth)
pryr::object_size(x)
#> 4.00 MB
pryr::object_size(y)
#> 4.00 MB
x$train_task$help
#> function() {
#>       open_help(self$man)
#>     }
#> <environment: 0x563addc97780>
pryr::object_size(x)
#> 1.08 MB
pryr::object_size(y)
#> 4.00 MB
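
The sharing of one srcfile environment between functions can also be checked in isolation. A small base R sketch, assuming two functions parsed from the same text; f, g, and the helper srcfile_of are hypothetical names introduced here:

```r
# Parse two function definitions from the same source text.
exprs <- parse(
  text = c("f <- function() 1", "g <- function() 2"),
  keep.source = TRUE
)

# Evaluate both definitions into a fresh environment.
env <- new.env()
for (e in exprs) eval(e, env)

# Hypothetical helper: extract the srcfile environment of a function.
srcfile_of <- function(fun) attr(attr(fun, "srcref"), "srcfile")

# Both functions point at the very same srcfile environment, so anything
# stored in it (such as a forced or unforced lines field) is shared.
identical(srcfile_of(env$f), srcfile_of(env$g))
```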

It is also possible to trigger the lines promise before saving, which removes the offending environment and makes all objects the same size again. The following gives 1.08 MB for me, with and without the library(mlr3verse) line.

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")
learner$state$train_task$help
#> function() {
#>       open_help(self$man)
#>     }
#> <environment: 0x563addc97780>
saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 1.08 MB
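
Forcing the promise is one workaround; the PR title describes the other route, dropping the srcref altogether. The sketch below shows that idea on a single function with utils::removeSource(). It is not the code from R/leanify.R, and f is a made-up example function.

```r
# A function that carries source references (hypothetical example).
f <- eval(parse(text = "function(x) {\n  x + 1\n}", keep.source = TRUE))

# Strip the srcref (and with it the reference to the srcfile environment).
g <- utils::removeSource(f)

is.null(attr(g, "srcref"))

# Serialized size with and without the source references.
c(
  with_srcref = length(serialize(f, NULL)),
  without_srcref = length(serialize(g, NULL))
)
```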

@mllg mllg merged commit 5343fc5 into main Sep 19, 2023
3 checks passed
@mllg mllg deleted the fix/leanify branch September 19, 2023 08:05