
speed-up read.snapshot #101

Merged
2 commits merged on Oct 8, 2024
Conversation

orichters
Contributor

  • using data.table::fread
> t <- Sys.time(); d <- as.quitte("/p/projects/remind/users/oliverr/data/NGFS5-S14_2024-07-18-snapshot_R5.csv"); print(Sys.time() - t)
Time difference of 1.512139 mins
> t <- Sys.time(); d <- read.snapshot("/p/projects/remind/users/oliverr/data/NGFS5-S14_2024-07-18-snapshot_R5.csv"); print(Sys.time() - t)
Time difference of 1.474763 secs
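A minimal sketch of the fread-based approach, under the assumption that the snapshot is a wide IAMC-style CSV (one column per year). The function name `read_snapshot_sketch` and the demo file are hypothetical illustrations, not the actual `read.snapshot()` implementation merged in this PR:

```r
# Hypothetical sketch of a fread-based snapshot reader; the real
# read.snapshot() in quitte may differ in details.
library(data.table)

read_snapshot_sketch <- function(file) {
  dt <- data.table::fread(file)  # fast, multi-threaded CSV parser
  # IAMC snapshots are wide (one column per year); melt to long format
  id_cols <- intersect(c("Model", "Scenario", "Region", "Variable", "Unit"),
                       names(dt))
  long <- data.table::melt(dt, id.vars = id_cols,
                           variable.name = "period", value.name = "value")
  long[!is.na(value)]  # drop missing values, as drop.na = TRUE would
}

# demo on a tiny in-memory snapshot
f <- tempfile(fileext = ".csv")
writeLines(c("Model,Scenario,Region,Variable,Unit,2020,2030",
             "REMIND,Base,World,GDP,billion US$2005/yr,100,120"), f)
d <- read_snapshot_sketch(f)
print(d)
```

The speed-up comes mostly from `fread`'s C-level parser; the melt to long format is comparatively cheap.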

@laurinks left a comment

Thanks a lot, Oliver! The time improvements look very promising. I have not run separate tests myself, but will be happy to check performance once this is merged.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q (Member) left a comment

You do not need to include a new dependency for that.

t <- Sys.time(); d <- read.quitte("/p/projects/remind/users/oliverr/data/NGFS5-S14_2024-07-18-snapshot_R5.csv", check.duplicates = FALSE, drop.na = TRUE); print(Sys.time() - t)
Time difference of 2.771989 secs   

@orichters
Contributor Author

I wasn't aware that the duplicate check was so time-consuming. Still, when reading larger snapshots, this setup seems to be much faster:

> t <- Sys.time(); d <- read.quitte("/p/projects/piam/scenariomip/scenario_explorer/data/scenarios_scenariomip_2024-10-02.csv", check.duplicates=F); print(Sys.time() - t)                   
|==================================================================| 100% 437 MB
Time difference of 1.458424 mins
> t <- Sys.time(); d <- read.snapshot("/p/projects/piam/scenariomip/scenario_explorer/data/scenarios_scenariomip_2024-10-02.csv"); print(Sys.time() - t)
Time difference of 10.31954 secs

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q (Member) commented Oct 8, 2024

> devtools::load_all()
ℹ Loading quitte
> f <- '/p/projects/piam/scenariomip/scenario_explorer/data/scenarios_scenariomip_2024-10-02.csv'
> bench::mark(
+ `read.snapshot` = { read.snapshot(f); TRUE },
+ `read.quitte`   = { read.quitte(f, check.duplicates = FALSE, drop.na = TRUE); TRUE })
# A tibble: 2 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>    <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 read.snapshot 11.3s  11.3s    0.0885      12GB    0.708     1     8      11.3s
2 read.quitte   23.6s  23.6s    0.0424    17.2GB    0.170     1     4      23.6s
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

Pretty much a constant factor (if you use all the relevant arguments).
And I would expect these files to be read once and then cached.
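The caching mentioned above could be as simple as a per-session memoization wrapper. This is a hypothetical sketch (not part of quitte); `read_snapshot_cached` and `.snapshot_cache` are illustrative names, and in practice one would pass `read.snapshot` or `read.quitte` as the `reader`:

```r
# Hypothetical per-session cache: parse each snapshot file once,
# serve repeated calls from memory.
.snapshot_cache <- new.env(parent = emptyenv())

read_snapshot_cached <- function(file, reader = read.csv) {
  key <- normalizePath(file)
  if (!exists(key, envir = .snapshot_cache)) {
    # first call for this file: actually read and parse it
    assign(key, reader(file), envir = .snapshot_cache)
  }
  # subsequent calls: return the cached object
  get(key, envir = .snapshot_cache)
}
```

With such a wrapper, even the slower `read.quitte` path is paid only once per session per file.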

@orichters orichters merged commit cffc818 into pik-piam:master Oct 8, 2024
2 checks passed
@orichters
Contributor Author

OK, I hope that is sufficiently fast for the ScenarioMIP people. Thanks for your intervention.
