Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Anirban166 · 2024-10-03T19:38:23Z

.ci/atime/tests.R

codecov · 2024-10-03T19:48:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.62%. Comparing base (b4538a0) to head (13c4b1f).
Report is 4 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6555   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          79       79           
  Lines       14450    14450           
=======================================
  Hits        14251    14251           
  Misses        199      199

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

.ci/atime/tests.R

github-actions · 2024-10-03T19:52:55Z

Generated via commit 13c4b1f

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	3 minutes and 33 seconds
Installing different package versions	7 minutes and 52 seconds
Running and plotting the test cases	2 minutes and 18 seconds

.ci/atime/tests.R

tdhock · 2024-10-03T20:59:24Z

your original test had 3 calls to forderv.

first in setup with retGrp default(=FALSE)
then in expr with retGrp=FALSE
then again in expr with retGrp=TRUE

I was wondering if we need both retGrp=TRUE and FALSE in expr?
If we have TRUE only (or FALSE only) I get this result (constant instead of linear for Fast)

I coded a for loop over doing retGrp=T or F in setup, and also in expr. (4 test cases total)
For three of these we see constant Fast (as above)

But for retGrp=T in setup then retGrp=F in expr, we see linear Fast, as in the plot below:

is this expected @jangorecki @MichaelChirico ?

looking at datatable.verbose=T in this case, for example like the code below, I see "using existing index" which to me suggests that this should be constant (not linear as observed), so is this a performance bug??

> set.seed(1);library(data.table);options(datatable.verbose=T);dt <- data.table(index = sample(N), values = sample(N));data.table:::forderv(dt, "index", retGrp = TRUE);cat("------\n");data.table:::forderv(dt, "index", retGrp = FALSE)
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
 [1]  4  5  7  2  6  9  3 10  1  8
attr(,"starts")
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"maxgrpn")
[1] 1
attr(,"anyna")
[1] 0
attr(,"anyinfnan")
[1] 0
attr(,"anynotascii")
[1] 0
attr(,"anynotutf8")
[1] 0
------
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
 [1]  4  5  7  2  6  9  3 10  1  8

I think we need explain in the comments what is expected to happen (when is expr expected to be linear vs constant).
I wonder if we need to use something like data.table:::setattr(L, "index", NULL) ??

tdhock · 2024-10-03T21:53:05Z

so here is a CI result that shows 3 constant Fast vs 1 linear Fast https://asset.cml.dev/f6abccc722845026bf081f0c9aa93b071bef4a4a?cml=png&cache-bypass=5e8ae8d9-fc0e-4b9b-9831-12442334b5d5

.ci/atime/tests.R

jangorecki · 2024-10-04T09:29:24Z

Not sure what exactly question is about.

tdhock · 2024-10-04T12:56:26Z

Hi @jangorecki we are trying to adapt your examples here #4386 (comment) to be a performance test, but we are not sure about what is the right way to test the new functionality vs the old.
Could you please fill in the TODOs below with "linear" or "constant" ?
(is it true that we expect constant time if old index is used, and linear time if new index must be computed?)

If we first run forderv(retGrp=TRUE) then running forderv(retGrp=TRUE) should take TODO time. (TDH guess TODO=constant?)
If we first run forderv(retGrp=TRUE) then running forderv(retGrp=FALSE) should take TODO time. (TDH guess TODO=constant?)
If we first run forderv(retGrp=FALSE) then running forderv(retGrp=TRUE) should take TODO time. (TDH guess TODO=linear?)
If we first run forderv(retGrp=FALSE) then running forderv(retGrp=FALSE) should take TODO time. (TDH guess TODO=constant?)

I guess I don't understand the difference between retGrp=TRUE and FALSE, is that documented somewhere?

jangorecki · 2024-10-04T16:05:34Z

Rerunning same should be constant. As for the other two cases I cannot look so much in details into it at the moment, but possibly when computing retGrp=F when having retGrp=T we may still need to run forder on the index itself to get the original order from groups (should be fast but not constant). In one of PR was fdistinct() which is very small function that was using those so the exact logic can be looked up there easily.

tdhock · 2024-10-08T11:50:41Z

I think we need explain in the comments what is expected to happen (when is expr expected to be linear vs constant).

I added a comment.

I wonder if we need to use something like data.table:::setattr(L, "index", NULL) ??

We don't, and in fact we don't even need to run forderv in setup, because the first memory measurement run sets the index, for use in subsequent time measurements.

Rerunning same should be constant

Thanks, that helps! I edited the PR so that it only has these two test cases. (instead of all four described above for which we do not have a clear expectation).

I think this PR should be ready to merge now. @Anirban166 can you please review?

The atime results are missing Slow/Fast for some test cases, tdhock/atime#64 which is fixed in github atime (to send to CRAN later this week).

Anirban166 · 2024-10-08T23:27:54Z

I think this PR should be ready to merge now. @Anirban166 can you please review?

Something that I see here, in my fork, and also when I run the tests locally on my machine (sharing the generated plot below, it uses the fixed/latest atime version from GitHub) is that the lines for these two cases in terms of memory isn't being shown for versions other than 'Slow'/'Fast': (base, HEAD, CRAN, and merge-base are stuck below along the x-axis line)

Their individual plots:

retGrp=T	retGrp=F

Is that expected or within reason? (asking since for the other tests I don't notice such; also are these results consistent with what you've been observing locally?)
Otherwise, it looks great to me! (and the remaining tests look unaffected/normal)

The atime results are missing Slow/Fast for some test cases

As per the CI results we're getting, 'some test cases' more specifically would be all the other tests aside from the forderv cases we are introducing here, as they are missing labels for Slow/Fast/Before/Fixed/Regression if you look at both plots (preview and full version)

tdhock/atime#64 which is fixed in github atime (to send to CRAN later this week).

Sweet!

tdhock · 2024-10-09T01:34:21Z

the lines for these two cases in terms of memory isn't being shown for versions other than 'Slow'/'Fast': (base, HEAD, CRAN, and merge-base are stuck below along the x-axis line)

thanks for sharing. This result implies that no memory is used/measurable during the test for those versions. I don't understand why though.

.ci/atime/tests.R

Anirban166 · 2024-10-11T07:33:14Z

.ci/atime/tests.R

+    dt <- data.table(group = rep(1:2, l=N))
+  }),
+  expr = substitute({
+    old.opt <- options(datatable.forder.auto.index = TRUE) # required for test, un-documented, comments in forder.c say it is for debugging only.


I think Jan mentioned that it is only for 'debugging' since it was not exported (coming from his comment here), but we can probably ask him for more information if we want to document it or have more details (otherwise yes, not having this option set to TRUE will not show performance improvement)

Added the test with minor changes to comments

5a15dd2

Anirban166 requested a review from tdhock as a code owner October 3, 2024 19:38

Narrow down on N

7e822ad

tdhock reviewed Oct 3, 2024

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

tdhock reviewed Oct 3, 2024

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

Go with default N spec

4ecb240

Set seed for reproducibility given the use of sample()

b0c10f5

tdhock reviewed Oct 3, 2024

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

Anirban166 and others added 2 commits October 3, 2024 13:05

rm sort reuse option

ba72deb

for loop over retGrp=T,F

c519de1

Toby Dylan Hocking added 4 commits October 3, 2024 17:11

historical sha1s from commits page instead of merge commit

f24984a

rm > from test name since that fails on CI

6e69305

remove extra blank lines

401f3d7

use setattr-> Fast always linear

7800a62

tdhock reviewed Oct 3, 2024

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

tdhock mentioned this pull request Oct 4, 2024

historical references not shown for some tests tdhock/atime#64

Closed

tdhock added 2 commits October 7, 2024 23:04

forderv retGrp=T or F, no setindex

7a16341

more detailed comment about expectations

6603d06

tdhock added 2 commits October 8, 2024 07:57

restore old option after test

6509080

log

4e69115

Anirban166 commented Oct 11, 2024

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

Anirban166 commented Oct 11, 2024

View reviewed changes

undo white space addition

13c4b1f

tdhock merged commit 9b8d496 into master Oct 11, 2024
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Anirban166 commented Oct 3, 2024

codecov bot commented Oct 3, 2024 •

edited

Loading

github-actions bot commented Oct 3, 2024 •

edited

Loading

tdhock commented Oct 3, 2024 •

edited

Loading

tdhock commented Oct 3, 2024

jangorecki commented Oct 4, 2024

tdhock commented Oct 4, 2024

jangorecki commented Oct 4, 2024 •

edited

Loading

tdhock commented Oct 8, 2024

Anirban166 commented Oct 8, 2024

tdhock commented Oct 9, 2024

Anirban166 Oct 11, 2024

Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Conversation

Anirban166 commented Oct 3, 2024

codecov bot commented Oct 3, 2024 • edited Loading

Codecov Report

github-actions bot commented Oct 3, 2024 • edited Loading

tdhock commented Oct 3, 2024 • edited Loading

tdhock commented Oct 3, 2024

jangorecki commented Oct 4, 2024

tdhock commented Oct 4, 2024

jangorecki commented Oct 4, 2024 • edited Loading

tdhock commented Oct 8, 2024

Anirban166 commented Oct 8, 2024

tdhock commented Oct 9, 2024

Anirban166 Oct 11, 2024

Choose a reason for hiding this comment

codecov bot commented Oct 3, 2024 •

edited

Loading

github-actions bot commented Oct 3, 2024 •

edited

Loading

tdhock commented Oct 3, 2024 •

edited

Loading

jangorecki commented Oct 4, 2024 •

edited

Loading