
Run comparative benchmark #11128

Merged · 39 commits · Oct 19, 2021

Conversation

fcollonval
Member

@fcollonval fcollonval commented Sep 23, 2021

References

Follow-up of the discussion at the JupyterLab dev meeting: jupyterlab/frontends-team-compass#128 (comment)

From all the tests carried out in this PR, here are the conclusions:

  • Using a single page object instead of one per test increases the execution time.

  • Running 50 experiments is a reasonable number to reach statistical confidence, while using 100 experiments increases the chance that some errors occur (mainly a test fixture teardown error; I could not figure out which fixture was producing it).

  • For the notebook with code cells, a delay of 500 ms is not captured when opening the file (1000 ms is, though).

  • Not using Galata fixtures makes the tests less flaky (and closing much faster, so the helper introduces a lot of slowdown). But a fake delay of 500 ms is still not captured on the notebook open action with code cells; it is captured if there are only markdown cells.

Anyway, this is an improvement.

Don't look at the latest benchmark results, as they compare master with this branch, which changes the benchmark test itself; hence the impact on performance.

Code changes

  • Add the interval change computation from the jupyterlab/benchmarks repository.
  • Homogenize the interface of the report and graph creation functions in BenchmarkReporter.

Change the logic of the benchmark tests:

  • Compute the reference and the challenger in the same job, to be hardware independent.
    For a PR, the reference is the common ancestor with the target branch and the challenger is the head of the PR branch.

  • Set the job to run:

    • on an approved PR review
    • on a PR review containing please run benchmark
    • every Sunday, to evaluate the changes introduced by the last week's commits
      (this trigger is commented out for now)
  • Document how to trigger a benchmark.
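The interval change computation mentioned above can be sketched roughly as follows. This is a minimal, hypothetical TypeScript sketch of a "mean relative change with 95% confidence" calculation using a normal approximation and first-order error propagation; the real code relies on @stdlib/stats inside BenchmarkReporter, and all names below are illustrative, not the actual API:

```typescript
// Hypothetical sketch only: the actual implementation lives in
// BenchmarkReporter and uses @stdlib/stats.

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

/**
 * Relative change of the challenger mean vs the reference mean,
 * with a 95% confidence margin (1.96 is the 95% z-value).
 */
function meanRelativeChange(challenger: number[], reference: number[]) {
  const mc = mean(challenger);
  const mr = mean(reference);
  const change = mc / mr - 1;
  // Standard errors of the two sample means.
  const seC = Math.sqrt(sampleVariance(challenger) / challenger.length);
  const seR = Math.sqrt(sampleVariance(reference) / reference.length);
  // Delta-method propagation of the two errors to the ratio.
  const margin = 1.96 * Math.abs(mc / mr) * Math.hypot(seC / mc, seR / mr);
  return { change, margin };
}
```

Multiplying both values by 100 yields the reported form, e.g. `3.5% ± 4.0%`.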

User-facing changes

N/A

Backwards-incompatible changes

N/A

@jupyterlab-probot

Thanks for making a pull request to jupyterlab!
To try out this branch on binder, follow this link: Binder

@fcollonval fcollonval added this to the 4.0 milestone Sep 23, 2021
@fcollonval fcollonval self-assigned this Sep 23, 2021
Member Author

@fcollonval fcollonval left a comment


Comment test to trigger the benchmark test

@fcollonval fcollonval force-pushed the ft/bench-compare branch 5 times, most recently from 81f87f7 to 5ce9ffc Compare September 23, 2021 14:41
@fcollonval fcollonval force-pushed the ft/bench-compare branch 3 times, most recently from f04ceee to 60690a1 Compare September 24, 2021 07:05
@fcollonval
Member Author

First benchmark report @ 6706bac with number of experiments = 20

The execution times (in milliseconds) are grouped by test file, test type, and browser.
For each case, the following values are computed: min <- [1st quartile - median - 3rd quartile] -> max.

Results table

| Test (chromium) | | large_code_notebook | large_md_notebook |
| --- | --- | --- | --- |
| open | actual | 6895 <- [7210 - 7622 - 7819] -> 9317 | 3608 <- [3756 - 3873 - 4038] -> 4636 |
| | expected | 6972 <- [7462 - 8080 - 8325] -> 9239 | 3624 <- [3742 - 3943 - 4084] -> 5416 |
| | mean comparison | -9.7% -- -4.7% -- 0.2% | -7.4% -- -1.8% -- 3.7% |
| switch-from | actual | 451 <- [460 - 488 - 508] -> 552 | 312 <- [341 - 363 - 372] -> 390 |
| | expected | 446 <- [478 - 488 - 527] -> 683 | 322 <- [342 - 357 - 365] -> 446 |
| | mean comparison | -10.9% -- -4.8% -- 1.3% | -6.9% -- -1.9% -- 3.1% |
| switch-to | actual | 1727 <- [1788 - 1849 - 1906] -> 2114 | 1536 <- [1620 - 1669 - 1693] -> 1777 |
| | expected | 1733 <- [1802 - 1828 - 1962] -> 2165 | 1539 <- [1603 - 1644 - 1702] -> 1945 |
| | mean comparison | -5.7% -- -1.4% -- 2.9% | -2.3% -- 0.7% -- 3.7% |
| close | actual | 5399 <- [6242 - 7170 - 8583] -> 16980 | 3369 <- [3900 - 4088 - 4316] -> 4952 |
| | expected | 5479 <- [6051 - 6757 - 8168] -> 24997 | 3550 <- [3853 - 4092 - 4282] -> 4518 |
| | mean comparison | -32.1% -- 0.2% -- 32.4% | -4.7% -- 0.6% -- 6.0% |

Changes are computed with expected as reference.
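The `min <- [1st quartile - median - 3rd quartile] -> max` summaries in these reports can be produced with something like the following illustrative TypeScript sketch (not the actual reporter code; the quantile method is an assumption):

```typescript
// Illustrative five-number summary, formatted the way the benchmark
// reports display it: "min <- [q1 - median - q3] -> max".

function quantile(sorted: number[], p: number): number {
  // Linear interpolation between the closest ranks.
  const idx = (sorted.length - 1) * p;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

function summarize(times: number[]): string {
  const s = [...times].sort((a, b) => a - b);
  const [min, q1, med, q3, max] = [0, 0.25, 0.5, 0.75, 1].map((p) =>
    Math.round(quantile(s, p))
  );
  return `${min} <- [${q1} - ${med} - ${q3}] -> ${max}`;
}
```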

@fcollonval
Member Author

fcollonval commented Sep 24, 2021

Second report for 50 experiments @ ba05cad

The execution times (in milliseconds) are grouped by test file, test type, and browser.
For each case, the following values are computed: min <- [1st quartile - median - 3rd quartile] -> max.

The mean relative comparison is computed with 95% confidence.

Results table

| Test (chromium) | | large_code_notebook | large_md_notebook |
| --- | --- | --- | --- |
| open | actual | 5292 <- [6461 - 7048 - 7321] -> 8389 | 3197 <- [3285 - 3397 - 3487] -> 3754 |
| | expected | 5062 <- [6424 - 7004 - 7236] -> 7506 | 3187 <- [3262 - 3296 - 3339] -> 3622 |
| | mean relative change | 3.5% ± 4.0% | 2.5% ± 1.4% |
| switch-from | actual | 402 <- [441 - 459 - 474] -> 534 | 287 <- [306 - 320 - 328] -> 372 |
| | expected | 399 <- [428 - 444 - 466] -> 715 | 279 <- [300 - 317 - 327] -> 404 |
| | mean relative change | 0.8% ± 3.5% | 0.7% ± 2.4% |
| switch-to | actual | 1599 <- [1635 - 1684 - 1724] -> 1915 | 1379 <- [1430 - 1459 - 1501] -> 1563 |
| | expected | 1558 <- [1624 - 1671 - 1697] -> 1880 | 1335 <- [1403 - 1433 - 1454] -> 1547 |
| | mean relative change | 1.7% ± 1.7% | 2.3% ± 1.2% |
| close | actual | 5132 <- [7442 - 11427 - 14877] -> 19522 | 3347 <- [3517 - 3670 - 3789] -> 4235 |
| | expected | 5182 <- [6805 - 12295 - 14860] -> 15736 | 3182 <- [3475 - 3569 - 3704] -> 4069 |
| | mean relative change | 0.5% ± 14.2% | 3.1% ± 2.3% |

Changes are computed with expected as reference.

@jupyterlab jupyterlab deleted a comment from github-actions bot Sep 24, 2021
@fcollonval
Member Author

fcollonval commented Sep 25, 2021

Third report - 100 experiments @ 55a1b26

The test for the markdown file crashed after 51 experiments; only results for the code case are available.

The execution times (in milliseconds) are grouped by test file, test type, and browser.
For each case, the following values are computed: min <- [1st quartile - median - 3rd quartile] -> max.

The mean relative comparison is computed with 95% confidence.

Results table

| Test (chromium) | | large_code_notebook | large_md_notebook |
| --- | --- | --- | --- |
| open | actual | 4981 <- [6424 - 7015 - 7264] -> 7740 | 3210 <- [3321 - 3385 - 3449] -> 3779 |
| | expected | 5189 <- [6616 - 7132 - 7344] -> 7951 | 3142 <- [3280 - 3330 - 3398] -> 3652 |
| | mean relative change | -1.9% ± 2.8% | Error: Data have different length |
| switch-from | actual | 404 <- [440 - 450 - 469] -> 541 | 300 <- [321 - 329 - 338] -> 399 |
| | expected | 418 <- [444 - 468 - 482] -> 605 | 277 <- [304 - 312 - 327] -> 452 |
| | mean relative change | -2.6% ± 1.7% | Error: Data have different length |
| switch-to | actual | 1488 <- [1643 - 1683 - 1734] -> 1867 | 1385 <- [1480 - 1494 - 1516] -> 1562 |
| | expected | 1500 <- [1659 - 1688 - 1723] -> 1916 | 1385 <- [1437 - 1461 - 1484] -> 1681 |
| | mean relative change | -0.0% ± 1.0% | Error: Data have different length |
| close | actual | 4999 <- [5528 - 10839 - 14874] -> 19392 | 3268 <- [3581 - 3677 - 3834] -> 4069 |
| | expected | 4783 <- [5650 - 10864 - 14596] -> 19152 | 2982 <- [3536 - 3635 - 3713] -> 4027 |
| | mean relative change | 1.3% ± 11.7% | Error: Data have different length |

Changes are computed with expected as reference.

(attached chart: lab-benchmark)
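The `Error: Data have different length` cells above come from comparing sample sets of unequal size, since the markdown run crashed partway through. A defensive guard could look like this hypothetical TypeScript sketch (not the actual reporter logic; the function name is illustrative):

```typescript
// Hypothetical guard: return the plain mean relative change when the
// two sample sets match in size, and the error string otherwise.

function relativeChangeOrError(actual: number[], expected: number[]): string {
  if (actual.length !== expected.length) {
    return "Error: Data have different length";
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const change = (mean(actual) / mean(expected) - 1) * 100;
  return `${change.toFixed(1)}%`;
}
```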

@fcollonval fcollonval force-pushed the ft/bench-compare branch 2 times, most recently from 612b245 to 5f2aa11 Compare October 8, 2021 15:54
Member Author

@fcollonval fcollonval left a comment


please run benchmark

@github-actions
Contributor

github-actions bot commented Oct 8, 2021

Benchmark report

The execution times (in milliseconds) are grouped by test file, test type, and browser.
For each case, the following values are computed: min <- [1st quartile - median - 3rd quartile] -> max.

The mean relative comparison is computed with 95% confidence.

Results table

| Test (chromium) | | large_code_notebook | large_md_notebook |
| --- | --- | --- | --- |
| open | actual | 7616 <- [8614 - 8729 - 9017] -> 9661 | 3633 <- [3731 - 3824 - 4287] -> 4419 |
| | expected | 6505 <- [7136 - 8658 - 9043] -> 10213 | 4284 <- [4642 - 4722 - 4861] -> 5007 |
| | mean relative change | ⚠️ 6.2% ± 2.7% | ⚠️ -15.3% ± 1.3% |
| switch-from | actual | 1027 <- [1080 - 1112 - 1259] -> 1819 | 684 <- [784 - 871 - 899] -> 1117 |
| | expected | 481 <- [511 - 530 - 550] -> 708 | 331 <- [356 - 363 - 374] -> 417 |
| | mean relative change | ⚠️ 119.7% ± 6.0% | ⚠️ 133.5% ± 5.1% |
| switch-to | actual | 968 <- [995 - 1005 - 1020] -> 1071 | 824 <- [925 - 942 - 958] -> 1004 |
| | expected | 1792 <- [1897 - 1936 - 1983] -> 2145 | 1649 <- [1677 - 1700 - 1726] -> 1867 |
| | mean relative change | ⚠️ -48.1% ± 0.4% | ⚠️ -45.3% ± 0.5% |
| close | actual | 1118 <- [1153 - 1178 - 1215] -> 1546 | 738 <- [770 - 782 - 835] -> 995 |
| | expected | 6953 <- [17236 - 19474 - 20559] -> 23850 | 4157 <- [4864 - 4993 - 5274] -> 5839 |
| | mean relative change | ⚠️ -93.3% ± 0.3% | ⚠️ -84.0% ± 0.3% |

Changes are computed with expected as reference.

@fcollonval fcollonval marked this pull request as ready for review October 9, 2021 10:54
@jtpio
Member

jtpio commented Oct 19, 2021

Thanks @fcollonval for working on this, looks great 👍

@blink1073 blink1073 merged commit 963299c into jupyterlab:master Oct 19, 2021
@blink1073
Contributor

Nice work @fcollonval!

@fcollonval fcollonval deleted the ft/bench-compare branch October 20, 2021 10:08
@jtpio
Member

jtpio commented Nov 10, 2021

Looks like this could potentially be backported to 3.2.x?

This change exposes some useful Galata methods such as initTestPage which don't seem to be on 3.2.x at the moment.

@fcollonval
Member Author

It breaks the galata.newPage API and adds a new dependency @stdlib/stats.

If the API change is seen as acceptable, it could indeed be ported (probably by bumping the minor version of @jupyterlab/galata to 4.1.0).

@jtpio
Member

jtpio commented Nov 11, 2021

Ah ok, then it's fine we can skip the backport for now 👍

@fcollonval
Member Author

Thinking a bit more about it: without the correction brought here, newPage won't mock (and isolate) the test as the user may think. Moreover, I did not bump the Galata major version. So I agree with you, we can backport it.

newPage is unlikely to be called when dealing with JupyterLab core or extension tests, as JupyterLab is a one-page app.

@fcollonval
Member Author

@meeseeksdev please backport to 3.2.x

@lumberbot-app

lumberbot-app bot commented Nov 11, 2021

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
       git checkout 3.2.x
       git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
       git cherry-pick -m1 963299cf92e75dde44a7953d177b616b0ded9894
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
       git commit -am 'Backport PR #11128: Run comparative benchmark'
  4. Push to a named branch:
       git push YOURFORK 3.2.x:auto-backport-of-pr-11128-on-3.2.x
  5. Create a PR against branch 3.2.x; I would have named this PR:
       "Backport PR #11128 on branch 3.2.x (Run comparative benchmark)"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

fcollonval added a commit to fcollonval/jupyterlab that referenced this pull request Nov 11, 2021
@github-actions github-actions bot added the status:resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion. label May 29, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 29, 2022
Labels: documentation · maintenance · status:resolved-locked · tag:Testing