Revisit optimizing the integration test runtime #2993

Closed
mimowo opened this issue Sep 5, 2024 · 35 comments

@mimowo
Contributor

mimowo commented Sep 5, 2024

We have had an effort to improve the integration test performance in the past, and we brought the time down to below 10min.

However, recently the integration tests suite takes over 16min based on https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-integration-main. Specific example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-integration-main/1831572082656284672

Part of the effort would be to figure out whether the slowdown can be attributed to more tests, or whether there is another reason.

The build time is particularly important during the release process, which requires a couple of builds.

@mimowo mimowo added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Sep 5, 2024
@mimowo
Contributor Author

mimowo commented Sep 5, 2024

/cc @tenzen-y @alculquicondor @gabesaba

@mimowo
Contributor Author

mimowo commented Sep 9, 2024

FYI the previous effort (for x-ref): #1737

@trasc
Contributor

trasc commented Sep 10, 2024

/assign

@tenzen-y
Member

Actually, I proposed refactoring the MultiKueue Kubeflow E2E and integration tests.
@mszadkow how about making progress on that?

@mimowo
Contributor Author

mimowo commented Sep 11, 2024

@tenzen-y what is the aim of the refactoring, can you point to the discussion? How will this impact the integration test performance?

Also, before we start optimizing the status quo, I would really appreciate an investigation into why the time increased from 9min to 16min since #1737 (comment), which was just 6 months ago. It could be just due to new tests, but maybe there is something else responsible (like slower machines or a performance regression).

EDIT: what I mean by that is that we can first check which new integration tests were added since then, and whether they can really account for the additional 7min - I'm surprised by the increase within just 6 months. Maybe some of the new tests are not optimized.

@trasc
Contributor

trasc commented Sep 13, 2024

For starters I propose #3035 which adds a way to report the time taken by individual tests.
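
As a rough illustration of the idea (not the actual #3035 change), per-spec durations can be surfaced with Ginkgo v2's ReportAfterSuite hook, for example:

package framework_test

import (
    "fmt"
    "sort"

    "github.com/onsi/ginkgo/v2"
)

// Print the ten slowest specs once the suite finishes; this is a sketch only,
// the reporting added in #3035 may look different.
var _ = ginkgo.ReportAfterSuite("spec timings", func(report ginkgo.Report) {
    specs := append([]ginkgo.SpecReport{}, report.SpecReports...)
    sort.Slice(specs, func(i, j int) bool { return specs[i].RunTime > specs[j].RunTime })
    for i, s := range specs {
        if i == 10 {
            break
        }
        fmt.Printf("%v\t%s\n", s.RunTime, s.FullText())
    }
})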

@trasc
Contributor

trasc commented Sep 13, 2024

When it comes to overall time consumption, the trend was fairly steady; the biggest bump (of around 2min) I see is around 05 Jul, when race detection was added.

$ git log 35586d7539bff45e071d39f7e85ebc87e4245c97..cd89852f2c4d921e2ec51917152f8fdea80eb87d
commit cd89852f2c4d921e2ec51917152f8fdea80eb87d
Author: Irving Mondragón <[email protected]>
Date:   Thu Jul 4 22:51:06 2024 +0200

    Remove deprecated Hugo template (#2506)

commit 6c619c6becea43415bb189067c5e94e8dcda355f
Author: Mykhailo Bobrovskyi <[email protected]>
Date:   Thu Jul 4 20:48:06 2024 +0300

    Runs the race detector on integration tests. (#2468)

@mimowo
Contributor Author

mimowo commented Sep 13, 2024

I think race detection has already proved to be useful, so the trade-off is not clear to me. I would keep it for now, but just keep in mind that we could gain 2min by changing it.
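
For context, the kind of bug the race detector catches is an unsynchronized concurrent access, as in this illustrative (non-Kueue) test that usually passes without -race but is flagged by go test -race:

package example

import (
    "sync"
    "testing"
)

// Two goroutines increment the same variable without a lock; the race
// detector reports this, while a plain test run usually still passes.
func TestRacyCounter(t *testing.T) {
    counter := 0
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // unsynchronized write
        }()
    }
    wg.Wait()
    _ = counter
}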

@tenzen-y
Member

@tenzen-y what is the aim of the refactoring, can you point to the discussion? How will this impact the integration test performance?

Also, before we start optimizing the status quo, I would really appreciate an investigation into why the time increased from 9min to 16min since #1737 (comment), which was just 6 months ago. It could be just due to new tests, but maybe there is something else responsible (like slower machines or a performance regression).

EDIT: what I mean by that is that we can first check which new integration tests were added since then, and whether they can really account for the additional 7min - I'm surprised by the increase within just 6 months. Maybe some of the new tests are not optimized.

@mimowo Actually, I meant the following discussion:
#2869 (comment)

@mimowo
Contributor Author

mimowo commented Sep 13, 2024

@mimowo Actually, I meant the following discussion:
#2869 (comment)

The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e.

@tenzen-y
Member

@mimowo Actually, I meant the following discussion:
#2869 (comment)

The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e.

Oh, I meant to mention integration testing as well,
because we have duplicated Kubeflow MultiKueue integration tests even though the core Kubeflow MultiKueue reconcilers are shared and identical.

@mimowo
Contributor Author

mimowo commented Sep 13, 2024

I see, so your suggestion is to only keep integration tests for PyTorchJob, and drop for other kubeflow Jobs? I'm open to this possibility but would expect some review of the time impact, and review of the test coverage provided by unit tests for the integrations.

@trasc
Contributor

trasc commented Sep 13, 2024

I see, so your suggestion is to only keep integration tests for PyTorchJob, and drop for other kubeflow Jobs? I'm open to this possibility but would expect some review of the time impact, and review of the test coverage provided by unit tests for the integrations.

That can be easily done, but I don't expect a huge gain from it; each of those tests takes under 3 sec to execute.

With #3039 I tried to parallelize the cluster creation for MultiKueue; in theory that can reduce the time by around 40s, but making the setup thread-safe is more challenging than expected.

Another thing we can try is to reuse the envtest clusters and the setup-manager / stop-manager operations, but this may not make too much difference with parallel running and may surface new issues due to incomplete cleanups.
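
A rough sketch of the parallel cluster creation idea from #3039 (hypothetical helper, assuming controller-runtime's envtest package; the thread-safety of the surrounding suite setup is the hard part):

package framework

import (
    "golang.org/x/sync/errgroup"
    "k8s.io/client-go/rest"
    "sigs.k8s.io/controller-runtime/pkg/envtest"
)

// startEnvTestsInParallel boots several envtest control planes concurrently
// instead of one after another.
func startEnvTestsInParallel(envs []*envtest.Environment) ([]*rest.Config, error) {
    cfgs := make([]*rest.Config, len(envs))
    var g errgroup.Group
    for i, env := range envs {
        i, env := i, env // capture loop variables (pre-Go 1.22 semantics)
        g.Go(func() error {
            cfg, err := env.Start() // starts etcd + kube-apiserver for this env
            if err != nil {
                return err
            }
            cfgs[i] = cfg
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return cfgs, nil
}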

@trasc
Contributor

trasc commented Sep 18, 2024

Adds the ability to reuse the envtest instance and replace the manager.

One follow-up to this could be to lazily start the envtest instances; this will probably have a bigger code impact, and its benefits will be visible only in suites that have fewer top-level specs than the parallelism.
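
A minimal sketch of the reuse-and-replace idea (not the actual helper; names are hypothetical): keep the envtest cluster alive and only swap the controller-runtime manager between test configurations. A real suite would additionally wait for the old manager to shut down and for the new manager's cache to sync.

package framework

import (
    "context"

    "k8s.io/client-go/rest"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

// restartManager stops the previous manager (via its cancel function) and
// starts a new one against the same envtest rest.Config with a different
// setup function.
func restartManager(cfg *rest.Config, stopOld context.CancelFunc, setup func(manager.Manager)) (context.CancelFunc, error) {
    if stopOld != nil {
        stopOld() // shut down the previously running manager
    }
    mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
    if err != nil {
        return nil, err
    }
    setup(mgr) // register the controllers/webhooks this configuration needs
    ctx, cancel := context.WithCancel(context.Background())
    go func() {
        _ = mgr.Start(ctx) // blocks until ctx is cancelled
    }()
    return cancel, nil
}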

@tenzen-y
Member

I see, so your suggestion is to only keep integration tests for PyTorchJob, and drop for other kubeflow Jobs? I'm open to this possibility but would expect some review of the time impact, and review of the test coverage provided by unit tests for the integrations.

That may be a little bit different. I'm wondering if we can add all Kubeflow MultiKueue cases only for PyTorchJob and add only basic creation cases for all other Kubeflow Jobs. That is similar to the Kubeflow Jobs integration tests (not MultiKueue).

@trasc
Contributor

trasc commented Sep 19, 2024

That may be a little bit different. I'm wondering if we can add all Kubeflow MultiKueue cases only for PyTorchJob and add only basic creation cases for all other Kubeflow Jobs. That is similar to the Kubeflow Jobs integration tests (not MultiKueue).

It is done in #3085

@trasc
Contributor

trasc commented Oct 1, 2024

With #3085 merged we will skip the slow and redundant tests in PR builds.

With this the testing time will go from 18min to around 12min for PRs, and around 14-45min for periodic builds.

@alculquicondor
Contributor

Wonderful, thank you!

@tenzen-y
Member

tenzen-y commented Oct 1, 2024

With #3085 merged we will skip the slow and redundant tests in PR builds.

With this the testing time will go from 18min to around 12min for PRs, and around 14-45min for periodic builds.

Great to see! Thanks!
Do you plan to do any further performance improvements?

@mimowo
Contributor Author

mimowo commented Oct 1, 2024

Ideally it would be nice to stay around 10min, but maybe keeping that goal is not feasible as the project grows, so I think we can close this.

Unless @trasc you have some more ideas you want to follow up with?

@mimowo
Contributor Author

mimowo commented Oct 2, 2024

The recent #3176 inspires me to ask if we could revisit reducing the podsReadyTimeout used in a couple of tests, like here or here, to use the TinyTimeout rather than the ShortTimeout. Could you check that, @mbobrovskyi or @trasc?

@mimowo
Contributor Author

mimowo commented Oct 2, 2024

@trasc in #3054 I see you introduced StartManager and used it in many test suites to optimize performance, but I see some suites still use RunManager (example from search). Is it an omission we should follow up on, or is there a reason to keep it that way?

@mbobrovskyi
Contributor

mbobrovskyi commented Oct 2, 2024

The recent #3176 inspires me to ask if we could revisit reducing the podsReadyTimeout used in a couple of tests, like here or here, to use the TinyTimeout rather than the ShortTimeout. Could you check that, @mbobrovskyi or @trasc?

Unfortunately, no. I already optimized this in #2329, and this is the minimum we can set.

@mimowo
Contributor Author

mimowo commented Oct 2, 2024

Ack

@trasc
Contributor

trasc commented Oct 2, 2024

@trasc in #3054 I see you introduced StartManager and used it in many test suites to optimize performance, but I see some suites still use RunManager (example from search). Is it an omission we should follow up on, or is there a reason to keep it that way?

In some suites we don't need to restart the manager with a different configuration.

@mimowo
Contributor Author

mimowo commented Oct 2, 2024

So, IIUC, for these suites it does not matter from a performance PoV, because the manager is started only once anyway. Still, it might be worth following up to make this consistent, as people often copy-paste existing tests, and if they copy the ones using RunManager we may not have optimal performance in the future.

@trasc
Contributor

trasc commented Oct 2, 2024

So, IIUC, for these suites it does not matter from a performance PoV, because the manager is started only once anyway. Still, it might be worth following up to make this consistent, as people often copy-paste existing tests, and if they copy the ones using RunManager we may not have optimal performance in the future.

@trasc
Contributor

trasc commented Oct 2, 2024

One thing I wanted to check was the impact of parallelism and --race usage, for which I did a couple of runs in #3142. In short we have (prow run IDs in parentheses):

                 NProcs = 4                      NProcs = 2                      NProcs = 1
With --race      13m24s (1841394016327831552)    12m38s (1841397714319839232)    14m59s (1841401466707775488)
Without --race   10m17s (1841432977301573632)    10m40s (1841429657497374720)    12m43s (1841424736974802944)

  • The overhead of --race is around 2min.
  • The benefit of using parallelism > 2 is limited.

So besides maybe dropping the --race in PR builds, which may not be the best course of action, I cannot think of anything else to try at this point.

I guess we can close this issue.

@mimowo
Contributor Author

mimowo commented Oct 2, 2024

Thanks for the summary.

Regarding NProcs 2 vs. 4, the performance differences aren't big, and are not very consistent. However, I'd prefer to keep it at 4 since higher parallelism might be helpful when we have more tests in the future, and may help us to expose flakes (a similar effect to --race).

Regarding the --race flag, it helps us to detect flakes, and we have a couple of those, so it is worth using it even at the cost of 2min.

I think what could make sense is to drop this flag for presubmits, and only use it for periodic tests. Ideally, local runs also enable the flag by default. WDYT @tenzen-y @alculquicondor @trasc ?

@mimowo
Contributor Author

mimowo commented Oct 2, 2024

I also think it would make sense to have a generic env var like PRESUBMIT, and based on it control optimizations like INTEGRATION_RACE=false or INTEGRATION_RUN_ALL=false (or directly set the INTEGRATION_TEST_FILTERS). Then, if we change our decisions about the presubmit config, we don't need to update two places. WDYT?
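
As an illustration of how such a switch could gate specs at the Ginkgo level (the PRESUBMIT variable and the "slow" label are hypothetical; in practice this would more likely be wired through the Makefile and INTEGRATION_TEST_FILTERS):

package framework

import (
    "os"

    "github.com/onsi/ginkgo/v2"
)

// Skip specs labeled "slow" when running in presubmit mode.
var _ = ginkgo.BeforeEach(func() {
    if os.Getenv("PRESUBMIT") != "true" {
        return
    }
    for _, label := range ginkgo.CurrentSpecReport().Labels() {
        if label == "slow" {
            ginkgo.Skip("slow spec skipped in presubmit builds")
        }
    }
})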

@trasc
Contributor

trasc commented Oct 2, 2024

I think what could make sense is to drop this flag for presubmits, and only use it for periodic tests. Ideally, local runs also enable the flag by default. WDYT @tenzen-y @alculquicondor @trasc ?

Even though race detection might sometimes be flaky for some parts of the code, maybe due to some unfortunate test ordering, I still think it's better to continue doing the check in presubmits.

@tenzen-y
Member

tenzen-y commented Oct 2, 2024

I think what could make sense is to drop this flag for presubmits, and only use it for periodic tests. Ideally, local runs also enable the flag by default. WDYT @tenzen-y @alculquicondor @trasc ?

Before we enabled race detection, there were many race issues in our integration tests: #2468

So, I suspect that the periodic jobs would often fail due to race issues once we disable it in the presubmits.

@alculquicondor
Contributor

I think what could make sense is to drop this flag for presubmits, and only use it for periodic tests.

I would not do that. When a race makes it to the main branch, it becomes our problem to solve. The contributor that introduced the change, unfortunately, might not be available.

@mimowo
Contributor Author

mimowo commented Oct 3, 2024

Ok, thank you for the input, we have a consensus here.

/close
Thank you for the work on the issue, the gain is great. Feel free to submit follow up PRs or open issues if you have more ideas for improvement.

@k8s-ci-robot
Contributor

@mimowo: Closing this issue.

In response to this:

Ok, thank you for the input, we have a consensus here.

/close
Thank you for the work on the issue, the gain is great. Feel free to submit follow up PRs or open issues if you have more ideas for improvement.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
