
Train process is blocked when kineto is processing traceEvents #953

Closed
staugust opened this issue Jun 19, 2024 · 2 comments
@staugust
Contributor

staugust commented Jun 19, 2024

When using on-demand profiling via dynolog and kineto, we noticed that when a profiling request is configured with a fixed number of iterations, the last profiled iteration takes noticeably longer than the others. The training process is blocked at optimizer.step(), which calls step in kineto; ultimately, inside performRunLoop, libkineto::api().client()->stop() accounts for most of the time.

By contrast, processTraceInternal is executed asynchronously in performRunLoop, so it does not block the torch training process.
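
To make the symptom concrete, here is a minimal, self-contained C++ sketch of the pattern described above. All names in it (FakeProfiler, the sleep standing in for training work) are hypothetical stand-ins, not kineto code; only the synchronous stop on the last profiled iteration mirrors what we observed.

```cpp
// Illustration of the symptom only; FakeProfiler is a hypothetical stand-in,
// not kineto's actual implementation.
#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

struct FakeProfiler {
  int remaining;                        // profiling iterations left
  explicit FakeProfiler(int iters) : remaining(iters) {}

  void step() {
    if (--remaining == 0) {
      stop();                           // synchronous: runs on the training thread
    }
  }

  void stop() {
    // Stands in for libkineto::api().client()->stop(): expensive work that,
    // before the fix, ran inline and stalled the last profiled iteration.
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
  }
};

int main() {
  FakeProfiler profiler(3);
  for (int i = 0; i < 3; ++i) {
    auto t0 = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(20));  // "optimizer.step()" work
    profiler.step();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  Clock::now() - t0).count();
    std::cout << "iteration " << i << ": " << ms << " ms\n";  // last one is ~320 ms
  }
}
```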

I'm wondering whether there is a plan to fix this performance issue so that on-demand profiling adds minimal overhead to the PyTorch training process. It would be great if a plan or proposal already exists; if not, I'd like to put one together later.

@sraikund16
Contributor

cc @briancoutinho

@aaronenyeshi aaronenyeshi added the bug Something isn't working label Jul 24, 2024
@sraikund16 sraikund16 self-assigned this Oct 14, 2024
facebook-github-bot pushed a commit that referenced this issue Oct 15, 2024
Summary:
This fixes issue #953. It makes `libkineto::api().client()->stop()` and `stopTraceInternal` run in `profilerThread_` so that the training process is not blocked.

Pull Request resolved: #966

Reviewed By: sanrise

Differential Revision: D64214259

Pulled By: sraikund16

fbshipit-source-id: a27398e266df8502579d49b2c7d65c1863788008
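
For readers following along, here is a rough, self-contained C++ sketch of the general approach the summary describes: hand the expensive stop work to the profiler's own worker thread so the training thread returns immediately. The ProfilerThread class below is a hypothetical stand-in, not the actual kineto patch; see PR #966 for the real change.

```cpp
// Hypothetical sketch of deferring stop work to a dedicated worker thread.
// Names and structure are illustrative, not kineto's code.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class ProfilerThread {
 public:
  ProfilerThread() : worker_([this] { run(); }) {}
  ~ProfilerThread() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Enqueue work and return immediately; the calling (training) thread is not blocked.
  void post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;   // done_ set and queue drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;
};

int main() {
  ProfilerThread profilerThread;
  // Before the fix: the stop work ran inline and stalled the last training iteration.
  // After the fix: it is posted to the profiler thread and the caller returns at once.
  profilerThread.post([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(300));  // stand-in for stop()
    std::cout << "stop + stopTraceInternal finished off the training thread\n";
  });
  std::cout << "training thread continues without waiting\n";
}
```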
@staugust
Contributor Author

As PR #966 has been merged, I think this issue can be closed. Thank you very much @sraikund16 @aaronenyeshi @briancoutinho @sanrise
