
Reproduction #9

Open
Abcdeabcd opened this issue Sep 2, 2024 · 2 comments


@Abcdeabcd

Hello, thanks for your excellent work!
I have a question about reproducing the results. I retrained for 16 epochs on a single 48 GB GPU without distributed training, and the reproduced metrics were acc: 0.4411, comp: 0.4156, overall: 0.4284, which fall short of the expected results. Could this be due to training on a single GPU, or might there be another cause?
I would greatly appreciate your answer!

@KaiqiangXiong
Owner

Ensuring stable model training is quite critical. I suggest you check the loss curve.
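To check the loss curve, here is a minimal sketch that reads the logged scalars back out of TensorBoard event files, assuming the training script logged a scalar such as `train/loss`; the log directory and tag name below are placeholders and may differ in this repo.

```python
# Minimal sketch: plot the training loss from TensorBoard event files so you can
# see whether the curve is smooth or shows spikes/divergence (e.g. after a resume).
# The log directory and scalar tag are assumptions; list the available tags with
# ea.Tags() if unsure.
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "./checkpoints/logs"   # hypothetical path to your TensorBoard logs
tag = "train/loss"               # hypothetical scalar tag

ea = EventAccumulator(log_dir)
ea.Reload()                      # load all scalar events from disk

events = ea.Scalars(tag)
steps = [e.step for e in events]
values = [e.value for e in events]

plt.plot(steps, values)
plt.xlabel("step")
plt.ylabel("loss")
plt.title("training loss")
plt.savefig("loss_curve.png")
```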

@Abcdeabcd
Author

Thank you very much for your answer. I will analyze it from the perspective of the loss curve. Is it possible that interrupting training and then resuming with the 'resume' parameter could lead to unstable convergence?
Also, do you think training on a single 48 GB GPU could affect the results?
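For context on the resume question, a minimal PyTorch sketch of what a complete resume usually restores; the helper names and checkpoint keys here are hypothetical and may not match this repo's actual resume code. If only the model weights are restored while the optimizer and LR-scheduler states restart from scratch, the effective learning rate and momentum statistics jump, which can show up as unstable convergence after resuming.

```python
# Sketch of saving/restoring a full training state in PyTorch (model, optimizer,
# scheduler, epoch). Helper names and checkpoint keys are assumptions for
# illustration only.
import torch

def save_checkpoint(path, epoch, model, optimizer, scheduler):
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def resume_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1  # epoch to continue training from
```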
