
Reproduction #9

Open
Abcdeabcd opened this issue Sep 2, 2024 · 2 comments


@Abcdeabcd

Hello, thanks for your excellent work!
I have a question about reproducing the results. I retrained for 16 epochs on a single 48 GB GPU without distributed training, and the reproduced metrics were acc: 0.4411, comp: 0.4156, overall: 0.4284, which fall short of the expected results. Could this be due to training on a single GPU, or might there be another cause?
I would greatly appreciate your answer!

@KaiqiangXiong
Owner

Ensuring stable model training is quite critical. I suggest you check the loss curve.
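To check the loss curve, here is a minimal sketch that reads the logged scalars back out of TensorBoard event files, assuming the training script logged a scalar such as `train/loss`; the log directory and tag name below are placeholders and may differ in this repo.

```python
# Minimal sketch: plot the training loss from TensorBoard event files so you can
# see whether the curve is smooth or shows spikes/divergence (e.g. after a resume).
# The log directory and scalar tag are assumptions; list the available tags with
# ea.Tags() if unsure.
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "./checkpoints/logs"   # hypothetical path to your TensorBoard logs
tag = "train/loss"               # hypothetical scalar tag

ea = EventAccumulator(log_dir)
ea.Reload()                      # load all scalar events from disk

events = ea.Scalars(tag)
steps = [e.step for e in events]
values = [e.value for e in events]

plt.plot(steps, values)
plt.xlabel("step")
plt.ylabel("loss")
plt.title("training loss")
plt.savefig("loss_curve.png")
```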

@Abcdeabcd
Author

Thank you very much for your answer. I will analyze it from the perspective of the loss curve. Is it possible that interrupting training and then resuming with the 'resume' parameter could lead to unstable convergence?
Also, do you think training on a single 48 GB GPU could affect the results?
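For context on the resume question, a minimal PyTorch sketch of what a complete resume usually restores; the helper names and checkpoint keys here are hypothetical and may not match this repo's actual resume code. If only the model weights are restored while the optimizer and LR-scheduler states restart from scratch, the effective learning rate and momentum statistics jump, which can show up as unstable convergence after resuming.

```python
# Sketch of saving/restoring a full training state in PyTorch (model, optimizer,
# scheduler, epoch). Helper names and checkpoint keys are assumptions for
# illustration only.
import torch

def save_checkpoint(path, epoch, model, optimizer, scheduler):
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def resume_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1  # epoch to continue training from
```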
