
Ask a question #21

Open
Curious-L opened this issue Mar 19, 2024 · 3 comments

Curious-L commented Mar 19, 2024

Excuse me, I've recently been studying your team's excellent work and I have a question:
My understanding of the overall process is that it first goes through step (a), then step (b) generates the prompt, and the LLM in part (c) outputs results (such as which action to take next) from that prompt. In the final step (d), the system calls the PPO algorithm separately for policy generation and compares its result with the output of the LLM. So I think what PPO is fine-tuning is actually the output of the LLM, but the description in the paper seems to indicate that the LLM itself is fine-tuned with PPO. This is the part I'm not sure about. Would you mind clarifying it for me?
Thank you very much!

ClementRomac self-assigned this Mar 19, 2024

ClementRomac (Contributor) commented Mar 19, 2024

Hey, the figure indeed summarizes the process from a high-level perspective, but the details in the paper are what really happens: we do fine-tune the whole LLM. To be precise, given the description returned by the environment and the goal, we construct a prompt (this is hardcoded). Then we give this prompt to the LLM and compute the log probabilities of all the possible actions following this prompt. This distribution is the policy (so the LLM itself is the policy), and we sample actions according to these log probabilities. After collecting N steps, we compute the PPO loss and fine-tune the whole LLM with it.
Let me know if anything is still unclear. I can also point you to pieces of code that may help you understand what is really happening.
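
In case it helps to see the idea in code, here is a minimal, self-contained sketch (not the actual code from this repository; the model name, prompt, and action list below are just placeholders). It shows the core mechanism described above: the LLM scores each candidate action by the log-probability it assigns to the action tokens following the prompt, those scores define the policy, and an action is sampled from it. During training, the same log-probabilities (computed with gradients enabled) feed the PPO loss, so the gradients flow through all LLM weights.

```python
# Sketch only: a causal LM used as an action-scoring policy.
# "gpt2", the prompt, and the action list are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def action_log_probs(prompt: str, actions: list) -> torch.Tensor:
    """Sum of the token log-probabilities of each action string, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    scores = []
    for action in actions:
        action_ids = tokenizer(" " + action, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, action_ids], dim=1)
        logits = model(input_ids).logits                        # (1, seq_len, vocab)
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
        targets = input_ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # keep only the log-probs of the action tokens (everything after the prompt)
        scores.append(token_lp[:, prompt_ids.shape[1] - 1:].sum())
    return torch.stack(scores)

prompt = ("Goal: go to the red door.\n"
          "Observation: you see a red door 3 steps ahead.\n"
          "Next action:")
actions = ["turn left", "turn right", "go forward", "toggle"]

with torch.no_grad():
    logp = action_log_probs(prompt, actions)

policy = torch.softmax(logp, dim=-1)                 # distribution over the action space
action_idx = torch.multinomial(policy, num_samples=1).item()
print(actions[action_idx], policy.tolist())

# During training, these log-probabilities are recomputed with gradients enabled and
# plugged into the PPO clipped surrogate loss; backpropagating that loss updates all
# of the LLM's weights, i.e. the whole LLM is fine-tuned, not just its outputs.
```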

Curious-L (Author) commented

Thanks!
Could you kindly tell me what computational resources are required for fine-tuning, including the dataset size, the number of tokens, and the time needed to complete an experiment (across several iterations)? Additionally, I would like to know the minimum resources needed to reproduce the experiments on a small scale. Thank you very much!

ClementRomac (Contributor) commented

Hi,

Details concerning computational resources can be found at the end of Appendix E of our paper: https://arxiv.org/abs/2302.02662.

We did not report the number of tokens, and there is no dataset when using GLAM (i.e., online RL): the data are collected on the fly by interacting with the environment.
Hope this helps.
