How to run the train_language_agent.py without using slurm #11

Open
yanxue7 opened this issue Sep 14, 2023 · 2 comments

yanxue7 commented Sep 14, 2023

Hi,

Because I don't know how to use Slurm, I tried to run train_language_agent.py directly with the command from lamorel:

python -m lamorel_launcher.launch --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py
and my config is


lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 3
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:

But I get the following error:

[2023-09-14 20:45:32,837][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 3946796) of binary: /home/yanxue/anaconda3/envs/dlp/bin/python
Error executing job with overrides: ['rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py']
Traceback (most recent call last):
  File "/home/yanxue/Grounding/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
    launch_command(accelerate_args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/yanxue/Grounding/experiments/train_language_agent.py FAILED

Failures:
[1]:
time : 2023-09-14_20:45:32
host : taizun-R282-Z96-00
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3946797)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-09-14_20:45:32
host : taizun-R282-Z96-00
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3946796)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you kindly suggest why this error happens?
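
Note: the log above only shows the launcher's traceback, and the child's error_file is <N/A>, so the actual child exception is hidden. Per the PyTorch elastic errors page linked in the log, the child exception can be surfaced by decorating the script's entrypoint with record. A minimal sketch, assuming train_language_agent.py exposes a hydra main entrypoint as the launch command suggests (config paths below are illustrative):

from torch.distributed.elastic.multiprocessing.errors import record
import hydra

@record  # records and re-raises any exception from main() into the elastic error file
@hydra.main(config_path="config", config_name="config")  # illustrative hydra config location
def main(config):
    ...  # existing training code

if __name__ == "__main__":
    main()

Re-running the same launch command with this in place should make the failure summary report the child's traceback instead of "error_file: <N/A>".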


yanxue7 commented Sep 14, 2023

I also ran the example/ppo_finetuning script on the BabyAI-MixedTrainLocal environment in lamorel, with this modified config:

rl_script_args:
  path: ???
  name_environment: 'BabyAI-MixedTrainLocal'

  #'BabyAI-GoToRedBall-v0'
  #'BabyAI-MixedTrainLocal'
  #'BabyAI-GoToRedBall-v0'
  #'BabyAI-MixedTestLocal'
  #'BabyAI-GoToRedBall-v0'
  epochs: 1000
  steps_per_epoch: 1500
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: "/home/yanxue/lamoral/pposmalltrain"

But it seems to fail to train: it only reaches a score of around 0.2, whereas your paper reports about 0.6.
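
As a generic illustration of what the clip_eps, value_loss_coef and entropy_coef values above control in a PPO update (this is not lamorel's or the example's actual code, just a sketch of the standard clipped objective):

import torch

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             clip_eps=0.2, value_loss_coef=0.5, entropy_coef=0.01):
    # Importance ratio between the updated policy and the data-collecting policy
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Clipped surrogate objective (maximized, hence the minus sign)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Critic regression towards the empirical returns
    value_loss = (returns - values).pow(2).mean()
    # Entropy bonus encourages exploration
    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy.mean()

The sketch assumes advantages are already computed (e.g. with GAE using the lam and gamma values above) and that log-probabilities are aggregated over the generated tokens.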

ClementRomac (Contributor) commented

Hi,

Concerning your first issue, the stack trace you provided is missing the actual error raised by the child processes, so I can't tell. That said, accelerate has some difficulties launching two processes on a single machine with only 1 GPU, which is why we provided a custom version of accelerate (now outdated). Could you please try these two PRs (1, 2)? Alternatively, launch the two processes manually as shown in Lamorel's documentation.

Concerning your second issue, this is weird. Let me try to launch some experiments and find out what happens.
