
worker_process_entrypoint FAILED #15

Open
prashanthcheemala opened this issue Jul 25, 2024 · 1 comment

@prashanthcheemala

llama inference start
/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/llama_toolchain/utils.py:43: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
initialize(config_path=relative_path)
Loading config from : /root/.llama/configs/inference.yaml
Yaml config:

inference_config:
  impl_config:
    impl_type: inline
    checkpoint_config:
      checkpoint:
        checkpoint_type: pytorch
        checkpoint_dir: /root/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/
        tokenizer_path: /root/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
        model_parallel_size: 1
        quantization_format: bf16
    quantization: null
    torch_seed: null
    max_seq_len: 16384
    max_batch_size: 1
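
For scale, a rough back-of-the-envelope estimate of what this config asks for in memory, assuming the published Llama 3.1 8B architecture (32 layers, 8 grouped-query KV heads, head dimension 128) and bf16 weights; the numbers are illustrative, not measured:

# Back-of-the-envelope memory estimate for Meta-Llama-3.1-8B-Instruct
# under the inference.yaml above. Architecture values (32 layers, 8 KV
# heads, head_dim 128) are the published Llama 3.1 8B figures; treat
# the result as a ballpark only.
BYTES_BF16 = 2

weights_gb = 8.0e9 * BYTES_BF16 / 1e9           # ~8B params in bf16 -> ~16 GB

n_layers, n_kv_heads, head_dim = 32, 8, 128
max_seq_len, max_batch_size = 16384, 1          # from the config above

# KV cache: one K and one V tensor per layer, per token, in bf16.
kv_gb = (2 * n_layers * n_kv_heads * head_dim
         * max_seq_len * max_batch_size * BYTES_BF16) / 1e9

print(f"weights  ~{weights_gb:.0f} GB")         # ~16 GB
print(f"kv cache ~{kv_gb:.1f} GB")              # ~2.1 GB
print(f"total    ~{weights_gb + kv_gb:.0f} GB plus activations and overhead")

With max_seq_len: 16384 the worker needs on the order of 18-20 GB for weights and KV cache alone (checkpoints are typically staged through CPU RAM before reaching the GPU), so a host with less headroom than that is a plausible candidate for the SIGKILL below; lowering max_seq_len shrinks only the KV-cache term.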


Listening on :::5000
INFO: Started server process [6765]
INFO: Waiting for application startup.

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: -9) local_rank: 0 (pid: 6774) of fn: worker_process_entrypoint (start_method: fork)
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] self._pc.join(-1)
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 170, in join
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] raise ProcessExitedException(
E0725 12:48:53.920000 127614093211456 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process ForkProcess-1:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/llama_toolchain/inference/parallel_utils.py", line 175, in launch_dist_group
elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/LLama_Agentic_System/llama3_1venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
worker_process_entrypoint FAILED


Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-07-25_12:48:53
host : ip-119-181-1-31.ec2.internal
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 6774)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 6774
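
For reference, exitcode -9 means the worker received SIGKILL, and on Linux that is most often the kernel's OOM killer rather than a bug in the code. One quick way to check (this assumes a Linux host where the kernel log is readable; dmesg may require root):

import subprocess

# Scan the kernel ring buffer for OOM-killer activity around the failure.
# Assumes Linux; `dmesg` may need elevated privileges on some systems.
log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if any(s in line for s in ("Out of memory", "oom-kill", "Killed process")):
        print(line)

If the kernel killed PID 6774, the matching lines will say so explicitly.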

@Christopher-Dreamwalker

LLaMA says:
🥸
Hi, any idea how to fix this? Root Cause (first observed failure): [0]: time : 2024-07-25_12:48:53 host : ip-119-181-1-31.ec2.internal rank : 0 (local_rank: 0) exitcode : -9 (pid: 6774) error_file: <N/A> traceback : Signal 9 (SIGKILL) received by PID 6774
🦙
The error you're seeing typically indicates that your process was terminated by the operating system, usually due to resource constraints such as running out of memory. Here are some steps you can take to try to resolve the issue:

  1. Check system logs: Look in the system logs (e.g., /var/log/syslog or /var/log/messages) around the time of the error (2024-07-25_12:48:53) for messages indicating why the process was killed.
  2. Monitor system resources: Use tools like top, htop, or atop to watch CPU, memory, and disk usage while the application runs; this helps identify which resource is being exhausted (a minimal monitoring sketch follows this list).
  3. Increase resources: If you're on an AWS EC2 instance (as the hostname ip-119-181-1-31.ec2.internal suggests), consider a larger instance type with more memory, or adjust the application's configuration to use fewer resources.
  4. Configure the OOM killer: If the Out-of-Memory (OOM) killer is terminating the process, you can tune it to spare critical processes or to kill less important ones first.
  5. Check the application's memory usage: Use profiling tools to analyze memory usage patterns; for a Python application, tools like mprof or line_profiler can help identify leaks.
  6. Implement retries or checkpoints: For long-running tasks, add retries or checkpoints so the application can resume where it left off if it is terminated.
  7. Check for resource-intensive tasks: Review the code for tasks that might be driving the system out of resources.
  8. System configuration: Make sure the system configuration suits the workload, including ulimits, kernel parameters, and other settings.

If none of these suggestions help, please provide more details about your application, its environment, and any relevant configuration, and I'll do my best to suggest a more specific fix.
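
As a follow-up to steps 2 and 5, here is a minimal monitoring sketch. It assumes the third-party psutil package is installed; the one-second sampling interval is arbitrary:

import time
import psutil  # third-party: pip install psutil

# Print this process's RSS and system-wide available memory once a second.
# Run it alongside the inference server (or point it at the worker's PID)
# to see whether checkpoint loading exhausts RAM before the OOM killer fires.
proc = psutil.Process()  # or psutil.Process(worker_pid) to watch the worker
while True:
    rss_gb = proc.memory_info().rss / 1e9
    avail_gb = psutil.virtual_memory().available / 1e9
    print(f"rss={rss_gb:.2f} GB  available={avail_gb:.2f} GB")
    time.sleep(1.0)

Watching the available figure drop toward zero just before the crash is the clearest signature of the OOM scenario described above.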
