[FEAT] Add stateful actor context and set CUDA_VISIBLE_DEVICES #3002
base: main
Conversation
CodSpeed Performance Report: Merging #3002 will not alter performance.
    _DaftActorContext = DaftActorContext()


    def get_actor_context() -> DaftActorContext:
Nit: maybe a classmethod?
I made it a function because that's what we do for get_context(). Would we want to do something different here?
I was thinking potentially a class singleton on DaftActorContext, and then a classmethod on DaftActorContext that's like get_or_create_singleton(cls).
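A minimal sketch of the suggested pattern (illustrative only; the attribute name and wiring are assumptions, not the PR's actual API):

    class DaftActorContext:
        _singleton: "DaftActorContext | None" = None

        @classmethod
        def get_or_create_singleton(cls) -> "DaftActorContext":
            # Lazily create the one shared instance on first access.
            if cls._singleton is None:
                cls._singleton = cls()
            return cls._singleton

get_actor_context() could then simply delegate to DaftActorContext.get_or_create_singleton().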
    try:
        nvidia_smi_output = subprocess.check_output(["nvidia-smi", "-x", "-q"])
    except Exception:
        nvml_h = CDLL("libnvidia-ml.so.1")
@samster25 can you take a look at this GPU discovery code?
Just discussed offline with @samster25 -- any thoughts on why we have to do this? Is it because nvidia-smi doesn't respect CUDA_VISIBLE_DEVICES? We should also write some docs about this.
We do think it's a little spooky, but it could be cool to remove our xml library dependency after this.
In my testing, nvidia-smi did not seem to respect CUDA_VISIBLE_DEVICES, but that might be something I'm doing wrong, since most people online report that it does. The reason I have this is actually that the GPU numbers output by nvidia-smi may not map to the correct device numbers: https://forums.developer.nvidia.com/t/cuda-visible-devices-being-ignored/41808/5
As for the xml library, I believe it's actually part of the Python standard library.
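For context, NVML can report the physical device count directly, sidestepping nvidia-smi's numbering. A hedged sketch of what a helper like _raw_device_count_nvml (referenced in the suggestion further down) might look like, assuming the standard NVML C API loaded via ctypes:

    from ctypes import CDLL, byref, c_uint

    def _raw_device_count_nvml() -> int:
        # Load the NVML shared library shipped with the NVIDIA driver.
        nvml_h = CDLL("libnvidia-ml.so.1")
        if nvml_h.nvmlInit() != 0:  # NVML_SUCCESS == 0
            return 0
        count = c_uint(0)
        rc = nvml_h.nvmlDeviceGetCount_v2(byref(count))
        nvml_h.nvmlShutdown()
        return count.value if rc == 0 else 0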
Took a quick first pass, but I'll probably need to give it a much more thorough review again given how much logic is changing in the PyRunner.
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main    #3002      +/-   ##
    ==========================================
    + Coverage   78.39%   78.48%   +0.09%
    ==========================================
      Files         603      610        +7
      Lines       71443    71836      +393
    ==========================================
    + Hits        56005    56384      +379
    - Misses      15438    15452       +14
daft/internal/gpu.py (Outdated)
    def cuda_visible_devices() -> list[str]:
        """Get the list of CUDA devices visible to the current process."""
        visible_devices = _parse_visible_devices()
No need to create an additional function hop here for such a simple call. It can be done in 4 lines with minimal nesting:

    cuda_visible_devices_envvar = os.getenv("CUDA_VISIBLE_DEVICES")
    if cuda_visible_devices_envvar is None:
        return [str(i) for i in range(_raw_device_count_nvml())]
    return [device.strip() for device in cuda_visible_devices_envvar.split(",") if device.strip()]
    from daft.context import _set_actor_context

    _set_actor_context(rank=rank, resource_request=resource_request)
We shouldn't have to worry about all this for Ray -- Ray takes care of setting CUDA_VISIBLE_DEVICES for you when you request a GPU (please verify what I just said though).
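One quick way to verify the Ray behavior described above (a hedged illustration; requires a GPU node and a running Ray cluster):

    import os
    import ray

    @ray.remote(num_gpus=1)
    def which_gpu() -> str:
        # Ray sets CUDA_VISIBLE_DEVICES in the worker before the task runs.
        return os.environ.get("CUDA_VISIBLE_DEVICES", "")

    ray.init()
    print(ray.get(which_gpu.remote()))  # e.g. "0"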
_set_actor_context doesn't handle setting CUDA_VISIBLE_DEVICES; it only sets the data retrieved from daft.context.get_actor_context().
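For reference, a hypothetical usage sketch of that context from inside an actor (the field names follow the _set_actor_context call site above and are assumptions, not the PR's confirmed API):

    from daft.context import get_actor_context

    ctx = get_actor_context()
    print(ctx.rank)              # this actor's index within the pool (assumed field)
    print(ctx.resource_request)  # resources assigned to this actor (assumed field)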
    }

    #[must_use]
What does must_use do, and is it fine to remove?
must_use requires that the return value of the function is not discarded when called. Result actually already includes this, so when I changed these methods to return that, I removed the #[must_use] on them. Clippy yells at me otherwise.
daft/runners/pyrunner.py (Outdated)
        return all((cpus_okay, gpus_okay, memory_okay))


    def choose_gpus_to_acquire(self, num_gpus: float) -> dict[str, float]:
Docstring? Why is this returning a dict?
Changed PyRunnerResources so it encapsulates more of the abstraction; that should take care of this.
daft/runners/pyrunner.py (Outdated)
    if chosen_gpu is None:
        raise ValueError(f"Not enough GPU resources to acquire {num_gpus} GPUs from {self}")

    return {chosen_gpu: num_gpus}
Man... I think this works, but it's kinda scary logic that seems quite brittle without us being able to test it.
We can actually likely mock out GPUs to run in our unit tests if we do some dependency injection (e.g. have a little GPUDeviceProvider class that we can replace with a mocked version at testing time). Let's talk about that tomorrow.
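A minimal sketch of that dependency-injection idea (GPUDeviceProvider is the reviewer's proposed name; everything else here is illustrative, not the PR's actual code):

    class GPUDeviceProvider:
        """Production provider: reports the GPUs actually visible to the process."""

        def visible_devices(self) -> list[str]:
            return cuda_visible_devices()  # the real discovery path

    class MockGPUDeviceProvider(GPUDeviceProvider):
        """Test double: pretends N GPUs exist so scheduling runs without hardware."""

        def __init__(self, num_gpus: int) -> None:
            self._devices = [str(i) for i in range(num_gpus)]

        def visible_devices(self) -> list[str]:
            return self._devices

The runner would take a provider instance at construction time, and unit tests would pass something like MockGPUDeviceProvider(num_gpus=8) to exercise the GPU-choosing logic on CPU-only CI machines.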
Added mocked tests!
Resolves #2896
Some details about this PR:
- Refactors PyActorPool into a specialized class, PyStatefulActor.
- Introduces PyRunnerResources to store both the available resources and the ones used by a task or actor. The runner resources include not only the amount of CPU and memory resources, but also the exact GPUs that each task/actor is using, which enables setting CUDA_VISIBLE_DEVICES in actors (see the sketch after this list).
- Resources are no longer requested up front as actor_resource_requests * num_workers, so the actor pool context now asks for them serially.
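The sketch referenced above shows the core mechanism, under the assumption that the runner has already chosen exact GPU ids for an actor (assigned_gpus is an illustrative variable, not the PR's actual name):

    import os

    # Pin this actor's process to its assigned GPUs before any CUDA library
    # initializes, so device 0 inside the actor maps to the first assigned GPU.
    assigned_gpus = ["1", "3"]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_gpus)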