[FEAT] Add stateful actor context and set CUDA_VISIBLE_DEVICES #3002

Open

kevinzwang wants to merge 16 commits into main from kevin/stateful-udf-rank

Conversation

kevinzwang (Member) commented Oct 5, 2024

Resolves #2896

Some details about this PR:

  • I moved the actor-local singleton out of PyActorPool into a specialized class PyStatefulActor
  • I changed GPU resources to be accounted for on a per-device level. This led to a new data class, PyRunnerResources, which stores both the available resources and the ones used by each task or actor. These resources cover not only the amount of CPU and memory but also the exact GPUs each task/actor is using, which is what enables setting CUDA_VISIBLE_DEVICES in actors (see the sketch below).
  • I added validation that GPU resource requests must be integers if greater than 1, which means it is no longer accurate to request actor_resource_requests * num_workers, so the actor pool context now acquires resources for the actors serially.
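A rough sketch of the per-device accounting and CUDA_VISIBLE_DEVICES idea (the names below are illustrative placeholders, not the exact API in this PR):

import os
from dataclasses import dataclass, field


@dataclass
class PyRunnerResources:
    # Placeholder shape: GPUs are tracked per device ID (fraction in use per device)
    # rather than as a single total, so the runner knows exactly which devices
    # each task or actor was given.
    num_cpus: float = 0.0
    memory_bytes: int = 0
    gpus: dict[str, float] = field(default_factory=dict)


def set_cuda_visible_devices(gpu_ids: list[str]) -> None:
    # Inside the actor process, restrict CUDA to the devices allocated to it.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_ids)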

github-actions bot added the enhancement (New feature or request) label Oct 5, 2024

codspeed-hq bot commented Oct 5, 2024

CodSpeed Performance Report

Merging #3002 will not alter performance

Comparing kevin/stateful-udf-rank (2ae44c1) with main (73ff3f3)

Summary

✅ 17 untouched benchmarks

kevinzwang marked this pull request as ready for review October 5, 2024 04:26
_DaftActorContext = DaftActorContext()


def get_actor_context() -> DaftActorContext:

Contributor:

Nit: maybe a classmethod?

Member Author:

I made it a function because that's what we do for get_context(). Would we want to do something different here?

Contributor:

I was thinking potentially a class singleton on DaftActorContext, and then a classmethod on DaftActorContext that's like get_or_create_singleton(cls)
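A minimal sketch of that pattern (hypothetical, just to illustrate the suggestion):

class DaftActorContext:
    _singleton: "DaftActorContext | None" = None

    @classmethod
    def get_or_create_singleton(cls) -> "DaftActorContext":
        # Create the shared instance on first access and reuse it afterwards.
        if cls._singleton is None:
            cls._singleton = cls()
        return cls._singleton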

try:
    nvidia_smi_output = subprocess.check_output(["nvidia-smi", "-x", "-q"])
except Exception:
    nvml_h = CDLL("libnvidia-ml.so.1")

Contributor:

@samster25 can you take a look here at this GPU discovery code?

Contributor:

Just discussed offline with @samster25 -- any thoughts on why we have to do this? Is it because nvidia-smi doesn't respect CUDA_VISIBLE_DEVICES?

We should write some docs too talking about this.

We do think it's a little spooky, but could be cool to remove our xml library dependency after this

Member Author:

In my testing, nvidia-smi did not seem to respect CUDA_VISIBLE_DEVICES but that might be something I'm doing wrong since most people online report that it does.

The reason I have this is actually that the GPU numbers outputted by nvidia-smi may not actually map to the correct device numbers: https://forums.developer.nvidia.com/t/cuda-visible-devices-being-ignored/41808/5

As for the xml library, I believe it's actually part of the Python standard library.
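For reference, a bare-bones sketch of counting devices through NVML with ctypes (not the exact code in this PR; error handling omitted, and it assumes libnvidia-ml.so.1 is installed):

from ctypes import CDLL, byref, c_uint


def _raw_device_count_nvml() -> int:
    # Ask the driver directly via NVML instead of parsing nvidia-smi's XML output.
    nvml_h = CDLL("libnvidia-ml.so.1")
    if nvml_h.nvmlInit_v2() != 0:
        return 0
    count = c_uint(0)
    nvml_h.nvmlDeviceGetCount_v2(byref(count))
    nvml_h.nvmlShutdown()
    return count.value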

jaychia (Contributor) commented Oct 7, 2024

Took a quick first pass, but I'll probably need to give it a much more thorough review again given how much logic is changing in the PyRunner


codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 81.68317% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.48%. Comparing base (272163f) to head (2ae44c1).
Report is 6 commits behind head on main.

Files with missing lines                 Patch %   Lines
src/common/resource-request/src/lib.rs   70.58%    15 Missing ⚠️
daft/runners/pyrunner.py                 91.58%     9 Missing ⚠️
daft/internal/gpu.py                     55.55%     8 Missing ⚠️
daft/context.py                          77.27%     5 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3002      +/-   ##
==========================================
+ Coverage   78.39%   78.48%   +0.09%     
==========================================
  Files         603      610       +7     
  Lines       71443    71836     +393     
==========================================
+ Hits        56005    56384     +379     
- Misses      15438    15452      +14     
Files with missing lines                                Coverage Δ
daft/dependencies.py                                    58.33% <ø> (+0.64%) ⬆️
daft/runners/ray_runner.py                              80.90% <100.00%> (+0.21%) ⬆️
...al_optimization/rules/split_actor_pool_projects.rs   95.00% <100.00%> (-0.07%) ⬇️
daft/context.py                                         79.09% <77.27%> (-0.40%) ⬇️
daft/internal/gpu.py                                    59.09% <55.55%> (-2.45%) ⬇️
daft/runners/pyrunner.py                                86.86% <91.58%> (+1.60%) ⬆️
src/common/resource-request/src/lib.rs                  66.86% <70.58%> (-0.22%) ⬇️

... and 48 files with indirect coverage changes



def cuda_visible_devices() -> list[str]:
    """Get the list of CUDA devices visible to the current process."""
    visible_devices = _parse_visible_devices()

Contributor:

No need to create additional function hop here for such a simple call. Can be done in 4 lines with minimal nesting:

cuda_visible_devices_envvar = os.getenv("CUDA_VISIBLE_DEVICES")

if cuda_visible_devices_envvar is None:
    return [str(i) for i in range(_raw_device_count_nvml())]

return [device.strip() for device in cuda_visible_devices_envvar.split(",") if device.strip()]

from daft.context import _set_actor_context

_set_actor_context(rank=rank, resource_request=resource_request)

Contributor:

We shouldn't have to worry about all this for Ray -- Ray takes care of setting CUDA_VISIBLE_DEVICES for you when you request a GPU

(please verify what I just said though)
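For context, Ray exposes the devices it assigns and sets the environment variable itself; a small illustration (requires a Ray installation and a node with at least one GPU):

import os

import ray


@ray.remote(num_gpus=1)
def check_gpu_env():
    # Ray allocates a GPU to this task and sets CUDA_VISIBLE_DEVICES in the worker
    # process; ray.get_gpu_ids() reports the device IDs assigned by Ray.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


# ray.init()
# print(ray.get(check_gpu_env.remote()))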

Member Author:

_set_actor_context doesn't handle setting CUDA_VISIBLE_DEVICES, it only sets the data retrieved from daft.context.get_actor_context()
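To make that concrete, a minimal sketch of the kind of plumbing being described (field names are assumptions, not the exact PR code):

from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class DaftActorContext:
    rank: int | None = None
    resource_request: Any | None = None


_DaftActorContext = DaftActorContext()


def get_actor_context() -> DaftActorContext:
    return _DaftActorContext


def _set_actor_context(rank: int, resource_request: Any) -> None:
    # Records per-actor metadata for get_actor_context(); it does not touch
    # CUDA_VISIBLE_DEVICES, which the runner handles separately.
    _DaftActorContext.rank = rank
    _DaftActorContext.resource_request = resource_request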

}

#[must_use]

Contributor:

What does must_use do, and is it fine to remove?

Member Author:

#[must_use] requires that the return value of the function is not discarded by the caller. Result already carries #[must_use], so when I changed these methods to return it, I removed the attribute on them; Clippy yells at me otherwise.


        return all((cpus_okay, gpus_okay, memory_okay))

    def choose_gpus_to_acquire(self, num_gpus: float) -> dict[str, float]:

Contributor:

Docstring? Why is this returning a dict?

Member Author:

Changed PyRunnerResources so it encapsulates more of the abstraction, which should take care of this.

        if chosen_gpu is None:
            raise ValueError(f"Not enough GPU resources to acquire {num_gpus} GPUs from {self}")

        return {chosen_gpu: num_gpus}

Contributor:

Man... I think this works, but it's kinda scary logic that seems quite brittle without us being able to test it.

We can actually likely mock out GPUs to run in our unit tests, if we do some dependency injection (e.g. have a little GPUDeviceProvider class that we can replace with a mocked version that we use at testing-time). Let's talk about that tomorrow.
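One possible shape for that injection point, purely as a sketch (GPUDeviceProvider is the name suggested above; everything else is hypothetical):

class GPUDeviceProvider:
    """Production provider: report the GPUs actually visible to this process."""

    def list_devices(self) -> list[str]:
        # Assumed to delegate to the cuda_visible_devices() helper from this PR's diff.
        from daft.internal.gpu import cuda_visible_devices

        return cuda_visible_devices()


class FakeGPUDeviceProvider(GPUDeviceProvider):
    """Test-time provider: pretend the machine has a fixed set of GPUs."""

    def __init__(self, device_ids: list[str]) -> None:
        self._device_ids = device_ids

    def list_devices(self) -> list[str]:
        return list(self._device_ids)

The PyRunner could then take a provider at construction time, and tests could pass in FakeGPUDeviceProvider(["0", "1"]) to exercise the allocation logic without real hardware.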

Member Author:

Added mocked tests!

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ActorPoolProject] Correctly allocate GPUs for each running Actor in the PyRunner
2 participants