[FEAT] Add stateful actor context and set CUDA_VISIBLE_DEVICES #3002

Open

kevinzwang wants to merge 16 commits into main from kevin/stateful-udf-rank

Conversation

kevinzwang (Member) commented Oct 5, 2024

Resolves #2896

Some details about this PR:

  • I moved the actor-local singleton out of PyActorPool into a specialized class PyStatefulActor
  • I changed GPU resources to be accounted for on a per-device level. This led to a new data class, PyRunnerResources, which stores both the available resources and the ones used by each task or actor. These resources cover not only the amount of CPU and memory but also the exact GPUs each task/actor is using, which is what enables setting CUDA_VISIBLE_DEVICES in actors (see the sketch below).
  • I added validation that GPU resource requests must be integers if greater than 1, which means it is no longer accurate to request actor_resource_requests * num_workers, so the actor pool context now acquires resources for the actors serially.
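A rough sketch of the per-device accounting and CUDA_VISIBLE_DEVICES idea (the names below are illustrative placeholders, not the exact API in this PR):

import os
from dataclasses import dataclass, field


@dataclass
class PyRunnerResources:
    # Placeholder shape: GPUs are tracked per device ID (fraction in use per device)
    # rather than as a single total, so the runner knows exactly which devices
    # each task or actor was given.
    num_cpus: float = 0.0
    memory_bytes: int = 0
    gpus: dict[str, float] = field(default_factory=dict)


def set_cuda_visible_devices(gpu_ids: list[str]) -> None:
    # Inside the actor process, restrict CUDA to the devices allocated to it.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_ids)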

github-actions bot added the enhancement (New feature or request) label Oct 5, 2024

codspeed-hq bot commented Oct 5, 2024

CodSpeed Performance Report

Merging #3002 will not alter performance

Comparing kevin/stateful-udf-rank (2ae44c1) with main (73ff3f3)

Summary

✅ 17 untouched benchmarks

kevinzwang marked this pull request as ready for review October 5, 2024 04:26
_DaftActorContext = DaftActorContext()


def get_actor_context() -> DaftActorContext:

Contributor:

Nit: maybe a classmethod?

Member Author:

I made it a function because that's what we do for get_context(). Would we want to do something different here?

Contributor:

I was thinking potentially a class singleton on DaftActorContext, and then a classmethod on DaftActorContext that's like get_or_create_singleton(cls)
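A minimal sketch of that pattern (hypothetical, just to illustrate the suggestion):

class DaftActorContext:
    _singleton: "DaftActorContext | None" = None

    @classmethod
    def get_or_create_singleton(cls) -> "DaftActorContext":
        # Create the shared instance on first access and reuse it afterwards.
        if cls._singleton is None:
            cls._singleton = cls()
        return cls._singleton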

try:
    nvidia_smi_output = subprocess.check_output(["nvidia-smi", "-x", "-q"])
except Exception:
    nvml_h = CDLL("libnvidia-ml.so.1")

Contributor:

@samster25 can you take a look here at this GPU discovery code?

Contributor:

Just discussed offline with @samster25 -- any thoughts on why we have to do this? Is it because nvidia-smi doesn't respect CUDA_VISIBLE_DEVICES?

We should write some docs too talking about this.

We do think it's a little spooky, but could be cool to remove our xml library dependency after this

Member Author:

In my testing, nvidia-smi did not seem to respect CUDA_VISIBLE_DEVICES but that might be something I'm doing wrong since most people online report that it does.

The reason I have this is actually that the GPU numbers outputted by nvidia-smi may not actually map to the correct device numbers: https://forums.developer.nvidia.com/t/cuda-visible-devices-being-ignored/41808/5

As for the xml library, I believe it's actually part of the Python standard library.
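For reference, a bare-bones sketch of counting devices through NVML with ctypes (not the exact code in this PR; error handling omitted, and it assumes libnvidia-ml.so.1 is installed):

from ctypes import CDLL, byref, c_uint


def _raw_device_count_nvml() -> int:
    # Ask the driver directly via NVML instead of parsing nvidia-smi's XML output.
    nvml_h = CDLL("libnvidia-ml.so.1")
    if nvml_h.nvmlInit_v2() != 0:
        return 0
    count = c_uint(0)
    nvml_h.nvmlDeviceGetCount_v2(byref(count))
    nvml_h.nvmlShutdown()
    return count.value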

jaychia (Contributor) commented Oct 7, 2024

Took a quick first pass, but I'll probably need to give it a much more thorough review again given how much logic is changing in the PyRunner


codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 81.68317% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.48%. Comparing base (272163f) to head (2ae44c1).
Report is 6 commits behind head on main.

Files with missing lines                 Patch %   Lines
src/common/resource-request/src/lib.rs   70.58%    15 Missing ⚠️
daft/runners/pyrunner.py                 91.58%     9 Missing ⚠️
daft/internal/gpu.py                     55.55%     8 Missing ⚠️
daft/context.py                          77.27%     5 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3002      +/-   ##
==========================================
+ Coverage   78.39%   78.48%   +0.09%     
==========================================
  Files         603      610       +7     
  Lines       71443    71836     +393     
==========================================
+ Hits        56005    56384     +379     
- Misses      15438    15452      +14     
Files with missing lines                                Coverage Δ
daft/dependencies.py                                    58.33% <ø> (+0.64%) ⬆️
daft/runners/ray_runner.py                              80.90% <100.00%> (+0.21%) ⬆️
...al_optimization/rules/split_actor_pool_projects.rs   95.00% <100.00%> (-0.07%) ⬇️
daft/context.py                                         79.09% <77.27%> (-0.40%) ⬇️
daft/internal/gpu.py                                    59.09% <55.55%> (-2.45%) ⬇️
daft/runners/pyrunner.py                                86.86% <91.58%> (+1.60%) ⬆️
src/common/resource-request/src/lib.rs                  66.86% <70.58%> (-0.22%) ⬇️

... and 48 files with indirect coverage changes



def cuda_visible_devices() -> list[str]:
    """Get the list of CUDA devices visible to the current process."""
    visible_devices = _parse_visible_devices()

Contributor:

No need to create additional function hop here for such a simple call. Can be done in 4 lines with minimal nesting:

cuda_visible_devices_envvar = os.getenv("CUDA_VISIBLE_DEVICES")

if cuda_visible_devices_envvar is None:
    return [str(i) for i in range(_raw_device_count_nvml())]

return [device.strip() for device in cuda_visible_devices_envvar.split(",") if device.strip()]

from daft.context import _set_actor_context

_set_actor_context(rank=rank, resource_request=resource_request)

Contributor:

We shouldn't have to worry about all this for Ray -- Ray takes care of setting CUDA_VISIBLE_DEVICES for you when you request a GPU

(please verify what I just said though)
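For context, Ray exposes the devices it assigns and sets the environment variable itself; a small illustration (requires a Ray installation and a node with at least one GPU):

import os

import ray


@ray.remote(num_gpus=1)
def check_gpu_env():
    # Ray allocates a GPU to this task and sets CUDA_VISIBLE_DEVICES in the worker
    # process; ray.get_gpu_ids() reports the device IDs assigned by Ray.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


# ray.init()
# print(ray.get(check_gpu_env.remote()))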

Member Author:

_set_actor_context doesn't handle setting CUDA_VISIBLE_DEVICES, it only sets the data retrieved from daft.context.get_actor_context()
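To make that concrete, a minimal sketch of the kind of plumbing being described (field names are assumptions, not the exact PR code):

from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class DaftActorContext:
    rank: int | None = None
    resource_request: Any | None = None


_DaftActorContext = DaftActorContext()


def get_actor_context() -> DaftActorContext:
    return _DaftActorContext


def _set_actor_context(rank: int, resource_request: Any) -> None:
    # Records per-actor metadata for get_actor_context(); it does not touch
    # CUDA_VISIBLE_DEVICES, which the runner handles separately.
    _DaftActorContext.rank = rank
    _DaftActorContext.resource_request = resource_request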

}

#[must_use]

Contributor:

What does must_use do, and is it fine to remove?

Member Author:

#[must_use] requires that the return value of the function is not discarded by the caller. Result already carries #[must_use], so when I changed these methods to return it, I removed the attribute on them; Clippy yells at me otherwise.


        return all((cpus_okay, gpus_okay, memory_okay))

    def choose_gpus_to_acquire(self, num_gpus: float) -> dict[str, float]:

Contributor:

Docstring? Why is this returning a dict?

Member Author:

Changed PyRunnerResources so it encapsulates more of the abstraction, which should take care of this.

        if chosen_gpu is None:
            raise ValueError(f"Not enough GPU resources to acquire {num_gpus} GPUs from {self}")

        return {chosen_gpu: num_gpus}

Contributor:

Man... I think this works, but it's kinda scary logic that seems quite brittle without us being able to test it.

We can actually likely mock out GPUs to run in our unit tests, if we do some dependency injection (e.g. have a little GPUDeviceProvider class that we can replace with a mocked version that we use at testing-time). Let's talk about that tomorrow.
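One possible shape for that injection point, purely as a sketch (GPUDeviceProvider is the name suggested above; everything else is hypothetical):

class GPUDeviceProvider:
    """Production provider: report the GPUs actually visible to this process."""

    def list_devices(self) -> list[str]:
        # Assumed to delegate to the cuda_visible_devices() helper from this PR's diff.
        from daft.internal.gpu import cuda_visible_devices

        return cuda_visible_devices()


class FakeGPUDeviceProvider(GPUDeviceProvider):
    """Test-time provider: pretend the machine has a fixed set of GPUs."""

    def __init__(self, device_ids: list[str]) -> None:
        self._device_ids = device_ids

    def list_devices(self) -> list[str]:
        return list(self._device_ids)

The PyRunner could then take a provider at construction time, and tests could pass in FakeGPUDeviceProvider(["0", "1"]) to exercise the allocation logic without real hardware.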

Member Author:

Added mocked tests!

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ActorPoolProject] Correctly allocate GPUs for each running Actor in the PyRunner
2 participants