Fix param input order for cudagraph #1138

Open
wants to merge 4 commits into base: main

Conversation


@yifeis-nv yifeis-nv commented Aug 27, 2024

Description

I discovered that when I attempt to use CUDA graphs with pipeline parallelism, the gradients become incorrect, ultimately leading to NaNs. After debugging, I identified a small bug in TE's graph.py.

Fixes # (issue)

Since the make_graphed_callables function in TE builds the backward graph through torch.autograd.grad, the weights are also passed into torch.autograd.grad through its inputs argument. This requires that the order of inputs in torch.autograd.grad match the order used when capturing the forward graph; otherwise the backward pass produces incorrect gradients.
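As a minimal illustration (not TE code) of why the ordering matters: torch.autograd.grad returns gradients positionally, in the same order as its inputs sequence, so any bookkeeping that assumes one flattening order will silently pair gradients with the wrong tensors if the list was built in another order.

import torch

# torch.autograd.grad pairs each returned gradient with `inputs` by position.
w1 = torch.randn(3, requires_grad=True)
w2 = torch.randn(3, requires_grad=True)
out = (2 * w1 + 3 * w2).sum()

grads = torch.autograd.grad(out, inputs=(w1, w2), retain_graph=True)
grads_swapped = torch.autograd.grad(out, inputs=(w2, w1))

# grads[0] is d(out)/d(w1) == 2, while grads_swapped[0] is d(out)/d(w2) == 3.
# The values are correct either way, but code that copies them back into a
# fixed parameter list by position would now update the wrong weights.
print(grads[0][0].item(), grads_swapped[0][0].item())  # 2.0 3.0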

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Modify the input order of the weights inside the CUDA-graph-related module so that it matches the forward capture order.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Collaborator

@timmoon10 timmoon10 left a comment


This fix seems plausible. It seems that make_graphed_callables expects sample_args to be ordered first by layer number, then by microbatch, then by model chunk:

per_callable_fwd_idx = (m_chunk * num_microbatches * num_layers) + (
    fwd_idx[m_chunk] * num_layers + l_no
)
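(For context, a minimal sketch, not TE code, of the flat ordering this index formula implies, with made-up sizes; "microbatch" stands in for fwd_idx[m_chunk]. The layer index varies fastest, then the microbatch, then the model chunk.)

num_model_chunks, num_microbatches, num_layers = 2, 3, 2

flat_order = []
for m_chunk in range(num_model_chunks):
    for microbatch in range(num_microbatches):
        for l_no in range(num_layers):
            idx = m_chunk * num_microbatches * num_layers + microbatch * num_layers + l_no
            flat_order.append(idx)

# Indices come out as 0, 1, 2, ... only with this nesting, so sample_args
# (and the per-callable parameter list) must be flattened in the same order.
assert flat_order == list(range(num_model_chunks * num_microbatches * num_layers))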

However, I see some of our MLPerf wrappers order by microbatch, then layer number, then model chunk: https://gitlab-master.nvidia.com/dl/mlperf/optimized/-/blob/main/large_language_model/pytorch/custom_callbacks.py#L249-L254
Pinging @ksivaman.

Also, can you sign your commit to pass the DCO check?

@timmoon10 added the bug label on Aug 27, 2024
@yifeis-nv
Author

Thanks for the reminder! I have signed my commit.
Based on my understanding of the code, the ordering you referenced from MLPerf does not affect the capture order inside make_graphed_callables: the capture still proceeds first by layer number, then by microbatch, and finally by model chunk, so the issue described above would still occur. I understand that this is why the MLPerf code was modified to isolate the captures of different microbatches (which prevents sharing the memory pool and is likely to increase memory overhead):
https://gitlab-master.nvidia.com/dl/mlperf/optimized/-/blob/main/large_language_model/pytorch/custom_callbacks.py#L216-237
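For background, a minimal sketch of shared versus separate capture pools using the standard torch.cuda.graph API (this is not the MLPerf or TE code, and it assumes a CUDA device is available):

import torch

x = torch.randn(1024, device="cuda")

# Capturing several graphs into one pool lets later captures reuse memory
# reserved by earlier ones; capturing without a shared pool isolates each
# graph's allocations, which typically costs more memory overall.
pool = torch.cuda.graph_pool_handle()
g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
with torch.cuda.graph(g1, pool=pool):
    y1 = x * 2
with torch.cuda.graph(g2, pool=pool):
    y2 = x + 1

g_isolated = torch.cuda.CUDAGraph()
with torch.cuda.graph(g_isolated):  # no pool argument: gets its own pool
    y3 = x - 1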

@@ -171,8 +171,8 @@ def _make_graphed_callables(
             ]
         else:
             per_callable_module_params = []
-            for c in callables:
-                for i in range(num_microbatches):
+            for i in range(num_microbatches):
Collaborator

@vasunvidia vasunvidia Sep 5, 2024


The change doesn't appear to fully solve the bug; for example, this fix works only when the number of model chunks (num_model_chunks) is 1. The correct solution would be:

for m_chunk in range(num_model_chunks):
    for idx in range(num_microbatches):
        for l_no in range(num_layers):
            # Look up the callable for this (model chunk, layer) once so the
            # isinstance check and the parameter lookup use the same object.
            c = callables[m_chunk * num_layers + l_no]
            per_callable_module_params.append(
                tuple(c.parameters()) if isinstance(c, torch.nn.Module) else ()
            )

Can you test if this fix works?

Author


Thanks for your input! This works for my situation.

@timmoon10
Collaborator

/te-ci pytorch

Labels
bug Something isn't working
3 participants