[TL] Adapt TL Hardware-aware Search Space with Roller #207

Merged: 26 commits merged into microsoft:main on Oct 2, 2024

Conversation

LeiWang1999
Contributor

Currently we provide a strategy for users to define a search space for a given TL kernel, but it is still hard and complex to define a precise and efficient search space for dynamic shapes and operators for a given op and backend. For example, SM80 configurations currently have to be hand-written:

    def get_configs_sm80(self):
        # Hand-written candidate tile configurations for SM80 (Ampere).
        num_stages = 2
        configs = [
            {
                'block_M': 128,
                'block_N': 256,
                'block_K': 32,
                'threads': 128
            },
            {
                'block_M': 256,
                'block_N': 128,
                'block_K': 32,
                'threads': 128
            },
            {
                'block_M': 128,
                'block_N': 128,
                'block_K': 32,
                'threads': 128
            },
        ]
        configs = [{**c, 'num_stages': num_stages} for c in configs]
        return configs

This pull request links this search space up with our Roller search space, so that candidate configurations are derived from hardware-aware hints instead of being hand-written.
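
A minimal sketch of the idea, assuming a hypothetical dict-shaped roller hint (the actual roller policy objects in this repo differ, but the mapping from hardware-aware hints to TL configs is the point):

# Hypothetical sketch: derive TL tuning configs from roller hints instead of
# hand-written lists. Each hint is assumed to look like
# {"block": (128, 256), "warp": (64, 64), "rstep": 32, "pipeline_stage": 2}.
def configs_from_roller(hints, warp_size=32):
    configs = []
    for hint in hints:
        block_M, block_N = hint["block"]
        warp_M, warp_N = hint["warp"]
        configs.append({
            "block_M": block_M,
            "block_N": block_N,
            "block_K": hint["rstep"],
            # one warp per (warp_M x warp_N) sub-tile of the block
            "threads": (block_M // warp_M) * (block_N // warp_N) * warp_size,
            "num_stages": hint.get("pipeline_stage", 2),
        })
    return configs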

However, the block-level TL cannot fully utilize the schedule information; for example, TL only provides three dedicated warp scheduling policies:

class GemmWarpPolicy:
    Square = 0
    FullRow = 1
    FullCol = 2
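
For illustration, a hedged sketch of how such a policy could map warps onto a GEMM block tile, given the GemmWarpPolicy class above (the warp count and partitioning here are assumptions for the example, not TL's exact logic):

def warp_layout(policy, num_warps=4):
    # Returns (warps_along_M, warps_along_N) for a GEMM block tile.
    if policy == GemmWarpPolicy.FullRow:
        return (num_warps, 1)   # all warps split the M (row) dimension
    if policy == GemmWarpPolicy.FullCol:
        return (1, num_warps)   # all warps split the N (column) dimension
    return (2, num_warps // 2)  # Square: roughly balanced split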

Refactor select_scheduler in dense/__init__.py

The select_scheduler function in the dense/__init__.py module has been refactored to use a fine-grained interface. This change provides more flexibility and enables the implementation of high-performance kernels.

Update MatmulScheduler class in matmul_tensorcore.py

The MatmulScheduler class in the matmul_tensorcore.py module has been updated to calculate the number of threads from the block size and warp size, keeping the launch configuration consistent with NVIDIA's 32-thread warps.
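
A minimal sketch of that computation, assuming a 32-thread warp and illustrative parameter names (warp_M/warp_N are not necessarily the scheduler's exact attributes):

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def infer_threads(block_M, block_N, warp_M, warp_N, warp_size=WARP_SIZE):
    # One warp per (warp_M x warp_N) tile inside the (block_M x block_N) block.
    warps_per_block = (block_M // warp_M) * (block_N // warp_N)
    return warps_per_block * warp_size

# e.g. a 128x256 block with 64x64 warp tiles -> 8 warps -> 256 threads
assert infer_threads(128, 256, 64, 64) == 256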

Improve test_general_matmul_tilelang_kernel.py

The test_general_matmul_tilelang_kernel.py module has been improved to include additional test cases and assertions for correctness.
@LeiWang1999
Copy link
Contributor Author

Bug fix:

The InjectThreadSync pass must be applied after the MergeSharedMemoryAllocation pass, because MergeSharedMemoryAllocation modifies buffer regions and therefore alters the liveness domain of the buffers.
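
A minimal sketch of the required ordering (pass names written as strings purely for illustration; the actual lowering pipeline wires up the real pass objects):

pass_order = [
    "MergeSharedMemoryAllocation",  # rewrites shared-buffer regions, changing buffer liveness
    "InjectThreadSync",             # must run afterwards so syncs match the final liveness
]
assert pass_order.index("MergeSharedMemoryAllocation") < pass_order.index("InjectThreadSync")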

LeiWang1999 merged commit 33d2fd6 into microsoft:main on Oct 2, 2024
5 of 6 checks passed