
[Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding #124

Merged: 133 commits, Aug 4, 2024

Conversation

LeiWang1999 (Contributor)

This pull request improves the handling of decoding and memory prefetching in the GPU intrinsics and matmul dequantization logic. The key changes add support for offset handling in decoding functions, update the pass context for TVM transformations, and enhance the scheduling logic for shared-memory prefetching.

Decoding Enhancements:

  • Added new decoding functions decode_i4_to_f16_scale_offset, decode_i4s_to_f16_scale_offset, and decode_i4u_to_f16_scale_offset to handle an offset during decoding (bitblas/gpu/intrin/lop3.py).
  • Updated get_fast_decode_intrin to append _offset to function names when storage_scope is "warp" and scaling is enabled (bitblas/gpu/intrin/lop3.py).
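The suffix-selection rule described above can be sketched as follows. This is a hypothetical simplification, not the real `get_fast_decode_intrin` (which takes many more parameters); only the naming convention is taken from the PR description.

```python
# Hypothetical sketch of the decode-function naming rule; the real
# get_fast_decode_intrin in bitblas/gpu/intrin/lop3.py takes more parameters.
def select_decode_func_name(source_bit, with_scaling, storage_scope):
    """Build a LOP3 decode function name, appending `_offset` when the
    intrinsic targets warp-scope storage with scaling enabled."""
    name = f"decode_i{source_bit}s_to_f16"
    if with_scaling:
        name += "_scale"
        if storage_scope == "warp":
            name += "_offset"  # offset-aware variant added in this PR
    return name
```

For example, a 4-bit signed decode with scaling at warp scope resolves to `decode_i4s_to_f16_scale_offset`, while the same configuration at shared scope keeps the plain `_scale` name.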

TVM Transformation Context:

  • Modified the tvm_callback_cuda_postproc function to include "tir.disable_cse_tir": True in the TVM pass context configuration (bitblas/base/utils.py).
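A minimal sketch of the configuration change, shown as a plain dict builder rather than the actual bitblas callback. The `tir.merge_static_smem` entry is an assumption for illustration; only `tir.disable_cse_tir` is taken from the PR. Disabling TIR common-subexpression elimination keeps duplicated decode expressions intact so they can be pattern-matched by the LOP3 intrinsics.

```python
# Sketch only: the dict a callback like tvm_callback_cuda_postproc might pass
# to tvm.transform.PassContext(config=...). Keys other than
# "tir.disable_cse_tir" are illustrative assumptions.
def build_pass_config(extra=None):
    config = {
        "tir.disable_cse_tir": True,  # added by this PR
    }
    if extra:
        config.update(extra)  # allow caller-supplied overrides
    return config
```

In TVM this dict would typically be used as `with tvm.transform.PassContext(config=build_pass_config()): ...`.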

Scheduling and Prefetching:

  • Enhanced shared memory prefetching logic to handle different reduction depths and weight transform kinds in sch_shared_memory_prefetch_with_config (bitblas/gpu/matmul_mma_dequantize.py).
  • Updated the get_param_indices and fetch_to_shared functions to support reduction across threads (bitblas/gpu/matmul_mma_dequantize.py).

Intrinsic Definitions:

  • Added new intrinsic definitions for warp scope in intrin_definitions to support various configurations (bitblas/gpu/intrin/lop3.py).
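One way such a table of intrinsic variants can be organized is a dict keyed by configuration. This is a hedged sketch of the general pattern; the actual schema of `intrin_definitions` in bitblas/gpu/intrin/lop3.py is not reproduced here, and the key fields are assumptions.

```python
# Sketch: registering decode-intrinsic variants keyed by configuration,
# loosely modeling how new warp-scope entries could be added. The key tuple
# (source_bit, storage_scope, with_scaling) is illustrative, not bitblas's
# actual schema.
intrin_definitions = {}

def register_intrin(source_bit, storage_scope, with_scaling, func_name):
    """Map a (bit-width, scope, scaling) configuration to a decode function."""
    key = (source_bit, storage_scope, with_scaling)
    intrin_definitions[key] = func_name
    return key

# New warp-scope variants alongside an existing shared-scope entry.
register_intrin(4, "warp", True, "decode_i4s_to_f16_scale_offset")
register_intrin(4, "shared", True, "decode_i4s_to_f16_scale")
```

Looking up `(4, "warp", True)` then yields the offset-aware warp variant.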

These changes collectively enhance the functionality and performance of the GPU intrinsics and matmul dequantization processes.

@LeiWang1999 (Contributor, Author)

Also fixed a CodeQL warning; see #121.

@LeiWang1999 (Contributor, Author)

Added a GPTQ repack test to verify integration correctness.

@LeiWang1999 LeiWang1999 merged commit 164d1ab into microsoft:main Aug 4, 2024
6 checks passed