Tiny upstream pr (#1094)

* Building xformers using ck-tiled as default

* ensure ck_decoder does not dispatch

* Add disable_on_rocm on some test scripts

* Update to test_mem_eff_attention.py

* apply isort

* apply black

* fix flake8 suggestions

* add license headers and reapply black

* Tiny update to rocm_ci.yml

* Add conditional compiling for cuda-depending codes in ROCM

* Update to benchmark scripts

* Rename the one script file

* Revert "Add conditional compiling for cuda-depending codes in ROCM"

This reverts commit 12fb41c.

* Update to scripts

* Change and add readme for tests and benchmarks

* Remove the stuff for supporting old ck

* Remove old composable_kernel from submodule list

* Remove folder third_party/composable_kernel

* Rename the folder

* Remove unused script file

* apply black

* pacify mypy

* fix clang-format

* reapply black

* fix lints

* make test_splitk_reference run on cpu

* add ck modules to docs

* try fixing nvidia build by re-including sparse24 cpp folder into extension sources

* update cutlass to upstream commit

* update flash-attention to upstream commit

* simplify setup.py

* remove duplicate run_batched_infer_causalmask_attnbias_dispatched<f16, true, true, 128>

* add hip version and pytorch hip arch list to xformers build info

* fix build

* patch around the unhappy path in get_hip_version

* skip test_grad_checkpointing for triton_splitk since it doesn't have bwop

* re-enable test_mqa_forward since ck tiled is the current implementation

* make skip test_wrong_alignment more generic

* reapply black

* simplify test_decoder

* put python version check inside triton_splitk op
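
A minimal sketch of what that gate might look like (helper name hypothetical; the real check lives inside the triton_splitk op):

    import sys

    def _is_supported_python() -> bool:
        # Hypothetical helper: the op gates itself on the interpreter version,
        # so the tests no longer need their own python3.9 checks.
        return sys.version_info >= (3, 9)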

* fix logic

* cleanup python3.9 checks in tests

* cleanup test_attentions

* cleanup test_checkpoint as test running on cpu does not depend on gpu platform

* fix lints

* try fixing win build by conditional import of triton in triton op
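
A hedged sketch of the conditional-import pattern (module-level names are illustrative):

    try:
        import triton  # noqa: F401
        import triton.language as tl  # noqa: F401

        _HAS_TRITON = True
    except ImportError:
        # Platforms without a triton wheel (e.g. Windows) must still be able
        # to import the module; the op then reports itself as unavailable.
        triton = None
        tl = None
        _HAS_TRITON = False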

* re-enable test_triton_layernorm as it passes

* re-enable test_triton_blocksparse as it passes

* cleanup test_sparse_tensors

* cleanup test_custom_ops

* reapply black

* cleanup test_core_attention

* benchmark ck ops on rocm only

* fix mypy

* fix lint: black

* fix lints: mypy

* split-k decoder: move all tunable parameters to the top of cpp file

* apply clang-format

* Rename HDim/headdim to MaxK/maxk

* Move some header files to ck examples for later reuse

* Replace the qs_ks_vs pipeline with the qr_ks_vs pipeline when HeadDim is 256, for better performance

* rm test_ck_7

* dump kernel resource usage to compilation logs similar to nv

* Add the C++ extension for the latest change of the ck_tile/dev fwd kernel (added dropout)

* Add the c++ extension to use ck_tile/dev/ fmha bwd kernel

* Update to add dropout for fmha backward

* Update in attention.cpp to align efficient_attention_backward_ck interface parameters

* Enable BwdOp in ck.py

* Support grad_out having different strides than out
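
Illustrative sketch only (shapes and the stride hand-off are assumptions): the backward op now reads grad_out's own strides instead of assuming they equal out's.

    import torch

    out = torch.randn(2, 16, 4, 8)                                # [B, M, H, K]
    grad_out = out.transpose(1, 2).contiguous().transpose(1, 2)   # same shape, different strides
    assert grad_out.shape == out.shape
    assert grad_out.stride() != out.stride()
    grad_out_strides = grad_out.stride()  # forwarded to the CK backward kernel separately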

* Force seqstart_q/seqstart_k to be in device memory in ck.py

* Remove duplicated codes in ck_tiled_fmha_grouped_forward.h/infer.h

* Use optimized async pipeline where 8x headdim length is assumed

* Fix in batched_infer

* Update to track ck_tile/opt_padding_fa_train_xformers branch

* Update rocm_ci.yml

configuring the self-hosted runner

* Update to use the newer FmhaFwdEpilogue

* Update rocm_ci.yml

add option to manually trigger workflow

* Update rocm_ci.yml

remove condition which skips ci unless github event contains string 'rocm'

* copy rocm_ci workflow from main branch

* Update rocm_ci.yml

Bump upload-artifact version

* Update to use the newer FmhaFwdEpilogue for grouped infer/forward

* Temporarily disable use of the QRKSVSAsync() pipeline

* Update rocm_ci.yml

add a daily run

* Implement the ck_rand_uniform interface for generating random number tensor

* Add dropout to the infer path (needed by xformers test_dropout)

* Update to support test_dropout and test_dropout_backward tests

* Update the padding method in batched_backward.h

* Update the OGradDotO kernel padding method

* Change the backward padding checking condition

* Add batch_stride_lse/d parameters to adapt grouped mode forward/backward to [num_batches, H, MaxSeqlenQ] layout
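
For illustration only (sizes are made up): with a contiguous [num_batches, H, MaxSeqlenQ] LSE buffer, the per-batch stride passed as batch_stride_lse is simply H * MaxSeqlenQ.

    # Hypothetical sizes, just to show the stride arithmetic.
    num_batches, H, max_seqlen_q = 4, 8, 512
    batch_stride_lse = H * max_seqlen_q  # distance between consecutive batches
    # element (b, h, m) of lse lives at b * batch_stride_lse + h * max_seqlen_q + m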

* Fill the grad_bias in advance

* Add support for kHasBiasGrad as instance template

* Remove using hdim_stride_do in fmha backward

* Force kPadSeqLenQ/kPadSeqLenK to be true in batched-backward to save compiling time

* Fix missing passing of {philox_seed, philox_offset} in inference path

* Use SimplifiedGenericAttentionMask to replace GenericAttentionMask

* Shorten the instance file names

* Rename the template parameters

* Simplify the names of the dispatch class and interfaces

* Changes to reuse the kernel files under ck_tile examples/91_tile_program/fmha folder

* Update test_mem_eff_attention.py for test_dropout/test_dropout_backward/test_backward on rocm

* Tiny change to the philox_cuda_state input setting

* Allocate logsumexp to ensure aligned access by each thread-group

* Add checking for query/key headdim size in attention_backward_generic

* Use ck_tile/opt_padding_fa_train_pr2 and synchronize the backward code with its changes

* Enable using async pipeline in the batched inference path for performance

* Re-organize cpp instances for calling fmha infer kernel

* Re-organize cpp instances for calling fmha forward kernel

* Re-organize cpp instances for calling fmha backward kernel

* Point the composable_kernel_tiled submodule to the ck_tile/opt_padding_fa_train branch

* Update to synchronize with the latest commits in ck_tile/opt_padding_fa_train

* update submodule to public

* Update to the criteria for padding seqlen_k in batched infer/forward

* Keep latest track of ck-tile commits

* Tiny fix to the decoder include

* Point ck-tiled to the ck_tile/opt_padding branch

* Enable some attn_bias types which were previously disabled by old-ck in ck.py

* Add script generate_instances.py which helps to generate instances

* Simplify logic for seqstart_q/k

ROCm@566d26f has put the seqstart_k/q on device. So simplify the logic here.

Upstream xformers doesn't have this optimization and copies seqstart_q/k on every iteration. We'd like this change to land here and then be merged upstream.
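
A minimal sketch of the idea, assuming a helper along these lines in ck.py (name hypothetical):

    import torch

    def _to_device_once(seqstart: torch.Tensor, device: torch.device) -> torch.Tensor:
        # Move seqstart_q / seqstart_k to the GPU a single time, instead of
        # re-copying the host tensor on every kernel launch.
        if seqstart.device != device:
            seqstart = seqstart.to(device=device, non_blocking=True)
        return seqstart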

* Add Async pipeline to grouped mode inference path

* Use explicit true for the kPadSeqLenQ/kPadHeadDimQ/kPadHeadDimV templates for the Async pipeline

* Synchronize to the update of composable_kernel_tiled for better performance

* Update rocm_ci.yml - clean up dangling images after ci run

* Avoid unused-const-variable warning

Our compiler errors on the unused-const-variable warning, so just fix it.

* Tiny change in the BlockTile/Shape setting overrides

* try to align fmha C++ extension to the ck_tile in ck develop branch

* Synchronize composable_kernel_tiled to latest ck develop

* Use FmhaFwdTilePartitioner_HBS only with seqlen_k padded cases

* Tiny fix/change to make test_forward/test_backward/test_dropout/test_dropout_backward_ck pass

* Fix compiling issue with regard to Invoker definitions in forward_decoder/forward_decoder_split operators

* Keep using -Woverloaded-virtual

* Fix clang-format for headers and cpp files

* Fix format in python scripts

* Add noqa: C801 for generate_instances.py

* Align dispatch_bw with main branch

* Align ops/fmha/common.py with main branch

* Synchronize third_party/composable_kernel_tiled to the latest ck_tile commits for better performance

* Relax the atol for test_forward and test_dropout due to the use of the packed fp16_2_fp32 conversion in ck_tile

* Generate html report for tests run with rocm_ci.yml

* archive test results when tests have failed

* Always clean up dangling docker images in rocm_ci

* Bump python to 3.11 in rocm_ci.yml

* Disable flash attention tests in rocm_ci.yml

The op is currently broken; TBD: either make the op work or disable it on ROCm.

* Try to fix rocm_ci.yml

Init must be called before activation

* try to fix rocm_ci.yml flow by overriding PATH

* Fix setup.py path in rocm_ci.yml

* cd to xformers dir before running install in rocm_ci.yml

* Use pip to install xformers in rocm_ci.yml

* Possibly fix python version resolution in rocm_ci.yml

* Set the correct path for pytest in rocm_ci.yml

* remove test_reference_splitk as it was moved to a different file during the first upstream

remove test_mqa_forward from develop, as the test fails in develop and doesn't run upstream

remove reference attention splitk from the test file; it exists in test_splitk_reference

sync test_mem_eff_attention with upstream

* make sure ck operators have a name to be visible in the dispatcher

* fix sm version checks to happen only on CUDA, not ROCm

* (2/n) fix sm version checks to happen only on CUDA, not ROCm
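
A hedged sketch of the guard (helper names hypothetical): SM capability numbers only exist on CUDA devices, so the check is skipped on ROCm builds.

    import torch

    def _is_rocm_build() -> bool:
        # PyTorch ROCm builds expose torch.version.hip; CUDA builds leave it None.
        return torch.version.hip is not None

    def _sm_at_least(device: torch.device, min_sm: tuple) -> bool:
        if _is_rocm_build():
            return True  # no SM versions on ROCm; device capability is checked elsewhere
        return torch.cuda.get_device_capability(device) >= min_sm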

* Remove _check_large_shapes checking in fmha/ck.py (#1067)

* make xformers install editable to fix cpp extensions detection

* Update to use the improved fmha-bwd (compilation passes)

* Update to get 80% of the test_backward and test_dropout_backward_ck cases passed

* Replace the use of ConvertGradQ with torch tensor type conversion

* Change the tile settings for MaxK=32

* Fix padding setting bug in grouped_backward

* Change -DCK_FMHA_FWD_FAST_EXP2=1 to -DCK_TILE_FMHA_FWD_FAST_EXP2=1

* Point the composable_kernel_tiled submodule to ck_tile/fa_bwd_opt branch

* Disable flshattF and flshattB on ROCM

* Add -mllvm and -enable-post-misched=0 compiling options for ROCM on setup.py

* Disable flshattF and flshattB on ROCM

* Update to support separate grad_q_f32_strides due to the API change in the fmha_bwd_kernel

* Use old method for setting BlockDropout due to the revert in fmha_fwd_kernel

* Tiny fix in grouped_backward

* Use packed tensor allocation for grad_q_f32

* Update to the ConvertGradQ kernel calling

* Tiny update

* Fix the parameter location in grouped_backward

* Adjust headdim128 tile shapes for better performance

* Update backward kernel calling due to adding of nhead_stride_dk/nhead_stride_dv parameters

* Synchronize with CK to use separate pipelines for the kPadHeadDim true and false situations

* Use convertDQ kernel

* Update to use unpadded lse layout

* Add explicit headdim256 instances for fmha backward

* Add the missing headdim256 instance references

* Change generate.py and re-generate the instance files using it

* Change generate.py to generate instance references and use the generated reference headers

* Relax the RTOL of ckFwOp from 4e-4 to 3e-3 due to one big result case

* Change to use .h rather than .hpp as suffix for generated header files

* Fix in .gitignore

* Update to bwd setting to use only IGLP pipeline

* Synchronize to latest ck_tile fix and align the headdim64 tile shape setting

* Reformat the generated instances cpp files

* Fix to the backward Trait

* Set occupancy to -1 to avoid the compiling warning

* Revert "Set occupancy to -1 to avoid the compiling warning"

This reverts commit fa6d8b3.

* Add environment variable and compiler definition to control the generation of headdim256 instances

* Add --ignore-hd256 argument to generate_instances.py and make some updates to this script

* Add environment variable ENABLE_HIP_FMHA_RTN_BF16_CONVERT to enable using rtn bf16 conversion
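
The setup.py hunk at the end of this page shows the wiring; in short (sketch only), the environment variable toggles a ck_tile compile definition:

    import os

    cc_flag = []
    if os.getenv("ENABLE_HIP_FMHA_RTN_BF16_CONVERT", "0") == "1":
        # Select the round-to-nearest float -> bf16 conversion in ck_tile
        # (value as in the final setup.py diff below).
        cc_flag += ["-DCK_TILE_FLOAT_TO_BFLOAT16_DEFAULT=3"]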

* Remove commented lines in test_mem_eff_attention.py

* Synchronize to latest ck_tile commit

* apply black

* apply flake8

* fix mypy

* revert disable flash operator on rocm

* Synchronize to ck_tile latest commit again

* Re-position the composable_kernel submodule to the develop branch

* Avoid the Async pipeline when kHasBias is true

* clang-format for two files

* Change allocation of grouped mode lse from [H, M] to [1, H, M] to match the xformers scripts

* Change generate_instances.py so that the script can be called from a flexible location

* Add manual for generate_instances.py (.md)

* Modification in GENERATE_INSTANCES.md

* Fix in GENERATE_INSTANCES.md

* Update GENERATE_INSTANCES.md

* Clean up commented code

* Revert "Change allocation of grouped mode lse from [H, M] to [1, H, M] to match the xformers scripts"

This reverts commit 7a91589.

* Synchronize to latest ck develop for using the latest RTN bf16 convert

* Add c++ extension compiling options for better performance on ROCM 6.2

* Use the same rocm_ci.yml as upstream

* Use the same ck.py as upstream

* Reformat setup.py

---------

Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: carlushuang <[email protected]>
Co-authored-by: Xiaodong Wang <[email protected]>
4 people authored Sep 10, 2024
1 parent e984bf4 commit 0004c67
Showing 2 changed files with 8 additions and 2 deletions.
8 changes: 7 additions & 1 deletion setup.py
@@ -453,7 +453,7 @@ def get_extensions():
         cc_flag = ["-DBUILD_PYTHON_PACKAGE"]
         use_rtn_bf16_convert = os.getenv("ENABLE_HIP_FMHA_RTN_BF16_CONVERT", "0")
         if use_rtn_bf16_convert == "1":
-            cc_flag += ["-DCK_TILE_FLOAT_TO_BFLOAT16_DEFAULT=0"]
+            cc_flag += ["-DCK_TILE_FLOAT_TO_BFLOAT16_DEFAULT=3"]
 
         arch_list = os.getenv("HIP_ARCHITECTURES", "native").split()
 
@@ -471,6 +471,12 @@ def get_extensions():
                 "-Woverloaded-virtual",
                 "-mllvm",
                 "-enable-post-misched=0",
+                "-mllvm",
+                "-amdgpu-early-inline-all=true",
+                "-mllvm",
+                "-amdgpu-function-calls=false",
+                "-mllvm",
+                "-greedy-reverse-local-assignment=1",
             ]
             + generator_flag
             + cc_flag,
2 changes: 1 addition & 1 deletion third_party/composable_kernel_tiled
