
# Changelog

## 0.1.5 (2024-08-13)

### Bugfix

* Fix PagedPrefill python api and some typos (#441) (3fff008)
* fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)

### Features

* decouple float and int workspace buffer (#442) (a7ee566)

### Performance Improvements

* faster fp8->fp16 dequantization for pre sm_90 arch (#439) (c93f647)
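The dequantization entry above is about turning packed fp8 values back into fp16 quickly. As a point of reference, here is a scalar NumPy sketch of decoding one fp8 byte, assuming the common e4m3 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7); the actual kernel operates on packed registers with bit tricks, and the format assumption is mine, not stated in the changelog:

```python
import numpy as np

def fp8_e4m3_to_fp16(byte):
    """Decode one fp8 byte, assuming the e4m3 layout (bias 7).

    Illustrative scalar reference only; NaN encodings are ignored.
    """
    s = (byte >> 7) & 1       # sign bit
    e = (byte >> 3) & 0xF     # 4-bit biased exponent
    m = byte & 0x7            # 3-bit mantissa
    if e == 0:                # subnormal: no implicit leading 1
        val = (m / 8.0) * 2.0 ** (1 - 7)
    else:                     # normal: implicit leading 1
        val = (1.0 + m / 8.0) * 2.0 ** (e - 7)
    return np.float16(-val if s else val)
```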

### Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.

## 0.1.4 (2024-08-09)

### Bug Fixes

* fix dispatch fp16 type when enable fp8 (#430) (daa5566)
* improve numerical stability of sampling kernels (#429) (898d8ea)

### Other improvements

* break up _kernels into multiple modules (#428) (8e482d9)

### Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @esmeetu, @LiuXiaoxuanPKU, @peng1999, @xslingcn, @Yard1, @zhyncs.

## 0.1.3 (2024-07-31)

### Bugfix

* bugfix: Fix cudagraph mode of BatchPrefillWithRaggedKVCacheWrapper (#412) (9907bc)
* fix cu118 cub usage for sampling kernels (#410) (58d359)

### Misc

* enhance allocator error info and add shape check for prefill begin forward functions (#413) (5e36c5)

## 0.1.2 (2024-07-29)

## 0.1.1 (2024-07-20)

### Bugfix

* fix the invalid kernel configuration for architectures with small shared memory size (#385) (cdac57)

### Features

* expose decoupled kv-cache to pytorch api (#383) (457a0ae)

## 0.1.0 (2024-07-17)

### Features

* Add mask to merge_state_in_place (#372) (e14fa81)
* expose pytorch api for block sparse attention (#375) (4bba6fa)
* Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)
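The fused top-k & top-p kernel (#374) applies both truncation filters before drawing a token. A minimal NumPy sketch of the (unfused) semantics — the function name and structure here are illustrative, not flashinfer's API:

```python
import numpy as np

def top_k_top_p_sample(probs, top_k, top_p, rng):
    """Unfused reference: top-k filter, then top-p (nucleus) filter, then sample.

    probs: 1-D probability vector. The fused GPU kernel computes the same
    distribution without materializing a full sort per token.
    """
    order = np.argsort(probs)[::-1]            # token ids, descending probability
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:top_k]] = True                 # top-k: keep k most likely tokens
    cumsum = np.cumsum(probs[order])
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    # (always keep the most likely token)
    within_p = np.concatenate(([True], cumsum[:-1] < top_p))
    keep[order[~within_p]] = False
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()                 # renormalize surviving mass
    return rng.choice(len(probs), p=filtered)
```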

## 0.0.9 (2024-07-12)

### Bugfix

* fix the decode kernel segfault in cudagraph mode (#368) (c69cfa)
* fix decode kernels output for empty kv cache (#363) (ac72b1)
* check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)

### Acknowledgement

We thank @Yard1, @Ying1123 and @zhyncs for their contributions.

## 0.0.8 (2024-07-03)

### Bugfix

* fix prefill/append kernel behavior for empty kv-cache (#353) (7adc8c)
* fix decode attention kernel with logits cap (#350) (f5f7a2)

## 0.0.7 (2024-06-28)

### Breaking Changes

* batch_decode_with_padded_kv_cache was removed; we encourage users to use BatchDecodeWithPagedKVCacheWrapper instead. (#343)

### Bugfix

* fix the forward_return_lse function in BatchPrefillWithRaggedKVCache class (#337)
* fix the scheduler behavior of large page size (#333)

## 0.0.6 (2024-06-21)

### Bugfix

Fix some bugs in v0.0.5 that might lead to crashes and unstable performance.

### Performance Improvements

* use 1x4 warp layout for small query length (#322) (4e89b4d)

## 0.0.5 (2024-06-20)

### Acknowledgement

We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.

### Refactor

* support any GQA group size for tensor-cores kernels (#301) (c111ca)
* support any page size for tensor-cores kernels (#306) (82fd8c)

### Features

* add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
* add group gemm operators (#282) (e08ba42)
* initial support of distributed operators (#289) (03553da)
* initial support of logits hook (#298) (ab1e2ad)
* Separate Q and KV dtypes for decode (#286) (5602659)
* support cuda graph for batched multi-query (prefill/append) attention (#275) (83ceb67)
* support cuda graph for batched multi-query (prefill/append) attention (#277) (24cc583)
* support custom attention mask in prefill/append attention kernels (#266) (7304282)
* fused speculative sampling kernels (#259) (cea2bb)
* expose sampling APIs in pytorch (#238) (092902)
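The fused speculative sampling kernels (#259) batch the standard rejection-sampling verification loop from speculative decoding. A plain-Python sketch of that loop, for orientation only — names and signatures here are illustrative, not flashinfer's API:

```python
import numpy as np

def speculative_verify(draft_tokens, draft_probs, target_probs, rng):
    """Reference rejection-sampling verification (unfused).

    draft_probs[i] / target_probs[i]: full distributions at step i; the
    draft model is assumed to assign nonzero probability to its own token.
    Each drafted token is accepted with prob min(1, p/q); on the first
    rejection we resample from the residual max(p - q, 0) and stop.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)                 # accept the drafted token
        else:
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()           # normalized residual distribution
            accepted.append(rng.choice(len(residual), p=residual))
            break                                # later draft tokens are discarded
    return accepted
```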


## 0.0.4 (2024-05-01)

### Features

* pytorch 2.3 support
* gpu sampling kernels (top-p, top-k)
* more gqa group sizes
* add mma instructions for fp8 (#179) (d305798)
* mma rowsum for fp8 (#180) (5af935c)
* support any num_heads for get_alibi_slope (#200) (b217a6f)

### Bug Fixes

* fix python package dispatch error message (#182) (8eed01c)

## 0.0.3 (2024-03-08)

### Misc

* add stream argument in BeginForwardFunction of TVMWrapper (#164) (fabfcb5)

### Performance Improvements

* multiple q by sm_scale in decode kernels (#144) (660c559)
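The #144 optimization rests on a simple identity: scaling q once up front gives the same attention weights as scaling every q·k score, so the multiply moves out of the inner loop. A quick NumPy check of that identity (illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)          # one query vector
k = rng.standard_normal((4, d))     # four key vectors
sm_scale = 1.0 / np.sqrt(d)

scores_scaled_late = softmax((k @ q) * sm_scale)   # scale each dot product
scores_scaled_early = softmax(k @ (q * sm_scale))  # scale q once, before the loop
assert np.allclose(scores_scaled_late, scores_scaled_early)
```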

## 0.0.2 (2024-02-17)

### Bug Fixes