We will support different attention approaches. Candle provides us with a broad variety of existing implementations.
Type | Status / Description |
---|---|
SelfAttention | Integrated - operates on a single input sequence. Depending on the implementation it is also called dot-product attention or global attention (see the sketch below the table). |
CrossAttention (aka Co-Attention) | Not integrated yet - operates on multiple input sequences. |
CausalSelfAttention | Not integrated yet - operates on parts of one or multiple input sequences, e.g., only the tokens before the present one. Depending on the implementation it is also called local attention. |
MultiHeadAttention | Not integrated yet - answers multiple concerns/questions in parallel, each with its own head. |
MultiQueryAttention | Not integrated yet - answers multiple concerns/questions while sharing a single set of keys and values across them. |
GroupQueryAttention | Not integrated yet - builds logical groups among the questions, with each group sharing its keys and values. |
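
The core computation behind SelfAttention is `softmax(QK^T / sqrt(d)) V`; masking out future positions before the softmax turns it into CausalSelfAttention. The following is a minimal sketch assuming `candle-core` and `candle-nn` as dependencies; the function `self_attention` and its `causal` flag are illustrative names, not Candle API.

```rust
use candle_core::{Device, Result, Tensor};
use candle_nn::ops::softmax_last_dim;

/// softmax(Q K^T / sqrt(d)) V over a single (seq_len, dim) sequence.
/// With `causal = true`, token i only attends to tokens j <= i.
fn self_attention(q: &Tensor, k: &Tensor, v: &Tensor, causal: bool) -> Result<Tensor> {
    let (seq_len, dim) = q.dims2()?;
    // Attention scores (seq_len, seq_len), scaled to keep the softmax stable.
    let mut scores = (q.matmul(&k.t()?)? / (dim as f64).sqrt())?;
    if causal {
        // Mask future positions (j > i) with -inf before the softmax.
        let mask: Vec<u8> = (0..seq_len)
            .flat_map(|i| (0..seq_len).map(move |j| u8::from(j > i)))
            .collect();
        let mask = Tensor::from_slice(&mask, (seq_len, seq_len), q.device())?;
        let neg_inf = Tensor::new(f32::NEG_INFINITY, q.device())?
            .broadcast_as((seq_len, seq_len))?;
        scores = mask.where_cond(&neg_inf, &scores)?;
    }
    // Row-wise attention weights, then the weighted sum of the values.
    softmax_last_dim(&scores)?.matmul(v)
}

fn main() -> Result<()> {
    let device = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (4, 8), &device)?;
    // A real layer would apply learned Q/K/V projections first; we reuse `x`.
    let global = self_attention(&x, &x, &x, false)?;
    let causal = self_attention(&x, &x, &x, true)?;
    println!("{:?} {:?}", global.shape(), causal.shape());
    Ok(())
}
```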
Terms:
- Heads: number of parallel questions on a given stream.
- Contexts: number of parallel streams.
- Temporal: the time dimension (sequence positions).
- Spatial: the feature/space dimensionality.
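
To make these terms concrete, here is an assumed tensor layout that matches how most multi-head attention implementations arrange their inputs (the shape names are ours, not a Candle convention):

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // (contexts, heads, temporal, spatial) == (batch, num_heads, seq_len, head_dim)
    let (contexts, heads, temporal, spatial) = (2usize, 4, 16, 8);
    let q = Tensor::randn(0f32, 1f32, (contexts, heads, temporal, spatial), &Device::Cpu)?;
    assert_eq!(q.dims(), &[contexts, heads, temporal, spatial]);
    Ok(())
}
```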
Note: All attention variants should be available for multiple dimensions. This includes the spatial transformer, which acts in >= 2D space (= spatial) as required for CNN applications.
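
As a small illustration of the spatial case: a CNN feature map can be flattened into a token sequence so that the same sequence attention applies spatially. The shapes below are assumptions for the sketch:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let (h, w, c) = (8usize, 8, 16);
    // A CNN-style feature map: height x width x channels.
    let feature_map = Tensor::randn(0f32, 1f32, (h, w, c), &Device::Cpu)?;
    // Flatten the 2D spatial grid into h*w tokens of dimension c.
    let tokens = feature_map.reshape((h * w, c))?;
    assert_eq!(tokens.dims(), &[h * w, c]);
    Ok(())
}
```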
More complex models are mapped as their own layers: