Section 5 - Example Vector Designs

The programming examples are sample designs that illustrate many of the unique features of AI Engines and the NPU array in Ryzen™ AI.

Simplest

Passthrough

The passthrough example is the simplest "getting started" example. It copies 4096 bytes from the input to the output using vectorized loads and stores. The design also shows a typical project organization that is easy to reproduce in other examples. Only four files matter here:

  1. aie2.py: The AIE structural design, which includes the shim tile connected to external memory and a single AIE core that performs the copy. It also shows a simple use of the ObjectFIFOs described in Section 2.
  2. passthrough.cc: A C++ file that performs the vectorized copy operation.
  3. test.cpp or test.py: A C++ or Python main application that exercises the design and compares the result against a CPU reference.
  4. Makefile: A Makefile that documents (and implements) the build process for the various artifacts.
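
The host-side check described in item 3 can be pictured with a minimal sketch. This is not the repository's test.py: `run_passthrough` is a hypothetical stand-in for dispatching the design to the NPU, and here it simply copies on the CPU so the sketch runs without hardware. Only the 4096-byte size comes from the description above.

```python
import random

NUM_BYTES = 4096  # transfer size stated for the passthrough example


def run_passthrough(data: bytes) -> bytes:
    """Hypothetical stand-in for running the design on the NPU.

    A real host program would hand the buffer to the device and read the
    result back; this placeholder just copies on the CPU.
    """
    return bytes(data)


def check_passthrough(seed: int = 0) -> int:
    """Return the number of mismatched bytes between output and reference."""
    rng = random.Random(seed)
    source = bytes(rng.randrange(256) for _ in range(NUM_BYTES))
    reference = bytes(source)  # CPU reference for a passthrough is a plain copy
    result = run_passthrough(source)
    if len(result) != len(reference):
        return max(len(result), len(reference))
    return sum(1 for got, want in zip(result, reference) if got != want)


if __name__ == "__main__":
    errors = check_passthrough()
    print("PASS" if errors == 0 else f"FAIL: {errors} mismatches")
```

Swapping the body of `run_passthrough` for a real device call would leave the comparison logic unchanged, which is roughly the role the testbench plays in each example.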

The passthrough DMAs example shows an alternative way to perform the copy: instead of involving the cores, it performs a loopback.

Basic

| Design name | Data type | Description |
|---|---|---|
| Vector Scalar Add | i32 | Adds 1 to every element in a vector |
| Vector Scalar Mul | i32 | Returns a vector multiplied by a scale factor |
| Vector Vector Add | i32 | Returns a vector summed with another vector |
| Vector Vector Modulo | i32 | Returns vector % vector |
| Vector Vector Multiply | i32 | Returns a vector multiplied by a vector |
| Vector Reduce Add | bfloat16 | Returns the sum of all elements in a vector |
| Vector Reduce Max | bfloat16 | Returns the maximum of all elements in a vector |
| Vector Reduce Min | bfloat16 | Returns the minimum of all elements in a vector |
| Vector Exp | bfloat16 | Returns a vector representing e^x of the inputs |
| DMA Transpose | i32 | Transposes a matrix with the Shim DMA using npu_dma_memcpy_nd |
| Matrix Scalar Add | i32 | Returns a matrix with a scalar added to its elements |
| Single core GEMM | bfloat16 | A single core matrix-matrix multiply |
| Multi core GEMM | bfloat16 | A matrix-matrix multiply using 16 AIEs with operand broadcast. Uses a simple "accumulate in place" strategy |
| GEMV | bfloat16 | A vector-matrix multiply returning a vector |
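
The "accumulate in place" strategy mentioned for the multi core GEMM can be illustrated with a small CPU-only sketch. It is an assumption about what the phrase means in general (the output buffer is zeroed once and each slice of the reduction dimension adds its partial products into that same buffer), not the repository's kernel. The function name, the `tile_k` parameter, and the use of Python integers instead of bfloat16 are all choices made for illustration.

```python
def matmul_accumulate_in_place(a, b, tile_k=2):
    """Multiply a (M x K) by b (K x N), adding one K-tile at a time into c."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0] * n for _ in range(m)]  # output buffer, zeroed once
    for k0 in range(0, k, tile_k):  # walk the reduction dimension in tiles
        k1 = min(k0 + tile_k, k)
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
                c[i][j] += partial  # accumulate into the same output buffer
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_accumulate_in_place(a, b))  # [[58, 64], [139, 154]]
```

The result does not depend on `tile_k`; the tile size only changes how many partial sums are added into each output element.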

Machine Learning Kernels

| Design name | Data type | Description |
|---|---|---|
| Eltwise Add | bfloat16 | An element by element addition of two vectors |
| Eltwise Mul | i32 | An element by element multiplication of two vectors |
| ReLU | bfloat16 | Rectified linear unit (ReLU) activation function on a vector |
| Softmax | bfloat16 | Softmax operation on a matrix |
| Conv2D | i8 | A single core 2D convolution for CNNs |
| Conv2D+ReLU | i8 | A Conv2D with a ReLU fused at the vector register level |
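
To make the idea of fusing ReLU into Conv2D concrete, here is a minimal scalar sketch. It assumes a single channel, no padding, and plain Python integers, so it says nothing about how the actual i8 kernel uses vector registers. It only shows the general pattern: the activation is applied to each accumulated value before it is stored, so no second pass over the output is needed. The function name is hypothetical.

```python
def conv2d_relu_fused(image, kernel):
    """Single-channel 2D convolution (no padding) with ReLU applied per output.

    Uses the CNN convention (cross-correlation: the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(max(acc, 0))  # ReLU applied before the value is stored
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 0, 6], [7, 8, 1]]
    kernel = [[-1, 0], [0, 1]]
    print(conv2d_relu_fused(image, kernel))  # [[0, 4], [4, 1]]
```

In the example, the top-left output accumulates to -1 and is clamped to 0 by the fused ReLU; the other three outputs are already non-negative and pass through unchanged.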

Exercises

  1. Can you modify the passthrough design to copy more (or less) data?

  2. Take a look at the testbench in our Vector Exp example test.cpp. Take note of the data type and the size of the test vector. What do you notice?

  3. What is the communication-to-computation ratio in ReLU?

  4. (HARD) Which basic example is a component in Softmax?


[Prev - Section 4] [Top] [Next - Section 6]