Section 5 - Example Vector Designs

The programming examples are a set of sample designs that illustrate many of the unique features of AI Engines and the NPU array in Ryzen™ AI.

Simplest

Passthrough

The passthrough example is the simplest "getting started" design. It copies 4096 bytes from the input to the output using vectorized loads and stores. The example also shows a typical project organization that is easy to reproduce in other examples. Four files matter here:

aie2.py: The AIE structural design, which includes the shim tile connected to external memory and a single AIE core that performs the copy. It also shows a simple use of the ObjectFIFOs described in Section 2.
passthrough.cc: A C++ file that performs the vectorized copy operation.
test.cpp or test.py: A C++ or Python main application that exercises the design and compares the result against a CPU reference.
Makefile: A Makefile that documents (and implements) the build process for the various artifacts.
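
To make the comparison against a CPU reference more concrete, the following is a minimal host-side sketch in plain Python. It is not the code from the repository. It models the copy as fixed-width chunked loads and stores and then checks the output against the input. Only the 4096-byte buffer size comes from the text above; the 64-byte chunk width and the names `passthrough_reference` and `make_input` are assumptions chosen for illustration.

```python
import random

BUFFER_BYTES = 4096   # buffer size stated in the text
VECTOR_BYTES = 64     # assumed chunk width, for illustration only


def passthrough_reference(src: bytes, vector_bytes: int = VECTOR_BYTES) -> bytes:
    """Copy src to a new buffer one fixed-width chunk at a time."""
    if len(src) % vector_bytes != 0:
        raise ValueError("buffer length must be a multiple of the chunk width")
    dst = bytearray(len(src))
    for offset in range(0, len(src), vector_bytes):
        chunk = src[offset:offset + vector_bytes]    # stands in for a vector load
        dst[offset:offset + vector_bytes] = chunk    # stands in for a vector store
    return bytes(dst)


def make_input(n_bytes: int = BUFFER_BYTES, seed: int = 0) -> bytes:
    """Build a reproducible pseudo-random test buffer."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n_bytes))


if __name__ == "__main__":
    data = make_input()
    out = passthrough_reference(data)
    print("PASS" if out == data else "FAIL")
```

Changing `BUFFER_BYTES` in a sketch like this is a cheap way to reason about which sizes stay aligned to the chunk width before touching the real design.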

The passthrough DMAs example shows an alternative way to perform the copy that does not involve the cores and instead uses a loopback.

Basic

Design name	Data type	Description
Vector Scalar Add	i32	Adds 1 to every element in vector
Vector Scalar Mul	i32	Returns a vector multiplied by a scale factor
Vector Vector Add	i32	Returns a vector summed with another vector
Vector Vector Modulo	i32	Returns vector % vector
Vector Vector Multiply	i32	Returns a vector multiplied by a vector
Vector Reduce Add	bfloat16	Returns the sum of all elements in a vector
Vector Reduce Max	bfloat16	Returns the maximum of all elements in a vector
Vector Reduce Min	bfloat16	Returns the minimum of all elements in a vector
Vector Exp	bfloat16	Returns a vector representing e^x of the inputs
DMA Transpose	i32	Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd`
Matrix Scalar Add	i32	Returns a matrix with a scalar added to every element
Single core GEMM	bfloat16	A single core matrix-matrix multiply
Multi core GEMM	bfloat16	A matrix-matrix multiply using 16 AIEs with operand broadcast. Uses a simple "accumulate in place" strategy
GEMV	bfloat16	A vector-matrix multiply returning a vector
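
The DMA Transpose entry rests on the idea that a transpose can be expressed purely as an access pattern of sizes and strides, with no computation. The sketch below models that idea in plain Python. It is not the `npu_dma_memcpy_nd` API: the helpers `strided_read` and `transpose_by_strides`, and the convention of listing the outermost dimension first with strides measured in elements, are assumptions for illustration.

```python
from itertools import product


def strided_read(src, sizes, strides, offset=0):
    """Gather elements of a flat buffer by walking a nest of loops.

    sizes[i] is the trip count of loop i (outermost first) and strides[i]
    is the element step taken for each iteration of that loop.
    """
    if len(sizes) != len(strides):
        raise ValueError("sizes and strides must have the same length")
    out = []
    for index in product(*(range(s) for s in sizes)):
        address = offset + sum(i * st for i, st in zip(index, strides))
        out.append(src[address])
    return out


def transpose_by_strides(flat, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list."""
    # Outer loop walks the columns (step 1), inner loop walks the rows (step cols).
    return strided_read(flat, sizes=[cols, rows], strides=[1, cols])


if __name__ == "__main__":
    rows, cols = 2, 3
    matrix = [0, 1, 2,
              3, 4, 5]
    print(transpose_by_strides(matrix, rows, cols))  # [0, 3, 1, 4, 2, 5]
```

Swapping the two loop levels back (`sizes=[rows, cols]`, `strides=[cols, 1]`) gives a plain linear read, which is a useful sanity check when experimenting with access patterns.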

Machine Learning Kernels

Design name	Data type	Description
Eltwise Add	bfloat16	An element by element addition of two vectors
Eltwise Mul	i32	An element by element multiplication of two vectors
ReLU	bfloat16	Rectified linear unit (ReLU) activation function on a vector
Softmax	bfloat16	Softmax operation on a matrix
Conv2D	i8	A single core 2D convolution for CNNs
Conv2D+ReLU	i8	A Conv2D with a ReLU fused at the vector register level
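
The difference between running Conv2D followed by a separate ReLU and running the fused Conv2D+ReLU lies in where the activation is applied: in the fused form it is applied to the accumulator before the result is written out, so the intermediate tensor is never stored. The plain-Python sketch below illustrates that ordering on a single-channel, stride-1, unpadded convolution. The shapes, the saturation to the int8 range, and all function names are assumptions for illustration; the repository kernels may differ in layout, scaling, and rounding.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_int8(value):
    """Clamp an integer to the int8 range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def conv2d(image, kernel, fuse_relu=False):
    """Unpadded, stride-1 2D convolution (cross-correlation) on lists of ints."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0  # wide accumulator (Python ints do not overflow)
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            if fuse_relu:
                acc = max(acc, 0)  # activation applied before the store
            row.append(saturate_int8(acc))
        out.append(row)
    return out


def relu(tensor):
    """Separate ReLU pass over an already stored tensor."""
    return [[max(v, 0) for v in row] for row in tensor]


if __name__ == "__main__":
    image = [[1, -2, 3],
             [-4, 5, -6],
             [7, -8, 9]]
    kernel = [[1, 0],
              [0, -1]]
    unfused = relu(conv2d(image, kernel))
    fused = conv2d(image, kernel, fuse_relu=True)
    print(fused, fused == unfused)
```

In this model both paths produce the same values; the fused path simply avoids a second pass over the output.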

Exercises

Can you modify the passthrough design to copy more (or less) data?
Take a look at test.cpp, the testbench in our Vector Exp example. Take note of the data type and the size of the test vector. What do you notice?
What is the communication-to-computation ratio in ReLU?
HARD: Which basic example is a component of Softmax?

