
ORT Backend

User Interface

InferenceSession

  • Read inference_session.h to get familiar with the common interface functions.

The following are the important ones for training:

- Load()
- Run()
- NewIOBinding()
- RegisterGraphTransformer()
- RegisterExecutionProvider()
- Advanced: Initialize()

    - What happens under the hood when we call session.Initialize()?
  • Understand the configs in SessionOptions (session_options.h)

The following are the important ones for training (a minimal usage sketch follows the list):

- execution_order
- enable_mem_pattern
- use_deterministic_compute
- session_log_severity_level
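
A minimal sketch of setting these through the Python bindings; the exact property names (e.g. ort.ExecutionOrder) are assumptions that may differ slightly across versions, and "model.onnx" is a placeholder path:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.execution_order = ort.ExecutionOrder.PRIORITY_BASED  # vs. ort.ExecutionOrder.DEFAULT
so.enable_mem_pattern = True                            # pre-plan memory from inferred shapes
so.use_deterministic_compute = True                     # trade speed for reproducibility
so.log_severity_level = 1                               # 0=VERBOSE, 1=INFO, 2=WARNING, ...

sess = ort.InferenceSession("model.onnx", sess_options=so,
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
```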

IOBinding (IOBinding.h)

  • Prerequisite: What is an OrtValue? (ml_value.h)

  • BindInput()

    • How to create an OrtValue?
  • BindOutput()

    • With preallocated buffer

    • Without preallocated buffer

      • Who allocates the output buffer, and how is it returned to the user? (see the sketch after this list)
  • What should the lifespan of an IOBinding be? Can the user reuse an IOBinding across multiple Session::Run() calls?

  • How are bound inputs/outputs passed into ExecutionFrame?

  • Advanced: How is IOBinding different from DLPack's approach? What are the advantages of each?
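
A minimal sketch of the Python-side IOBinding workflow; it assumes a CUDA build and a model with input "X" and output "Y" (names and shapes are placeholders):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# An OrtValue wraps a tensor together with its device; here one is created on GPU 0.
x = np.random.rand(2, 3).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("X", x_gpu)
binding.bind_output("Y", "cuda", 0)        # no preallocated buffer: ORT allocates it

sess.run_with_iobinding(binding)           # bound values flow into the ExecutionFrame
y = binding.copy_outputs_to_cpu()[0]       # output copied back to the caller as numpy
```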

ORTModule (ortmodule.py)

  • Read training_agent.h to understand the available interface functions

  • Understand how ORTModule uses TrainingAgent

  • Read the ORTModule forward() and backward() functions (a minimal usage sketch follows this list)

    • How does ORTModule get an ONNX graph from a torch nn.Module?
    • How is ORT doing the auto-diff without torch's autograd engine?
    • How is ORT hijacking torch's forward/backward call?
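
A minimal training-loop sketch of how ORTModule is meant to be dropped in; the import path follows the orttraining Python package and is an assumption if your version differs:

```python
import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10)).cuda()
model = ORTModule(model)          # forward/backward now run through ORT via TrainingAgent

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 784, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), target)
loss.backward()                   # gradients come from ORT's gradient graph, not torch autograd
optimizer.step()
```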

Advanced: PyBind

  • How is the C++ InferenceSession exposed as Python's onnxruntime.InferenceSession?
  • Read onnxruntime_pybind_state.cc, onnxruntime_inference_collection.py for InferenceSession binding
  • Read orttraining_pybind_state.cc for TrainingAgent binding

Graph

Basic building blocks to describe a computation Graph

  • Node (graph.h/.cc)

    • What's the difference between an Op and a Node?

    • What are the common properties for a node?

      • Can a node's name be empty?
      • What's the identifier of a node in a graph? Index or Name?
    • Advanced: Function Ops, nodes with a FunctionBody

  • Graph (graph.h/.cc)

    • How to traverse from one node to another node?

    • What's the difference between a GraphInput and an Initializer?

    • Look for an example using GetProducerNode() and GetConsumerNodes() (a rough Python analogue is sketched after this list)

    • What's the purpose of Graph::Resolve()?

      • How is ShapeAndTypeInference invoked?
  • NodeArg (node_arg.h)

    • What is the relationship between a graph edge and a NodeArg?
    • What's the unique identifier of a NodeArg in a graph?
    • Action: Look for examples using Graph::GetOrCreateNodeArg() (you will need to use this at some point)
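
The C++ APIs above have no direct Python equivalent, but the producer/consumer idea can be sketched on a plain onnx GraphProto, where a NodeArg roughly corresponds to a value name. This is an analogue for intuition only, not ORT's Graph class; "model.onnx" is a placeholder:

```python
import onnx

model = onnx.load("model.onnx")
graph = model.graph

producer = {}    # value name -> node that produces it   (cf. Graph::GetProducerNode)
consumers = {}   # value name -> nodes that consume it   (cf. Graph::GetConsumerNodes)
for node in graph.node:
    for out in node.output:
        producer[out] = node
    for inp in node.input:
        consumers.setdefault(inp, []).append(node)

# Graph inputs are fed at runtime; initializers are constants stored inside the graph.
graph_inputs = {i.name for i in graph.input}
initializers = {init.name for init in graph.initializer}
```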

Graph Transformers

  • Understand the difference between GraphTransformer and RewriteRule

  • Understanding the purpose of GraphTransformerManager

    • How to register a set of graph transformers into a session?
  • Understanding the two versions of graph_transformer_utils.cc (onnxruntime/orttraining ones)

  • Get familiar with graph_utils.cc

  • Experiment with onnx.helper to compose an ONNX model from a script (see transpose_matmul_gen.py for examples, and the sketch after this list)

  • Action: Implement a graph transformer to get hands-on experience
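
A minimal onnx.helper sketch in the spirit of transpose_matmul_gen.py: it composes the Transpose + MatMul pattern that a transpose/matmul fusion transformer would look for (shapes and names are arbitrary):

```python
import onnx
from onnx import TensorProto, helper

A = helper.make_tensor_value_info("A", TensorProto.FLOAT, [3, 2])
B = helper.make_tensor_value_info("B", TensorProto.FLOAT, [3, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

nodes = [
    helper.make_node("Transpose", ["A"], ["A_t"], perm=[1, 0]),
    helper.make_node("MatMul", ["A_t", "B"], ["Y"]),
]
graph = helper.make_graph(nodes, "transpose_matmul", [A, B], [Y])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 12)])

onnx.checker.check_model(model)
onnx.save(model, "transpose_matmul.onnx")
```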

Training Graph

  • Understand the workflow of training graph transformation

  • Understand GraphAugmenter (graph_augmenter.h/.cc)

  • GradientGraphBuilder

    • Understand the purpose/usage of STOP_GRADIENT_EDGES
    • Understand the meaning of x_node_args/y_node_args
    • Advanced: Understand the back-propagation process in GradientGraphBuilder::Build()
  • Per Op GradientBuilder

    • Understand the Gradient Registry (gradient_builder_registry.cc)

    • Understand the Gradient Builder Declaration (gradient_builder.h)

    • Read a few examples in Gradient Builder Implementation (gradient_builder.cc)

      • Understand the shorthands of I, GI, O, GO (gradient_builder_base.h)

      • Understand how the gradient subgraph is composed from existing ops; the following are good examples

        • Easy: GetDropoutGradient, GetSqrtGradient
        • Medium: GetAddSubGradient, GetMulGradient
        • Hard: GetMatMulGradient, GetGemmGradient
      • Understand how broadcasting is handled when building the gradient graph (GradientBuilderBase::HandleBroadcasting())

      • Action: Implement a gradient definition for an op to get hands-on experience

Op and Kernels

Understand the difference between Schema and Kernel

Onnx

  • Read onnx.proto and onnx-ml.proto and understand the design principles behind them

  • Get familiar with the ONNX Operators: https://github.com/onnx/onnx/blob/master/docs/Operators.md

    • Must know: Dropout, MatMul, Gemm, Transpose, ReduceSum, Reshape
  • Understand the concept and purpose of opset and domain (a small example follows this list)

    • When to use which?

      • onnx domain
      • ms domain (com.microsoft)
  • Understand the C++ data structure in onnx::TensorProto, onnx::AttributeProto, onnx::TypeProto

  • Understand how Shape and Type Inference works in the schema definition

  • Function Ops
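
To make the domain distinction concrete, here is a small sketch: standard ops live in the default ("") onnx domain, while ORT contrib/training ops such as Gelu are registered under "com.microsoft"; the opset versions below are illustrative assumptions:

```python
from onnx import TensorProto, helper

X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 4])
W = helper.make_tensor_value_info("W", TensorProto.FLOAT, [4, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

matmul = helper.make_node("MatMul", ["X", "W"], ["XW"])                 # default onnx domain
gelu = helper.make_node("Gelu", ["XW"], ["Y"], domain="com.microsoft")  # ms (contrib) domain

graph = helper.make_graph([matmul, gelu], "gelu_example", [X, W], [Y])
model = helper.make_model(graph, opset_imports=[
    helper.make_opsetid("", 12),              # opset version for the onnx domain
    helper.make_opsetid("com.microsoft", 1),  # opset version for the contrib domain
])
```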

Op Schema

  • Understand the difference among the following 3 sets of schema. When to use which?

    • Onnx's op Schema (onnx repo: defs.cc)

    • contrib ops (contrib_defs.cc)

      • Good to know: LayerNorm, Gelu
    • training ops (training_op_defs.cc)

  • Action: Add an op or update an op's schema to get hands-on experience

Op Kernels

  • Kernel Declaration and Registry

    • Understand when to use which registry for a kernel

    • Inference Kernels

      • Onnx Op Kernels

        • cpu_execution_provider.cc
        • cuda_execution_provider.cc
      • Contrib Op Kernels

        • cpu_contrib_kernels.cc
        • cuda_contrib_kernels.cc
        • Advanced: rocm_contrib_kernels.cc
    • Training Kernels

      • CPU (cpu_training_kernels.cc)
      • CUDA (cuda_training_kernels.cc)
      • Advanced: ROCm (rocm_training_kernels.cc)
  • Kernel Implementation

    • Tensor vs. OrtValue

      • Read tensor.h and ml_value.h
      • What's the difference between Tensor and OrtValue? Why do we need two classes?
      • How to get a Tensor from OrtValue?
      • How to get data's raw pointer from a Tensor?
    • Kernel Definition

      • When to use Alias() and VariadicAlias()?
      • How to set TypeConstraint()?
      • When to use InputMemoryType?
    • CPU Kernel vs. CUDA Kernel

      • What does it mean to have a CPU input/output for a CUDA kernel?
    • Gradient Kernels

      • Examples

        • Easy: DropoutGrad, GeluGrad
        • Medium: GatherGrad
        • Hard: LayerNormalizationGrad
      • Understand how to write unit tests to check gradient's correctness

    • Understand how to use OpTester in unit tests

    • Action: Implement a kernel to get hands-on experience

Performance Investigation

Profiling Tools

  • nvprof

    • try running with/without --print-gpu-summary
    • try --profile-child-processes
    • Action: profile a training run
  • Visual Profiler UI

    • Use ruler to measure a time span
    • Identify the top hitters in kernels
    • Compare two sets of profiling results to identify the performance gap
    • Can you identify the start/end of a train_step from the timeline view?
  • torch profiler (a minimal usage sketch follows this list)

  • Linux perf
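
A minimal sketch of wrapping one train step with torch's autograd profiler; the model here is a stand-in, and an ORTModule-wrapped model can be profiled the same way:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("train_step.json")   # open in chrome://tracing
```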


CUDA Kernels Optimization

ExecutionProvider

What is an execution provider? What problems does it solve? (execution_provider.h)

CPU and CUDA are the most commonly used EPs in training (cpu/cuda_execution_provider.cc)

How do you register an execution provider into a session, or via the ORTModule interface?

What's the functionality of ExecutionProvider::GetCapability()?

Execution Engine

InferenceSession::Run()

  • Read RunOptions and understand the options (run_options.h)
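
A minimal sketch of per-call options from Python; sess and x are assumed to come from the earlier InferenceSession/IOBinding sketches, and only the logging knobs are shown:

```python
import onnxruntime as ort

ro = ort.RunOptions()
ro.log_severity_level = 0      # verbose logging for this run only
ro.log_verbosity_level = 1

outputs = sess.run(None, {"X": x}, run_options=ro)   # None = fetch all outputs
```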

SequentialExecutor::Execute()

  • What's the purpose of ExecutionFrame? (execution_frame.h)

    • How is one node's output passed in as another node's input?
    • What happens when we call context->Output() inside an op kernel?
    • How are feeds and fetches stored in ExecutionFrame?
  • How is the execution order determined? (graph_viewer.cc)

    • Default execution order uses Graph::ReverseDFS() to generate a topological sort
    • Priority-based execution order uses Graph::KahnsTopologicalSort with per-node priority
  • How is each node's kernel invoked?

  • How does ORT guarantee that all CUDA kernels have completed before Session::Run() returns?

Advanced: GraphPartitioner

  • How is it decided which execution provider each node is placed on? (graph_partitioner.h)

Memory

BFCArena

  • Why do we need an arena? What problem does it solve?

Memory Planning

Memory Pattern

  • How does ORT estimate peak memory consumption?

External allocator: Torch's CUDACachingAllocator

  • How does ORTModule use the PyTorch allocator? (ortmodule.py)
  • Advanced: What's the difference between BFCArena and CUDACachingAllocator?

CUDA Programming

CUDA programming basics

  • Understand the hardware

    • Architecture Generations

      • P100: Pascal / sm60
      • V100: Volta / sm70
      • A100: Ampere / sm80
    • CUDA Core vs. Tensor Core

  • Programming model

    • Thread
    • Block
    • Grid
    • Stream
  • Must-know functions

    • cudaMalloc() vs. cudaFree()
    • cudaMemcpy() vs. cudaMemcpyAsync()
    • cudaMemset() vs. cudaMemsetAsync()
    • cudaStreamSynchronize() vs. cudaDeviceSynchronize()
    • cudaEventRecord() vs. cudaStreamWaitEvent()

Common tricks

  • Avoid memcpy
  • Avoid unnecessary syncs
  • Preprocess data on the CPU
  • When to use #pragma unroll?

CUDA Kernel Examples

  • Easy: Dropout/DropoutGrad
  • Medium: SoftmaxCrossEntropyLoss(Grad)
  • Hard: LayerNormalization, ReduceSum, GatherGrad

Debugging CUDA kernels

  • printf() works inside CUDA code
  • Memcpy data to the CPU for inspection

Understanding IO bound and compute bound

Distributed Training

Prerequisites: NCCL

Good to know: MPI

Data Parallelism

  • Understand NCCLAllReduce
  • Get familiar with DDP usage/setup (a minimal setup sketch follows this list)
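
A minimal DDP setup sketch (plain torch; wrapping an ORTModule model follows the same pattern). It assumes a distributed launcher sets RANK/WORLD_SIZE/MASTER_ADDR and LOCAL_RANK in the environment:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")       # NCCL backend for GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])   # gradients are allreduced via NCCL

x = torch.randn(32, 1024, device="cuda")
model(x).sum().backward()                     # allreduce overlaps with backward
```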

Megatron

ZeRO

Mixture of Experts

  • Understand All2All

Pipeline Parallelism

Model Training Domain Knowledge

ML Knowledge

  • Understand the meaning and implications of common configurations: batch size, seq len, learning rate, weight decay, global norm, loss scale...
  • Familiarize yourself with the common patterns of a decreasing loss curve and learn to spot abnormal patterns
  • Understand the difference between optimizers: SGD, Adam, and LAMB
  • Advanced: Understanding backpropagation: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

Know-hows

  • Get familiar with running/monitoring AML experiments
  • Familiarize yourself with setting up TensorBoard
  • Action: submit a distributed training job to an AML cluster and get familiar with its user interface/logging/available metrics

Convergence Investigation

  • Remove all randomness from the program (a minimal sketch follows this list)

    • Set seeds
    • Set the dropout ratio to 0
    • Set use_deterministic_compute=True
  • Shrink the repro condition to the bare minimum, as long as it still reproduces

    • Use 1 layer model
    • Use smaller hidden_size
    • Use single GPU
    • ...
  • Common Tricks

    • Set the learning rate to 0 to disable model change
  • Advanced: how to do hyper-parameter tuning to get the model to converge better?
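
A minimal "remove all randomness" sketch; onnxruntime.set_seed is exposed by the training build and the rest is standard torch/numpy seeding (exact names are version-dependent assumptions):

```python
import random
import numpy as np
import torch
import onnxruntime as ort

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
ort.set_seed(seed)                          # seeds ORT's random ops (e.g. Dropout)

torch.backends.cudnn.deterministic = True   # force deterministic cuDNN algorithms
torch.backends.cudnn.benchmark = False
# plus SessionOptions.use_deterministic_compute = True on the ORT side (see above)
```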

Action: Train a model E2E to get hands-on experience

Know-hows

Conda

Docker

VS Code

  • Setting up VS Code with a remote VM
  • Debugging within VS Code

Debugging with gdb / pdb

Common debugging Tricks

  • Getting the .onnx inference/training graph
  • Enable I/O Dump
  • Enable execution plan and memory plan dump
  • Enable CPU profiling dump
  • Enable CUDA memory consumption logs

Learning Roadmap

Basic

  • InferenceSession/ORTModule
  • Graph/Node/NodeArg
  • Onnx/Op/Schema/Kernel
  • ORTValue/Tensor
  • GraphTransformer
  • Per-op Gradient Building
  • Performance Investigation

Intermediate

  • ExecutionProvider
  • IOBinding/dlpack
  • PyBind
  • Gradient Graph Building
  • CUDA Programming

Advanced

  • Execution Engine

    • SessionState
    • ExecutionFrame
  • Memory

  • Distributed Training

Domain Experts

  • Performance optimization for CUDA kernels
  • Hyper-parameter tuning