
ORT Backend

User Interface

InferenceSession

  • Read inference_session.h to get familiar with the common interface functions.

The following are the important ones for training:

- Load()
- Run()
- NewIOBinding()
- RegisterGraphTransformer()
- RegisterExecutionProvider()
- Advanced: Initialize()

    - What happens under the hood when we call session.Initialize()?
  • Understand the configs in SessionOptions (session_options.h)

The following are the important ones for training (a minimal usage sketch follows the list):

- execution_order
- enable_mem_pattern
- use_deterministic_compute
- session_log_severity_level
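
A minimal sketch of setting these through the Python bindings; the exact property names (e.g. ort.ExecutionOrder) are assumptions that may differ slightly across versions, and "model.onnx" is a placeholder path:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.execution_order = ort.ExecutionOrder.PRIORITY_BASED  # vs. ort.ExecutionOrder.DEFAULT
so.enable_mem_pattern = True                            # pre-plan memory from inferred shapes
so.use_deterministic_compute = True                     # trade speed for reproducibility
so.log_severity_level = 1                               # 0=VERBOSE, 1=INFO, 2=WARNING, ...

sess = ort.InferenceSession("model.onnx", sess_options=so,
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
```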

IOBinding (IOBinding.h)

  • Prerequisite: What is an OrtValue? (ml_value.h)

  • BindInput()

    • How to create an OrtValue?
  • BindOutput()

    • With preallocated buffer

    • Without preallocated buffer

      • Who allocates the output buffer, and how is it returned to the user? (see the sketch after this list)
  • What should the lifespan of an IOBinding be? Can the user reuse an IOBinding across multiple Session::Run() calls?

  • How are bound inputs/outputs passed into ExecutionFrame?

  • Advanced: How is IOBinding different from DLPack's approach? What are the advantages of each?
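
A minimal sketch of the Python-side IOBinding workflow; it assumes a CUDA build and a model with input "X" and output "Y" (names and shapes are placeholders):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# An OrtValue wraps a tensor together with its device; here one is created on GPU 0.
x = np.random.rand(2, 3).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("X", x_gpu)
binding.bind_output("Y", "cuda", 0)        # no preallocated buffer: ORT allocates it

sess.run_with_iobinding(binding)           # bound values flow into the ExecutionFrame
y = binding.copy_outputs_to_cpu()[0]       # output copied back to the caller as numpy
```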

ORTModule (ortmodule.py)

  • Read training_agent.h to understand the available interface functions

  • Understand how ORTModule uses TrainingAgent

  • Read the ORTModule forward() and backward() functions (a minimal usage sketch follows this list)

    • How does ORTModule get an ONNX graph from a torch nn.Module?
    • How is ORT doing the auto-diff without torch's autograd engine?
    • How is ORT hijacking torch's forward/backward call?
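
A minimal training-loop sketch of how ORTModule is meant to be dropped in; the import path follows the orttraining Python package and is an assumption if your version differs:

```python
import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10)).cuda()
model = ORTModule(model)          # forward/backward now run through ORT via TrainingAgent

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 784, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), target)
loss.backward()                   # gradients come from ORT's gradient graph, not torch autograd
optimizer.step()
```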

Advanced: PyBind

  • How is the C++ InferenceSession exposed as Python's onnxruntime.InferenceSession?
  • Read onnxruntime_pybind_state.cc, onnxruntime_inference_collection.py for InferenceSession binding
  • Read orttraining_pybind_state.cc for TrainingAgent binding

Graph

Basic building blocks to describe a computation Graph

  • Node (graph.h/.cc)

    • What's the difference between an Op and a Node?

    • What are the common properties for a node?

      • Can a node's name be empty?
      • What's the identifier of a node in a graph? Index or Name?
    • Advanced: Function Ops, nodes with a FunctionBody

  • Graph (graph.h/.cc)

    • How to traverse from one node to another node?

    • What's the difference between a GraphInput and an Initializer?

    • Look for an example using GetProducerNode() and GetConsumerNodes() (a rough Python analogue is sketched after this list)

    • What's the purpose of Graph::Resolve()?

      • How is ShapeAndTypeInference invoked?
  • NodeArg (node_arg.h)

    • What is the relationship between a graph edge and a NodeArg?
    • What's the unique identifier of a NodeArg in a graph?
    • Action: Look for examples using Graph::GetOrCreateNodeArg() (you will need to use this at some point)
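
The C++ APIs above have no direct Python equivalent, but the producer/consumer idea can be sketched on a plain onnx GraphProto, where a NodeArg roughly corresponds to a value name. This is an analogue for intuition only, not ORT's Graph class; "model.onnx" is a placeholder:

```python
import onnx

model = onnx.load("model.onnx")
graph = model.graph

producer = {}    # value name -> node that produces it   (cf. Graph::GetProducerNode)
consumers = {}   # value name -> nodes that consume it   (cf. Graph::GetConsumerNodes)
for node in graph.node:
    for out in node.output:
        producer[out] = node
    for inp in node.input:
        consumers.setdefault(inp, []).append(node)

# Graph inputs are fed at runtime; initializers are constants stored inside the graph.
graph_inputs = {i.name for i in graph.input}
initializers = {init.name for init in graph.initializer}
```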

Graph Transformers

  • Understand the difference between GraphTransformer and RewriteRule

  • Understanding the purpose of GraphTransformerManager

    • How to register a set of graph transformers into a session?
  • Understanding the two versions of graph_transformer_utils.cc (onnxruntime/orttraining ones)

  • Get familiar with graph_utils.cc

  • Experiment with onnx.helper to compose an ONNX model from a script (see transpose_matmul_gen.py for examples, and the sketch after this list)

  • Action: Implement a graph transformer to get hands-on experience
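
A minimal onnx.helper sketch in the spirit of transpose_matmul_gen.py: it composes the Transpose + MatMul pattern that a transpose/matmul fusion transformer would look for (shapes and names are arbitrary):

```python
import onnx
from onnx import TensorProto, helper

A = helper.make_tensor_value_info("A", TensorProto.FLOAT, [3, 2])
B = helper.make_tensor_value_info("B", TensorProto.FLOAT, [3, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

nodes = [
    helper.make_node("Transpose", ["A"], ["A_t"], perm=[1, 0]),
    helper.make_node("MatMul", ["A_t", "B"], ["Y"]),
]
graph = helper.make_graph(nodes, "transpose_matmul", [A, B], [Y])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 12)])

onnx.checker.check_model(model)
onnx.save(model, "transpose_matmul.onnx")
```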

Training Graph

  • Understand the workflow of training graph transformation

  • Understand GraphAugmenter (graph_augmenter.h/.cc)

  • GradientGraphBuilder

    • Understand the purpose/usage of STOP_GRADIENT_EDGES
    • Understand the meaning of x_node_args/y_node_args
    • Advanced: Understand the back-propagation process in GradientGraphBuilder::Build()
  • Per Op GradientBuilder

    • Understand the Gradient Registry (gradient_builder_registry.cc)

    • Understand the Gradient Builder Declaration (gradient_builder.h)

    • Read a few examples in Gradient Builder Implementation (gradient_builder.cc)

      • Understand the shorthands of I, GI, O, GO (gradient_builder_base.h)

      • Understand how the gradient subgraph is composed from existing ops; the following are good examples

        • Easy: GetDropoutGradient, GetSqrtGradient
        • Medium: GetAddSubGradient, GetMulGradient
        • Hard: GetMatMulGradient, GetGemmGradient
      • Understand how broadcasting is handled when building the gradient graph (GradientBuilderBase::HandleBroadcasting())

      • Action: Implement a gradient definition for an op to get hands-on experience

Op and Kernels

Understand the difference between Schema and Kernel

Onnx

  • Read onnx.proto and onnx-ml.proto and understand the design principles behind them

  • Get familiar with the ONNX Operators: https://github.com/onnx/onnx/blob/master/docs/Operators.md

    • Must know: Dropout, MatMul, Gemm, Transpose, ReduceSum, Reshape
  • Understand the concept and purpose of opset and domain (a small example follows this list)

    • When to use which?

      • onnx domain
      • ms domain (com.microsoft)
  • Understand the C++ data structure in onnx::TensorProto, onnx::AttributeProto, onnx::TypeProto

  • Understand how Shape and Type Inference works in the schema definition

  • Function Ops
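
To make the domain distinction concrete, here is a small sketch: standard ops live in the default ("") onnx domain, while ORT contrib/training ops such as Gelu are registered under "com.microsoft"; the opset versions below are illustrative assumptions:

```python
from onnx import TensorProto, helper

X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 4])
W = helper.make_tensor_value_info("W", TensorProto.FLOAT, [4, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

matmul = helper.make_node("MatMul", ["X", "W"], ["XW"])                 # default onnx domain
gelu = helper.make_node("Gelu", ["XW"], ["Y"], domain="com.microsoft")  # ms (contrib) domain

graph = helper.make_graph([matmul, gelu], "gelu_example", [X, W], [Y])
model = helper.make_model(graph, opset_imports=[
    helper.make_opsetid("", 12),              # opset version for the onnx domain
    helper.make_opsetid("com.microsoft", 1),  # opset version for the contrib domain
])
```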

Op Schema

  • Understand the difference among the following 3 sets of schema. When to use which?

    • Onnx's op Schema (onnx repo: defs.cc)

    • contrib ops (contrib_defs.cc)

      • Good to know: LayerNorm, Gelu
    • training ops (training_op_defs.cc)

  • Action: Add an op or update an op's schema to get hands-on experience

Op Kernels

  • Kernel Declaration and Registry

    • Understand when to use which registry for a kernel

    • Inference Kernels

      • Onnx Op Kernels

        • cpu_execution_provider.cc
        • cuda_execution_provider.cc
      • Contrib Op Kernels

        • cpu_contrib_kernels.cc
        • cuda_contrib_kernels.cc
        • Advanced: rocm_contrib_kernels.cc
    • Training Kernels

      • CPU (cpu_training_kernels.cc)
      • CUDA (cuda_training_kernels.cc)
      • Advanced: ROCm (rocm_training_kernels.cc)
  • Kernel Implementation

    • Tensor vs. OrtValue

      • Read tensor.h and ml_value.h
      • What's the difference between Tensor and OrtValue? Why do we need two classes?
      • How to get a Tensor from OrtValue?
      • How to get data's raw pointer from a Tensor?
    • Kernel Definition

      • When to use Alias() and VariadicAlias()?
      • How to set TypeConstraint()?
      • When to use InputMemoryType?
    • CPU Kernel vs. CUDA Kernel

      • What does it mean to have a CPU input/output for a CUDA kernel?
    • Gradient Kernels

      • Examples

        • Easy: DropoutGrad, GeluGrad
        • Medium: GatherGrad
        • Hard: LayerNormalizationGrad
      • Understand how to write unit tests to check gradient's correctness

    • Understand how to use OpTester in unit tests

    • Action: Implement a kernel to get hands-on experience

Performance Investigation

Profiling Tools

  • nvprof

    • try running with/without --print-gpu-summary
    • try --profile-child-processes
    • Action: profile a training run
  • Visual Profiler UI

    • Use ruler to measure a time span
    • Identify the top hitters in kernels
    • Compare two sets of profiling results to identify the performance gap
    • Can you identify the start/end of a train_step from the timeline view?
  • torch profiler (a minimal usage sketch follows this list)

  • Linux perf
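
A minimal sketch of wrapping one train step with torch's autograd profiler; the model here is a stand-in, and an ORTModule-wrapped model can be profiled the same way:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("train_step.json")   # open in chrome://tracing
```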


CUDA Kernels Optimization

ExecutionProvider

What is an execution provider? What problems does it solve? (execution_provider.h)

CPU and CUDA are the most commonly used EPs in training (cpu/cuda_execution_provider.cc)

How do you register an execution provider into a session, or via the ORTModule interface?

What's the functionality of ExecutionProvider::GetCapability()?

Execution Engine

InferenceSession::Run()

  • Read RunOptions and understand the options (run_options.h)
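
A minimal sketch of per-call options from Python; sess and x are assumed to come from the earlier InferenceSession/IOBinding sketches, and only the logging knobs are shown:

```python
import onnxruntime as ort

ro = ort.RunOptions()
ro.log_severity_level = 0      # verbose logging for this run only
ro.log_verbosity_level = 1

outputs = sess.run(None, {"X": x}, run_options=ro)   # None = fetch all outputs
```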

SequentialExecutor::Execute()

  • What's the purpose of ExecutionFrame? (execution_frame.h)

    • How is one node's output passed in as another node's input?
    • What happens when we call context->Output() inside an op kernel?
    • How are feeds and fetches stored in ExecutionFrame?
  • How is the execution order determined? (graph_viewer.cc)

    • Default execution order uses Graph::ReverseDFS() to generate a topological sort
    • Priority-based execution order uses Graph::KahnsTopologicalSort with per-node priority
  • How is each node's kernel invoked?

  • How does ORT guarantee that all CUDA kernels have completed before Session::Run() returns?

Advanced: GraphPartitioner

  • How is it decided which execution provider each node is placed on? (graph_partitioner.h)

Memory

BFCArena

  • Why do we need an arena? What problem does it solve?

Memory Planning

Memory Pattern

  • How does ORT estimate peak memory consumption?

External allocator: Torch's CUDACachingAllocator

  • How does ORTModule use the PyTorch allocator? (ortmodule.py)
  • Advanced: What's the difference between BFCArena and CUDACachingAllocator?

CUDA Programming

CUDA programming basics

  • Understand the hardware

    • Architecture Generations

      • P100: Pascal / sm60
      • V100: Volta / sm70
      • A100: Ampere / sm80
    • CUDA Core vs. Tensor Core

  • Programming model

    • Thread
    • Block
    • Grid
    • Stream
  • Must-know functions

    • cudaMalloc() vs. cudaFree()
    • cudaMemcpy() vs. cudaMemcpyAsync()
    • cudaMemset() vs. cudaMemsetAsync()
    • cudaStreamSynchronize() vs. cudaDeviceSynchronize()
    • cudaEventRecord() vs. cudaStreamWaitEvent()

Common tricks

  • Avoid memcpy
  • Avoid unnecessary syncs
  • Preprocess data on the CPU
  • When to use #pragma unroll?

CUDA Kernel Examples

  • Easy: Dropout/DropoutGrad
  • Medium: SoftmaxCrossEntropyLoss(Grad)
  • Hard: LayerNormalization, ReduceSum, GatherGrad

Debugging CUDA kernels

  • printf() works inside CUDA code
  • Memcpy data to the CPU for inspection

Understanding IO bound and compute bound

Distributed Training

Prerequisites: NCCL

Good to know: MPI

Data Parallelism

  • Understand NCCLAllReduce
  • Get familiar with DDP usage/setup (a minimal setup sketch follows this list)
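
A minimal DDP setup sketch (plain torch; wrapping an ORTModule model follows the same pattern). It assumes a distributed launcher sets RANK/WORLD_SIZE/MASTER_ADDR and LOCAL_RANK in the environment:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")       # NCCL backend for GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])   # gradients are allreduced via NCCL

x = torch.randn(32, 1024, device="cuda")
model(x).sum().backward()                     # allreduce overlaps with backward
```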

Megatron

ZeRO

Mixture of Experts

  • Understand All2All

Pipeline Parallelism

Model Training Domain Knowledge

ML Knowledge

  • Understand the meaning and implications of common configurations: batch size, seq len, learning rate, weight decay, global norm, loss scale...
  • Familiarize yourself with the common patterns of a decreasing loss curve and learn to spot abnormal patterns
  • Understand the difference between optimizers: SGD, Adam, and LAMB
  • Advanced: Understanding backpropagation: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

Know-hows

  • Get familiar with running/monitoring AML experiments
  • Familiarize yourself with setting up TensorBoard
  • Action: submit a distributed training job to an AML cluster and get familiar with its user interface/logging/available metrics

Convergence Investigation

  • Remove all randomness from the program (a minimal sketch follows this list)

    • Set seeds
    • Set the dropout ratio to 0
    • Set use_deterministic_compute=True
  • Shrink the repro condition to the bare minimum, as long as it still reproduces

    • Use 1 layer model
    • Use smaller hidden_size
    • Use single GPU
    • ...
  • Common Tricks

    • Set the learning rate to 0 to disable model change
  • Advanced: how to do hyper-parameter tuning to get the model to converge better?
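
A minimal "remove all randomness" sketch; onnxruntime.set_seed is exposed by the training build and the rest is standard torch/numpy seeding (exact names are version-dependent assumptions):

```python
import random
import numpy as np
import torch
import onnxruntime as ort

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
ort.set_seed(seed)                          # seeds ORT's random ops (e.g. Dropout)

torch.backends.cudnn.deterministic = True   # force deterministic cuDNN algorithms
torch.backends.cudnn.benchmark = False
# plus SessionOptions.use_deterministic_compute = True on the ORT side (see above)
```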

Action: Train a model E2E to get hands-on experience

Know-hows

Conda

Docker

VS Code

  • Setting up VS Code with a remote VM
  • Debugging within VS Code

Debugging with gdb / pdb

Common debugging Tricks

  • Getting the .onnx inference/training graph
  • Enable I/O Dump
  • Enable execution plan and memory plan dump
  • Enable CPU profiling dump
  • Enable CUDA memory consumption logs

Learning Roadmap

Basic

  • InferenceSession/ORTModule
  • Graph/Node/NodeArg
  • Onnx/Op/Schema/Kernel
  • ORTValue/Tensor
  • GraphTransformer
  • Per-op Gradient Building
  • Performance Investigation

Intermediate

  • ExecutionProvider
  • IOBinding/dlpack
  • PyBind
  • Gradient Graph Building
  • CUDA Programming

Advanced

  • Execution Engine

    • SessionState
    • ExecutionFrame
  • Memory

  • Distributed Training

Domain Experts

  • Performance optimization for CUDA kernels
  • Hyper-parameter tuning