Performance Investigation
- Make sure you are using a RelWithDebInfo build. (A Debug build is significantly slower and shouldn't be used for benchmarking or performance investigation.)
- Turn on logging by setting `self._verbosity = Verbosity.INFO` (or even `Verbosity.VERBOSE`) in ortmodule.py
- Turn on model dumping by setting `self._save_onnx = True` and `self._save_onnx_prefix = '<MODEL NAME>'` in ortmodule.py
- Turn on dumping the optimized graph by adding `session_options.optimized_model_filepath = '<MODEL NAME>_optimized'` in ortmodule.py
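Taken together, the edits above look roughly like this. This is a sketch against stand-in objects (the real attributes live on the ORTModule instance and its session options inside ortmodule.py, and may differ between releases); `my_model` is an illustrative prefix standing in for `<MODEL NAME>`:

```python
from types import SimpleNamespace

# Stand-ins for the ORTModule instance, its session options, and the
# Verbosity enum; in practice these assignments go directly into ortmodule.py.
Verbosity = SimpleNamespace(INFO=1, VERBOSE=2)   # placeholder enum
self_ = SimpleNamespace()                        # stand-in for the ORTModule `self`
session_options = SimpleNamespace()              # stand-in for ORT session options

self_._verbosity = Verbosity.INFO                # or Verbosity.VERBOSE for more detail
self_._save_onnx = True                          # dump the intermediate ONNX graphs
self_._save_onnx_prefix = 'my_model'             # dumped files are named my_model_*.onnx
session_options.optimized_model_filepath = 'my_model_optimized'  # dump the final graph
```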
Hint
For better visualization in Netron, you can shrink the model to 1 or 2 layers.
Notice
- `*_inference.onnx` is the ONNX model coming directly out of the exporter, without any graph transformations.
- `*_inference_optimized.onnx` is the inference graph after graph transformations (you need to dump this).
- `*_training.onnx` is the training graph built on top of the `*_inference_optimized.onnx` graph.
- `*_optimized.onnx` is the final optimized training graph, i.e. the graph actually executed by the execution engine.
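When comparing the dumped graphs, a quick op-type histogram of each stage makes differences (new Memcpy nodes, unfused ops) easy to spot. A minimal sketch over a plain list of op-type strings; with the `onnx` package installed, that list would come from `[n.op_type for n in onnx.load(path).graph.node]`, and the op list below is hypothetical:

```python
from collections import Counter

def summarize_ops(op_types):
    """Count op occurrences in a dumped graph, given its node op_type strings."""
    return Counter(op_types)

# Hypothetical op list from a *_optimized.onnx dump:
ops = ["MatMul", "FusedMatMul", "MemcpyToHost", "Clip", "MemcpyFromHost", "MatMul"]
print(summarize_ops(ops).most_common(3))
# → [('MatMul', 2), ('FusedMatMul', 1), ('MemcpyToHost', 1)]
```

Diffing two such histograms (e.g. `*_inference_optimized.onnx` vs. `*_optimized.onnx`) shows exactly which ops the training-graph transformations added or fused away.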
-
Excessive memcpy nodes
Search for 'memcpy' in the `*_optimized.onnx` graph. In the ideal case, there should be zero memcpy nodes in the final optimized training graph.
- If the CUDA kernel is missing for an op, you will commonly see a node sandwiched between MemcpyToHost and MemcpyFromHost nodes.
- If the producer node and consumer node expect the tensor to be on different devices, a memcpy node will be inserted between them.
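The "sandwiched node" pattern can be detected mechanically. A minimal sketch over a simplified node representation (tuples of op type, inputs, outputs standing in for ONNX `NodeProto`s); the example graph is hypothetical:

```python
def find_sandwiched_nodes(nodes):
    """Flag nodes whose inputs all come from MemcpyToHost and whose outputs all
    feed MemcpyFromHost -- the classic sign of a missing CUDA kernel.
    `nodes` is a list of (op_type, inputs, outputs) tuples."""
    producer = {out: op for op, ins, outs in nodes for out in outs}
    consumers = {}
    for op, ins, outs in nodes:
        for i in ins:
            consumers.setdefault(i, []).append(op)
    flagged = []
    for op, ins, outs in nodes:
        if op.startswith("Memcpy"):
            continue
        from_host = bool(ins) and all(producer.get(i) == "MemcpyToHost" for i in ins)
        to_host = bool(outs) and all(c == "MemcpyFromHost"
                                     for o in outs for c in consumers.get(o, []))
        if from_host and to_host:
            flagged.append(op)
    return flagged

# Hypothetical graph: Clip has no CUDA kernel, so it is routed through the CPU.
nodes = [
    ("MemcpyToHost", ["x_gpu"], ["x_cpu"]),
    ("Clip", ["x_cpu"], ["y_cpu"]),
    ("MemcpyFromHost", ["y_cpu"], ["y_gpu"]),
]
print(find_sandwiched_nodes(nodes))  # → ['Clip']
```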
-
CUDA Kernel Missing for an Op
This can usually be discovered by the following methods:
- Look for logs with the following pattern:
CUDA Kernel not found in registries for Op type: Clip node name: Clip_150
- Look for a node sandwiched between MemcpyToHost and MemcpyFromHost nodes in the `*_optimized.onnx` graph.
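On a long training log, scanning for that pattern by hand is tedious; a small regex pass collects every missing-kernel report at once. A sketch with an illustrative log snippet (the op types and node names below are made up):

```python
import re

# Regex for the ORT log line quoted above.
PATTERN = re.compile(
    r"CUDA Kernel not found in registries for Op type: (\w+) node name: (\S+)")

log = """\
some unrelated line
CUDA Kernel not found in registries for Op type: Clip node name: Clip_150
CUDA Kernel not found in registries for Op type: Trilu node name: Trilu_7
"""

missing = PATTERN.findall(log)  # list of (op_type, node_name) pairs
print(missing)  # → [('Clip', 'Clip_150'), ('Trilu', 'Trilu_7')]
```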
-
Missing Graph Transformers
ORT uses pattern matching to look for opportunities to apply graph transformations. If the graph differs from the coded pattern, the graph transformation may fail to kick in.
- Look for (Simplified)LayerNormalization in the `*_inference_optimized.onnx` graph. The layernorm subgraph should be fused into a single node.
- Look for (Fast)Gelu in the `*_inference_optimized.onnx` graph. The gelu subgraph should be fused into a single node.
- Look for stand-alone MatMul nodes in the `*_optimized.onnx` graph. Most of the MatMuls should have been fused with a leading Transpose/Scale into FusedMatMul nodes, or with a following Add into Gemm nodes. Examine the unfused MatMul nodes to see if they should have been fused with surrounding ops.
- Look for stand-alone Dropout nodes in the `*_optimized.onnx` graph. Examine whether they should be fused with surrounding Add ops into a BiasDropout node.
- Look for stand-alone Softmax nodes in the `*_optimized.onnx` graph. Examine whether they should be fused with the leading Add ops into a BiasSoftmax node.
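The stand-alone MatMul check can also be scripted. A sketch over the same simplified node representation as above (named tuples of op type, inputs, outputs standing in for ONNX nodes); it only flags the two fusion hints named in the list, and the example subgraph is hypothetical:

```python
def fusion_candidates(nodes):
    """Flag stand-alone MatMul nodes that look fusable: a leading Transpose
    suggests FusedMatMul, a trailing Add suggests Gemm.
    `nodes` is a list of (name, op_type, inputs, outputs) tuples."""
    producer = {out: op for _, op, ins, outs in nodes for out in outs}
    consumers = {}
    for _, op, ins, outs in nodes:
        for i in ins:
            consumers.setdefault(i, []).append(op)
    hits = []
    for name, op, ins, outs in nodes:
        if op != "MatMul":
            continue
        if any(producer.get(i) == "Transpose" for i in ins):
            hits.append((name, "leading Transpose -> FusedMatMul?"))
        elif any(c == "Add" for o in outs for c in consumers.get(o, [])):
            hits.append((name, "trailing Add -> Gemm?"))
    return hits

# Hypothetical unfused subgraph from a *_optimized.onnx dump:
nodes = [
    ("t0", "Transpose", ["w"], ["wT"]),
    ("mm0", "MatMul", ["x", "wT"], ["h"]),
    ("add0", "Add", ["h", "b"], ["y"]),
]
print(fusion_candidates(nodes))  # → [('mm0', 'leading Transpose -> FusedMatMul?')]
```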
-
nvprof
- Try running with/without --print-gpu-summary
- Try --profile-child-processes
- Action: profile a training run
-
Visual Profiler UI
- Use the ruler to measure a time span
- Identify the top hitters among kernels
- Compare two sets of profiling results to identify the performance gap
- Can you identify the start/end of a train_step from the timeline view?
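The comparison step can be done outside the UI once per-kernel totals are exported. A hand-rolled sketch, assuming each profile has been reduced to a `{kernel_name: total_ms}` dict (the kernel names and timings below are illustrative):

```python
def top_regressions(base, candidate, n=3):
    """Rank kernels by extra time spent in `candidate` relative to `base`,
    so the biggest contributors to a performance gap surface first."""
    kernels = set(base) | set(candidate)
    deltas = {k: candidate.get(k, 0.0) - base.get(k, 0.0) for k in kernels}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Illustrative per-kernel totals (ms) from two profiling runs:
baseline = {"gemm": 120.0, "layernorm": 30.0, "dropout": 10.0}
current = {"gemm": 150.0, "layernorm": 31.0, "dropout": 10.0, "memcpy": 25.0}
print(top_regressions(baseline, current, n=2))
# → [('gemm', 30.0), ('memcpy', 25.0)]
```

Kernels that exist only in the slower run (like the memcpy entry here) show up with their full time as the delta, which is often the first clue to a missing kernel or fusion.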
-
torch profiler
-
Linux perf
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.