[DO NOT MERGE] feat: generate cpu kernels using KA #136

Draft: wants to merge 4 commits into base: main

avik-pal
Member

Currently, this is a performance disaster: locally, I see slowdowns of at least 5-10x. Let's see the numbers on the dedicated benchmarks.

The main pro of this approach is that the maintenance burden goes down significantly. Now, how can we solve the performance problem? (This is probably better off as a KA issue.)

Finer control of CPU backend from KA:

  1. Allowing us to control the number of threads (if any): "[FR] Add nthreads argument to CPU backend" (JuliaGPU/KernelAbstractions.jl#507)
  2. @simd and @simd ivdep loop info, either by default or by supplying it to the backend object: "Make CPU loops simd & ivdep" (JuliaGPU/KernelAbstractions.jl#436)
  3. Alternate threading: KA is being used inside "core" operations. As such, we are unlikely (if ever) to call other operations that make use of threading from inside them. Hence, having the option to use "cheaper threads" (Polyester.jl) would be a great addition.

avik-pal force-pushed the ap/ka_cpu branch 2 times, most recently from addd437 to b57f2a1 (August 20, 2024 01:59)

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.

Project coverage is 68.91%. Comparing base (c185f04) to head (f056a2d).

Files                      Patch %   Lines
src/impl/batchnorm.jl      0.00%     6 Missing ⚠️
src/impl/groupnorm.jl      0.00%     4 Missing ⚠️
src/impl/normalization.jl  50.00%    1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (c185f04) and HEAD (f056a2d).

HEAD has 19 uploads less than BASE
Flag    BASE (c185f04)    HEAD (f056a2d)
        37                18
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #136       +/-   ##
===========================================
- Coverage   83.93%   68.91%   -15.02%     
===========================================
  Files          37       36        -1     
  Lines        1867     1586      -281     
===========================================
- Hits         1567     1093      -474     
- Misses        300      493      +193     
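
The figures in the coverage diff above can be cross-checked with a little arithmetic. This is a minimal sketch, assuming coverage is simply hits divided by lines; `coverage_percent` is a hypothetical helper name, not a Codecov API, and the last digit can differ from the report depending on whether the tool rounds or truncates.

```python
def coverage_percent(hits, lines):
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


# Numbers copied from the coverage diff (main vs. #136).
base_lines, base_hits = 1867, 1567
head_lines, head_hits = 1586, 1093

base = coverage_percent(base_hits, base_lines)
head = coverage_percent(head_hits, head_lines)
delta = head - base

# Misses are the lines that were not hit.
base_misses = base_lines - base_hits
head_misses = head_lines - head_hits

print(f"base {base:.2f}%  head {head:.2f}%  delta {delta:+.2f}")
print(f"misses: base {base_misses}, head {head_misses}")
```

Under that assumption the computed values agree with the reported 83.93%, 68.91% and -15.02 to within 0.01 of a percentage point.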


Contributor

github-actions bot left a comment

LuxLib Benchmarks
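
In the table that follows, the Ratio column appears to be Current divided by Previous, so values above 1 mean this PR is slower than main. Below is a minimal Python sketch for picking the regressions out of the flattened rows. It assumes each row has the shape `<name> <current> ns <previous> ns <ratio>` and that rows without a previous value (the oneAPI entries) are skipped. `parse_row` and `regressions` are hypothetical helper names, not part of LuxLib or the benchmark action.

```python
import re

# Assumed row shape: "<name> <current> ns [<previous> ns <ratio>]".
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<current>[\d.]+) ns"
    r"(?:\s+(?P<previous>[\d.]+) ns\s+(?P<ratio>[\d.]+))?$"
)


def parse_row(line):
    """Return (name, current_ns, previous_ns, ratio), or None for non-data lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    current = float(m["current"])
    previous = float(m["previous"]) if m["previous"] else None
    # Recompute the ratio instead of trusting the printed, rounded value.
    ratio = current / previous if previous else None
    return m["name"], current, previous, ratio


def regressions(lines, threshold=2.0):
    """List (name, ratio) for rows whose Current/Previous ratio is >= threshold."""
    found = []
    for line in lines:
        parsed = parse_row(line)
        if parsed is None:
            continue
        name, _current, _previous, ratio = parsed
        if ratio is not None and ratio >= threshold:
            found.append((name, round(ratio, 2)))
    return found


# Three rows copied from the table.
rows = [
    "batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2939917 ns 58542 ns 50.22",
    "batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36748 ns 37608 ns 0.98",
    "batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 561012.5 ns",
]

for name, ratio in regressions(rows):
    print(f"{ratio:>7.2f}x  {name}")
```

The threshold of 2.0 is arbitrary. With it, only the CPU batchnorm row from the sample is flagged; the CUDA row is within noise and the oneAPI row has no baseline to compare against.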

Benchmark suite Current: f056a2d Previous: c185f04 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7729 ns 6083 ns 1.27
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5584 ns 5417 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7583 ns 8021 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6229.5 ns 6146 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 124269 ns 120417 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2920363 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 865375 ns 812042 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 420534 ns 424375 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 10250 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10375.5 ns 9917 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10729.5 ns 10125 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9917 ns 11792 ns 0.84
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 534855 ns 556460 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17679999 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2599750 ns 2542833 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 688785 ns 686027 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1417 ns 1500 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3187 ns 2792 ns 1.14
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1625 ns 1708.5 ns 0.95
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 2917 ns 1583 ns 1.84
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21391 ns 22218 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1301041.5 ns
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 199625 ns 205792 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31370 ns 29920 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3750 ns 3542 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3834 ns 4209 ns 0.91
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4125 ns 4271 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4187 ns 4229 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 141714.5 ns 148035 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8405798 ns
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1476354 ns 1621188 ns 0.91
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 149951.5 ns 151742 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2939917 ns 58542 ns 50.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1600958.5 ns 46375 ns 34.52
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2726854 ns 46584 ns 58.54
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 4035979 ns 83708 ns 48.21
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36748 ns 37608 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 561012.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1015666 ns 1081917 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 81321 ns 84866 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5227041.5 ns 2027833 ns 2.58
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3104542 ns 2085458 ns 1.49
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4209562.5 ns 2090292 ns 2.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 7866125 ns 1999000 ns 3.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 221458.5 ns 233327.5 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8196194 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8040333 ns 7717583 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1091659 ns 1460226 ns 0.75
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 169375.5 ns 145375 ns 1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 173374.5 ns 147458 ns 1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 157042 ns 150584 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 183708.5 ns 170437.5 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166229.5 ns 166412 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7336842 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1527917 ns 1615604.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 180901 ns 202872 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1109479.5 ns 1119083.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1112791 ns 1109000 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1120229 ns 1118458 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1106354.5 ns 1116145.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 672058.5 ns 707978 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37159528 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6453167 ns 5932000 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 918327 ns 1046946 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4375 ns 5104.5 ns 0.86
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4666.5 ns 4250 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5334 ns 5895.5 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5937.5 ns 5624.5 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 88193 ns 93783.5 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5394921 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 618250 ns 721583.5 ns 0.86
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 70180 ns 70761 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8958 ns 8875 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8792 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9375 ns 9083 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8750 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 579885 ns 603451.5 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 37743853 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5811854 ns 6400917 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388183 ns 388649.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 576020.5 ns 20083 ns 28.68
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 323333 ns 18812.5 ns 17.19
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421833 ns 20958 ns 20.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 871208.5 ns 18000 ns 48.40
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 67844 ns 68784 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3055826 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1366916.5 ns 1334334 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 74841 ns 83861 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 758875 ns 224229.5 ns 3.38
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 461208 ns 219416 ns 2.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 697312.5 ns 219062.5 ns 3.18
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1108021 ns 212958 ns 5.20
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 348707.5 ns 360915.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 12489751.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5797395.5 ns 5929666 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 481369 ns 478315 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 791 ns 709 ns 1.12
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 917 ns 1041 ns 0.88
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20510 ns 21396 ns 0.96
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1150889.5 ns
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 286334 ns 303750 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 33230 ns 32981 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1459 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1625 ns 1542 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1542 ns 0.89
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 122441.5 ns 127634 ns 0.96
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8825165 ns
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1492875 ns 1626542 ns 0.92
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 127486 ns 138112 ns 0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 414458 ns 7333 ns 56.52
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 239375 ns 6125 ns 39.08
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 502000 ns 6125 ns 81.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 550250 ns 10333 ns 53.25
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23668 ns 24384 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1312160 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 596958 ns 700271 ns 0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47410 ns 46841 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 723229 ns 221166 ns 3.27
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 454209 ns 238834 ns 1.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 857167 ns 230666 ns 3.72
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1019917 ns 251250 ns 4.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 189502 ns 193817 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31508207 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8962979 ns 8912375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 612555 ns 653712 ns 0.94
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4084 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4083 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 24005 ns 24189 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2002725 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 218459 ns 223791 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48990 ns 49151 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16833 ns 16584 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17000 ns 16917 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17375 ns 17042 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17000 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 190469.5 ns 199158 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10178467 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 1025083 ns 963270.5 ns 1.06
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 180751 ns 176322 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 511729.5 ns 512792 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405667 ns 404292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 406833 ns 404896 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865750 ns 864583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113107.5 ns 113852 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 398260 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 419334 ns 448709 ns 0.93
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 249302 ns 250173 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2320167 ns 2271145.5 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2032666 ns 2031292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2032750 ns 2033750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3296979.5 ns 3280292 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 235772 ns 247459 ns 0.95
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 9316032 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1906146 ns 2065875 ns 0.92
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 763141 ns 765823 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6438 ns 7145.5 ns 0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7459 ns 6958.5 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7500 ns 8541 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7625.5 ns 6479.5 ns 1.18
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 91012.5 ns 93682.5 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5337612 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 766083 ns 806084 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60811 ns 68781 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11250 ns 11708.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12500 ns 11875 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12375 ns 11000 ns 1.13
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11750 ns 12020.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 635802.5 ns 642017 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 36545961 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5443750 ns 5707875 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 410863.5 ns 421135 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23646 ns 24054 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2173927 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 318375 ns 228333 ns 1.39
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 51881 ns 54330 ns 0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2250 ns 2167 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 227819 ns 237805 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 10657257.5 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1962541.5 ns 1998833 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 181706.5 ns 190172 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 36458 ns 9333.5 ns 3.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 37208.5 ns 9104 ns 4.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 68375 ns 10521 ns 6.50
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 33583 ns 8959 ns 3.75
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 110413 ns 113550 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3069971.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 782958.5 ns 875353.5 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 74930 ns 78760 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 61834 ns 16729.5 ns 3.70
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 60417 ns 18250 ns 3.31
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 63250 ns 18104 ns 3.49
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 60792 ns 18458 ns 3.29
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 631585 ns 643636 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 16329413.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 4518292 ns 5156541 ns 0.88
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 387253 ns 396545 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 30375 ns 500 ns 60.75
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 27958 ns 459 ns 60.91
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 28000 ns 625 ns 44.80
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 27333 ns 500 ns 54.67
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35255 ns 35808 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1145044 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 274042 ns 323000 ns 0.85
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 46671 ns 46571 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 45666.5 ns 10375 ns 4.40
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 41458 ns 9791.5 ns 4.23
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 43667 ns 10375 ns 4.21
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 42083.5 ns 10750 ns 3.91
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 261124 ns 262020 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 17726081.5 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4572625 ns 5294125 ns 0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 377973 ns 382009.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397125 ns 399000 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288083 ns 288125 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287792 ns 288292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756250 ns 755625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111029 ns 113561 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 330051 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 364417 ns 367729.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 77050 ns 77481 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1457750.5 ns 1393333 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1134000 ns 1136083.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133146 ns 1131458.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2441375 ns 2438041 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 202543.5 ns 212129 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 10206496 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1612958 ns 1596167 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 327223 ns 329854 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7229 ns 7708 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7125 ns 7458.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8250 ns 9000 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7375 ns 7812 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 151960.5 ns 159498.5 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5735314.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 664375 ns 481750 ns 1.38
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 60471 ns 60340 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16083 ns 14667 ns 1.10
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15854 ns 15437.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16208 ns 15479.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 11666 ns 14979 ns 0.78
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 984433 ns 1030852 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41610226 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5845542 ns 6424458 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 438113.5 ns 435905 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25625 ns 26958 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 28875 ns 25209 ns 1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28583.5 ns 27208 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 28791 ns 24584 ns 1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 224317.5 ns 228128 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7495067 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 784833 ns 1045041.5 ns 0.75
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 115131 ns 120221 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 151104 ns 103791 ns 1.46
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 151833 ns 150833 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 108020.5 ns 148187 ns 0.73
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 113208 ns 116292 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1188662.5 ns 1163495 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42135085 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5881500 ns 6459417 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 596435 ns 607082 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 80312 ns 76416 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77791 ns 81020.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79583 ns 85083 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 82062 ns 79625 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 233891 ns 234622 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7191541 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 526958 ns 628124.5 ns 0.84
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 127251 ns 127432 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 277958 ns 283166.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 295125 ns 316541 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 301229.5 ns 302917 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 274292 ns 315041.5 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1226024.5 ns 1204655 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 40089795 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6466583 ns 6660083 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 696586 ns 700322.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16333 ns 16708.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17375 ns 17333 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 17708.5 ns 17854.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16854.5 ns 16479 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 164956.5 ns 167006 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5631236 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 439750 ns 446708 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239672 ns 239982 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27208 ns 26125 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25604 ns 26917 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28833 ns 27208 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27708 ns 25333 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 1024329 ns 1047898 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 44101221 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5779875 ns 6661333.5 ns 0.87
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 705756 ns 718328 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 46000 ns 11084 ns 4.15
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 47166 ns 11625 ns 4.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 50208 ns 13000 ns 3.86
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 44396.5 ns 11666.5 ns 3.81
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 138159.5 ns 141188 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3525333.5 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 785667 ns 897333.5 ns 0.88
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 241372 ns 243182.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 66208 ns 22145.5 ns 2.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 62437.5 ns 21875 ns 2.85
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 66708.5 ns 22667 ns 2.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 62584 ns 21792 ns 2.87
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 740432.5 ns 756695 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 22275014 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5147542 ns 5374500 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 685850 ns 695018 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 595667 ns 63937.5 ns 9.32
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 351042 ns 63500 ns 5.53
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 434375 ns 66042 ns 6.58
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 875667 ns 63666.5 ns 13.75
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 121655 ns 124307.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3306959 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1367291 ns 1367917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 228942 ns 241283 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 1040166 ns 437854 ns 2.38
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 754187.5 ns 464833 ns 1.62
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 938959 ns 474208 ns 1.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1319167 ns 437729.5 ns 3.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 552090.5 ns 560487 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21939037.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6260750 ns 6247083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 718076 ns 733228 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7667 ns 7104.5 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7770.5 ns 7083 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7917 ns 8334 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7334 ns 7604 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 159948.5 ns 163142 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5546005 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 437729 ns 463833.5 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59871 ns 68371 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14958 ns 14542 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15708 ns 15396 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17042 ns 15458.5 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13500 ns 14750 ns 0.92
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1010082 ns 1022438 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38267417 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5491083.5 ns 6461041 ns 0.85
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 405844 ns 412334 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6162395.5 ns 6159375 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6376146 ns 6372249.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6367229 ns 6374125 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11918000 ns 11910167 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301585.5 ns 302029 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 295713 ns 302953 ns 0.98
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19131208.5 ns 19119687 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19953937.5 ns 19945437.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19969917 ns 20008771 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36476917 ns 36510208.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1015986 ns 1019652 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1169730 ns 1173152.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 959 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 917 ns 958 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 959 ns 959 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 917 ns 959 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23705 ns 23843 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2142526 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 215416.5 ns 335916 ns 0.64
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 215342 ns 215882 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3625 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3708 ns 3666 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 293053.5 ns 300289 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11377670 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2046437 ns 2148500 ns 0.95
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 646355 ns 644731.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 17479 ns 8334 ns 2.10
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 17041 ns 8104 ns 2.10
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 24583 ns 9750 ns 2.52
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 17417 ns 8500 ns 2.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 134611 ns 137456 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3320553.5 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 791583 ns 796375 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67721 ns 68311 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 17646 ns 11666 ns 1.51
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 19250 ns 12083 ns 1.59
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 20270.5 ns 12583 ns 1.61
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 18625 ns 12750 ns 1.46
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 703363 ns 721292 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 20902978.5 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5212209 ns 5345750 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 362123 ns 373344 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23131 ns 23235 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2015510 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 315458.5 ns 226791 ns 1.39
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 54431 ns 51721 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2917 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 3083 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3417 ns 3250 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3167 ns 2834 ns 1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 207060.5 ns 216259.5 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9314773 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1563646 ns 1692958 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 173121.5 ns 161612 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 47041.5 ns 11625 ns 4.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 46250 ns 11229.5 ns 4.12
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 49562.5 ns 13250 ns 3.74
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 46000 ns 12166 ns 3.78
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 135869.5 ns 139967.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3451740 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 845208 ns 892584 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 240542 ns 243863 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 62917 ns 21083 ns 2.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 58750 ns 20396 ns 2.88
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 61708 ns 26062.5 ns 2.37
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 58625 ns 21604 ns 2.71
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 635101 ns 652418 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19617768 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4454000 ns 4821708.5 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 654596 ns 672612 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4416 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4458 ns 4416 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4458 ns 4333 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24739 ns 24831 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2281052 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 220917 ns 223938 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 52320 ns 52890.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16750 ns 16333.5 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16625 ns 16750 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16834 ns 16583 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16667 ns 16625 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 347277.5 ns 356581 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12189793 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1125021 ns 1752937.5 ns 0.64
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 210311.5 ns 210052 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 31041 ns 1958 ns 15.85
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 31625 ns 1917 ns 16.50
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 32209 ns 2166 ns 14.87
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 30792 ns 2084 ns 14.78
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36064 ns 36754.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1217452 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 274750 ns 299041 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 208381 ns 208032 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 51854 ns 16958.5 ns 3.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 51125 ns 19042 ns 2.68
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 52291.5 ns 17458 ns 3.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 51042 ns 18062.5 ns 2.83
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 299796 ns 307642 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20632540.5 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4717208 ns 5677458.5 ns 0.83
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 702336 ns 709468 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 60270.5 ns 59125 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65125 ns 66208 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65603.5 ns 66083.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 53812.5 ns 51334 ns 1.05
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66512 ns 66592 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 95131 ns 113701 ns 0.84
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 190937.5 ns 210458 ns 0.91
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 164000.5 ns 143000 ns 1.15
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 160125 ns 119583 ns 1.34
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 317979 ns 307688 ns 1.03
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 229832 ns 234156 ns 0.98
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 581689.5 ns 598956 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 4059916 ns 123833.5 ns 32.79
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2016750 ns 123125 ns 16.38
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1840083 ns 86500 ns 21.27
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 6874334 ns 82958 ns 82.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192888 ns 190129 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5436632 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2069896 ns 1825667 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 170621.5 ns 188412 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5660770.5 ns 1927375 ns 2.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3272042 ns 1909416.5 ns 1.71
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3180250 ns 1906875 ns 1.67
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 8897583.5 ns 1931021 ns 4.61
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 567485 ns 578778.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25370262 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9213458 ns 9303959 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1081814 ns 1081141.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 291 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21777 ns 22349 ns 0.97
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2224619 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 347709 ns 372291 ns 0.93
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 45540 ns 45590 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 262738 ns 272164 ns 0.97
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9744251 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1535584 ns 1469500 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 184041.5 ns 187152 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 16917 ns 9250 ns 1.83
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 16458 ns 8708 ns 1.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 20750 ns 11166 ns 1.86
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 19354 ns 10459 ns 1.85
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 131285 ns 134628.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3471620.5 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 840250 ns 897749.5 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 238722 ns 241763 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 14270.5 ns 10125 ns 1.41
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 15583 ns 8458.5 ns 1.84
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 16000 ns 14375 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 15125 ns 9542 ns 1.59
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 572024 ns 584537 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20186707 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4247687.5 ns 4632562 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 633925 ns 645752 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3974792 ns 58375 ns 68.09
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1894375 ns 46625 ns 40.63
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1739604 ns 46708 ns 37.24
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 6779583 ns 82000 ns 82.68
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39793 ns 40806 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1345357.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1158875 ns 1140854.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80431 ns 78371 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5226750 ns 1934584 ns 2.70
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3143125.5 ns 1981708 ns 1.59
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3775208 ns 1989334 ns 1.90
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 8311708 ns 1899750 ns 4.38
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 234142.5 ns 239556 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30862071 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11350417 ns 11301583 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1021228 ns 1030691 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 419521 ns 422125 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 418166 ns 417583 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 422375 ns 419750 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 417708.5 ns 416292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 236927 ns 241184 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7988410.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 527104.5 ns 546083 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 288373 ns 289943 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 677458 ns 752875.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 681312.5 ns 755666 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 760583 ns 675729 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 747563 ns 760021 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1135579.5 ns 1151706 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46261240 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6579583 ns 6939708 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 927168 ns 927380 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3447833 ns 3457437.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3420292 ns 3437021 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3413833 ns 3434709 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3462042 ns 3439146 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 174701 ns 201324 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8218002.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1372271 ns 1424084 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 436313 ns 412665 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6182625.5 ns 6238000 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6194250 ns 6200250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6217479.5 ns 6194458 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6188333 ns 6143770.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1072717 ns 1091727.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50181237 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7350271 ns 8063541.5 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1561243 ns 1569386 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 472292 ns 473666 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 341917 ns 340792 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 340625 ns 342166 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 903292 ns 905125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46946.5 ns 46953 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 390354 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 413792 ns 496959 ns 0.83
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 250932 ns 251203 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2324958 ns 2275334 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2037313 ns 2043625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2038958.5 ns 2032437 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3288312.5 ns 3282416.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 254014 ns 283225 ns 0.90
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 8472255 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2170416.5 ns 2237145.5 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 791641 ns 791808 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2963708 ns 57833 ns 51.25
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1577375 ns 45958 ns 34.32
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2716541 ns 46250 ns 58.74
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 4279458 ns 82792 ns 51.69
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28036 ns 28918 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1033104 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1161396 ns 1145250 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75880 ns 78811 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5369084 ns 2000229 ns 2.68
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3250416.5 ns 2089833 ns 1.56
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4448458.5 ns 2077250 ns 2.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 8412583 ns 1980437.5 ns 4.25
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 237966 ns 244212 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37959803 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11555333 ns 11407979 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1047409 ns 1055251 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3974229 ns 58000 ns 68.52
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1921167 ns 46250 ns 41.54
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1742209 ns 46666 ns 37.33
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 6685833.5 ns 83041 ns 80.51
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 49792 ns 50656 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 818474 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1117792 ns 1123000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78246 ns 73121 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5155834 ns 1903916 ns 2.71
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3043500 ns 1902541 ns 1.60
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3837667 ns 1978250 ns 1.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 7749958 ns 1902959 ns 4.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 244579 ns 251664 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 18235453 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10215937.5 ns 9794437.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 927928 ns 936124.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4333 ns 333 ns 13.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4209 ns 292 ns 14.41
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4834 ns 416 ns 11.62
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3875 ns 292 ns 13.27
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35010 ns 35119.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1256608 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 281667 ns 308104.5 ns 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48850 ns 50550 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 13250 ns 7937.5 ns 1.67
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 13375 ns 7625 ns 1.75
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 15500 ns 7625 ns 2.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 12459 ns 8167 ns 1.53
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 206263 ns 218323.5 ns 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20165898 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4657770.5 ns 4836354 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 373924 ns 381674 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31771 ns 33417 ns 0.95
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1188142 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 255291.5 ns 259375 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 39421 ns 43851 ns 0.90
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 2792 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3375 ns 2875 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3459 ns 2916 ns 1.19
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2667 ns 2667 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 195647.5 ns 205231.5 ns 0.95
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7917427.5 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1054125 ns 1294875 ns 0.81
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 152102 ns 166746 ns 0.91
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 4033875 ns 437042 ns 9.23
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2203417 ns 422021 ns 5.22
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1936229 ns 424229 ns 4.56
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 7212459 ns 425834 ns 16.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 139985 ns 142985.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6197825 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2258375 ns 2238375 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 355557.5 ns 375684 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 7756667 ns 3809770.5 ns 2.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5382333 ns 3802375 ns 1.42
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5281167 ns 3804250 ns 1.39
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 11556729.5 ns 3793125 ns 3.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 770872 ns 782254 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32375322 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10991875 ns 11146187.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1497828 ns 1312364 ns 1.14
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49915312.5 ns 49907416.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35517417 ns 35559584 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35546604.5 ns 35529250 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 98239417 ns 96899084 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1611413 ns 1625871 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1014223.5 ns 1003290 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154593958 ns 154966354 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112413083.5 ns 112363000 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112280959 ns 112555750 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 300055687.5 ns 296527604.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6477347 ns 6450345 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5540958 ns 5530212.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 20083 ns 19374.5 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 18562.5 ns 18750 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 16041 ns 17353.5 ns 0.92
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15375 ns 15188 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20323 ns 20779 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1119679 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 223979.5 ns 224333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26100 ns 26660 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10667 ns 10917 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 8854.5 ns 8834 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9083 ns 9291 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17708 ns 17291 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 290480 ns 299343 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9939862 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1566792 ns 1655375 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 151932 ns 155331 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 16500 ns 8312.5 ns 1.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 17583 ns 8459 ns 2.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 20083 ns 10895.5 ns 1.84
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 16500 ns 9312.5 ns 1.77
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 129278.5 ns 142637 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3610260.5 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 796375.5 ns 798083 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 238692 ns 241143 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 15834 ns 10333.5 ns 1.53
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 15416 ns 9042 ns 1.70
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 17687.5 ns 9583 ns 1.85
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 15375 ns 8937.5 ns 1.72
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 686867 ns 705801.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23595623.5 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 4793187 ns 5435917 ns 0.88
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 666035 ns 657647 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 36458.5 ns 9020.5 ns 4.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 38333 ns 10229 ns 3.75
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 74333.5 ns 11250 ns 6.61
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 37458.5 ns 9792 ns 3.83
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 133913.5 ns 137059 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3390967 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 833187.5 ns 882166.5 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 75120 ns 78120 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 54083 ns 13020.5 ns 4.15
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 50583 ns 12583.5 ns 4.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 54750 ns 13583 ns 4.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 50854 ns 13458 ns 3.78
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 639697.5 ns 651470.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19609315 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4485208.5 ns 4779312.5 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 346863 ns 356033 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4208 ns 459 ns 9.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4250 ns 458 ns 9.28
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4834 ns 625 ns 7.73
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4083 ns 459 ns 8.90
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34711 ns 35430 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1210488 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 275854.5 ns 385417 ns 0.72
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 208282 ns 210072 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 14146 ns 8166 ns 1.73
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 13937.5 ns 8000 ns 1.74
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 15666 ns 8937.5 ns 1.75
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 13708 ns 8208 ns 1.67
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 230721 ns 238141 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22496798.5 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4937375 ns 5550500 ns 0.89
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 666285.5 ns 670717 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 17541.5 ns 16417 ns 1.07
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 16250 ns 16709 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 14521 ns 15209 ns 0.95
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 11125 ns 10312.5 ns 1.08
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21465 ns 21707 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1152895.5 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 208042 ns 217458 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 188121 ns 194532 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32062.5 ns 31854.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32417 ns 32167 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32208 ns 32250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32395.5 ns 32125 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 306637 ns 316460 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11224318 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1690625 ns 1889916 ns 0.89
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 606015 ns 608847 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3934979 ns 450417 ns 8.74
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2217312.5 ns 482813 ns 4.59
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2342145.5 ns 444604 ns 5.27
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 6834708 ns 440875 ns 15.50
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194727 ns 193879 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5990269 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2080250 ns 2124500 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 352143 ns 376794 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 7550250 ns 3673458 ns 2.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5240834 ns 3802062.5 ns 1.38
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5157500 ns 3822709 ns 1.35
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 10890000 ns 3821333 ns 2.85
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 578849 ns 588897 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29191067 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10413125 ns 9577042 ns 1.09
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1218310 ns 1393435 ns 0.87
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 784817521 ns 783185125 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 542901125 ns 542907542 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 544360541 ns 543132625 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1524372458 ns 1514951833.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22761020.5 ns 22763713 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 13995220 ns 14159478.5 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2525053917 ns 2527739209 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 3158105667 ns 1799023667 ns 1.76
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1790903166 ns 1787795417 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4813364000 ns 4787274417 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 374548499 ns 333649192 ns 1.12
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88056302 ns 88087394 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77334 ns 76666.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76667 ns 79083 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78916.5 ns 79375 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 77375 ns 78124.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 235013.5 ns 238895.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7866147 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 524979 ns 542209 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 109721 ns 111271 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 280208.5 ns 277000 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 293084 ns 278895.5 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 285646 ns 194979 ns 1.47
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 192125 ns 259250 ns 0.74
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1120762 ns 1134646.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47294846.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6215791 ns 6160709 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 643071 ns 645127 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199524833.5 ns 199977437.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139211333 ns 139216750 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139220750 ns 139454459 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 389521625 ns 389873250 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5844935.5 ns 5849131.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3420559 ns 3425810.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619399375 ns 621409333 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 439802250 ns 440537375 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439385416.5 ns 440145604 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1198327959 ns 1186223625 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26665852 ns 26711378 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21821827 ns 21741902 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 553792 ns 7291 ns 75.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 280542 ns 6084 ns 46.11
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 390937.5 ns 6291 ns 62.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 847750 ns 10292 ns 82.37
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27899 ns 28202.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1268229 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 559687.5 ns 601583 ns 0.93
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48850 ns 48405.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 708542 ns 220749.5 ns 3.21
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 455333 ns 222374.5 ns 2.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 890499.5 ns 222542 ns 4.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1003292 ns 217625 ns 4.61
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 237988 ns 245623 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32870826 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8991520.5 ns 8971334 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 532595 ns 543906 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 17521 ns 8145.5 ns 2.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 16791.5 ns 10083 ns 1.67
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 23145.5 ns 10833.5 ns 2.14
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 20334 ns 10000.5 ns 2.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 130687 ns 136003.5 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3317034 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 863959 ns 906833 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 72930 ns 72945.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 13104.5 ns 7500 ns 1.75
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 13000 ns 7209 ns 1.80
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 14667 ns 8292 ns 1.77
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 13416 ns 7500 ns 1.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 569833.5 ns 587405 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20622054.5 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4421166.5 ns 4757959 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 320658 ns 326203 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 30500 ns 500 ns 61.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 27459 ns 458 ns 59.95
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 28375 ns 542 ns 52.35
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 27666 ns 375 ns 73.78
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26372 ns 26999 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1201158 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 459833 ns 493458.5 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 49171 ns 49231 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 46604.5 ns 9458 ns 4.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 44708 ns 10250 ns 4.36
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 46374.5 ns 10521.5 ns 4.41
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 44666.5 ns 10125 ns 4.41
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 267493.5 ns 275766.5 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23479888 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5496458 ns 6076395.5 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 392323 ns 401444 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 108437.5 ns 107104.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 99500 ns 99896 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 99917 ns 101145.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146520.5 ns 146459 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24197 ns 24813 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1207580 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 260875 ns 277416.5 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190092 ns 192192 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 478708 ns 479500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 523083 ns 494084 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 497334 ns 478958 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 479062.5 ns 528667 ns 0.91
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 248813 ns 258431 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11991369.5 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2125229 ns 2276458 ns 0.93
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 622225 ns 624467 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5562.5 ns 5750.5 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 5917 ns 6917 ns 0.86
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7479 ns 6833.5 ns 1.09
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4334 ns 4458 ns 0.97
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16547 ns 18139 ns 0.91
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 72491 ns 73231 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11687.5 ns 11854 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10584 ns 10500.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11187.5 ns 11104.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16541 ns 17083 ns 0.97
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 231345 ns 235890 ns 0.98
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 377104 ns 372074 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 40000 ns 38750 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 52437 ns 51292 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52125.5 ns 52729.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13791 ns 15834 ns 0.87
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20260 ns 20456 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 79691 ns 87011 ns 0.92
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 41500 ns 36875 ns 1.13
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 30416.5 ns 34729 ns 0.88
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 31125 ns 32167 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57292 ns 57000 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 208650 ns 212876 ns 0.98
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 403134 ns 418835 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1791 ns 1791 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1709 ns 1708 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2104.5 ns 2187.5 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 2125 ns 1875 ns 1.13
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20031.5 ns 20570.5 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1108740 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 302250 ns 329917 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 28990 ns 31020 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2292 ns 2209 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2166.5 ns 2250 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2333 ns 2500 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2208 ns 2208 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 218720 ns 226270.5 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9455550 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1546125 ns 1683458.5 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 139222 ns 142136.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5583.5 ns 5042 ns 1.11
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4646 ns 4500 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5583 ns 6208.5 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6417 ns 4666.5 ns 1.38
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 159519 ns 163224.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5766275 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 443417 ns 800792 ns 0.55
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 60471 ns 75611 ns 0.80
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 8291 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8083 ns 8209 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8750 ns 8583 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8250 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 940829 ns 960930 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 39155504 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5477667 ns 5752708 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 385303 ns 398144 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 549708 ns 56791 ns 9.68
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 313333 ns 57459 ns 5.45
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 420167 ns 57667 ns 7.29
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 879208 ns 58208 ns 15.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37342 ns 38436 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1173763 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 356041.5 ns 411813 ns 0.86
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 207582 ns 218852 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 965958 ns 448812.5 ns 2.15
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 690250 ns 499084 ns 1.38
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 979708 ns 465709 ns 2.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1175375 ns 481396 ns 2.44
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 273355 ns 282356.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26028179 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8107333.5 ns 7964500 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 800411.5 ns 842729 ns 0.95
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3334104.5 ns 3322916 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2338854.5 ns 2338771 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2337104 ns 2339375 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6329958 ns 6304166.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205352 ns 204545 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 205572 ns 202912 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11443250 ns 11552375 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8304750.5 ns 8313541.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8298563 ns 8336875 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21095145.5 ns 21101437.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 735485 ns 734673 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1063824.5 ns 1078791.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4854.5 ns 6166 ns 0.79
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5583 ns 4916.5 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6291 ns 6541 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6791.5 ns 4875 ns 1.39
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 152797.5 ns 158133 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5417356 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 774375 ns 887167 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56590 ns 57035.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7166 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7041.5 ns 7209 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 7292 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7459 ns 7083 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 789393 ns 816855 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 34656454 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5257437.5 ns 6166979.5 ns 0.85
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 378563 ns 384744.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3119334 ns 123458 ns 25.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1745083 ns 131229 ns 13.30
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2891458.5 ns 100000 ns 28.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 4560645.5 ns 94625 ns 48.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 159047 ns 160516.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6050883 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2930958 ns 2207458 ns 1.33
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185181 ns 187112 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6113166 ns 1964000 ns 3.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3539750 ns 2023146 ns 1.75
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3525062.5 ns 2028667 ns 1.74
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 9720709 ns 2018916.5 ns 4.81
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 771116 ns 789517 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33201774 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10973750 ns 11417250 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1117479.5 ns 1260093 ns 0.89
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 34146 ns 33813 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36146 ns 36729 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 34479.5 ns 34708.5 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 708 ns 667 ns 1.06
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15370 ns 15818 ns 0.97
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 80930 ns 82161 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2750 ns 2583 ns 1.06
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2792 ns 2709 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3167 ns 2959 ns 1.07
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2229.5 ns 2125 ns 1.05
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 146666.5 ns 152979.5 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 353043 ns 352884 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 559458 ns 7250 ns 77.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 282458 ns 6042 ns 46.75
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 394500 ns 6125 ns 64.41
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 826875 ns 9958 ns 83.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36672 ns 37656 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1162707 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 658604.5 ns 431042 ns 1.53
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49520 ns 49591 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 696041 ns 214000 ns 3.25
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 435812 ns 232937.5 ns 1.87
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 870770.5 ns 221834 ns 3.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 985604.5 ns 232000 ns 4.25
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 252602.5 ns 258714 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26336772 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8008125 ns 7857271 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 519460 ns 526085 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 4000 ns 3917 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 4000 ns 3958 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22005 ns 22767 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2135206 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 242458 ns 244500 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 46161 ns 47941 ns 0.96
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15042 ns 14667 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14958 ns 15000 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15125 ns 14959 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14917 ns 14959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 334532.5 ns 344878 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11453292 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 997792 ns 1074437.5 ns 0.93
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 192172 ns 201792 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 4158541.5 ns 120021 ns 34.65
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2056417 ns 98958.5 ns 20.78
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1881542 ns 104666.5 ns 17.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 7039958 ns 144250 ns 48.80
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 148404 ns 160419 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5796388 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2225209 ns 2228291 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 184542 ns 170682 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5874083 ns 1891375 ns 3.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3428459 ns 1833541.5 ns 1.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3434291 ns 1894375 ns 1.81
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 9513125 ns 1924667 ns 4.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 754215 ns 772105.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30850166 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10913312.5 ns 10866208 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1087499 ns 1240333 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 467937.5 ns 20250 ns 23.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 278209 ns 18937.5 ns 14.69
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 528667 ns 20542 ns 25.74
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 599187.5 ns 20208 ns 29.65
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 124428.5 ns 127944 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3459386 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1420625 ns 1385750 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81731 ns 82111 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 771479 ns 216708 ns 3.56
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 473208 ns 255583 ns 1.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 708938 ns 218146 ns 3.25
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1080250 ns 217458 ns 4.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 558260 ns 580859 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19290584 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6248458 ns 6240292 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 474794.5 ns 484605 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 25104 ns 25687 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 31666 ns 31687.5 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28333 ns 29145.5 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1291.5 ns 1541 ns 0.84
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16491 ns 17059 ns 0.97
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82631 ns 83471 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4708 ns 4896 ns 0.96
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 4854.5 ns 4687.5 ns 1.04
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5542 ns 5208 ns 1.06
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4708 ns 4708 ns 1
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 224621.5 ns 231729 ns 0.97
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 385044 ns 400815 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 304124.5 ns 304916 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 304917 ns 307083.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 306208 ns 310250 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 309792 ns 307458 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 256784 ns 260954.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8317280 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1129124.5 ns 1003667 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 277373 ns 282392 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 585104 ns 530417 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 529541.5 ns 536417 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 543124.5 ns 533416.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 529708.5 ns 540917 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1169406 ns 1194615.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44300140 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6115584 ns 6650583.5 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 867787 ns 886938 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 597416.5 ns 19292 ns 30.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 329083.5 ns 20437.5 ns 16.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 410917 ns 21583 ns 19.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 891521 ns 19250 ns 46.31
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 128708.5 ns 134679 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3777831.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1542125 ns 1513292 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76581 ns 76825.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 800041 ns 215083 ns 3.72
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 475291.5 ns 212625 ns 2.24
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 701041.5 ns 215021 ns 3.26
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1201209 ns 249312.5 ns 4.82
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 853223 ns 889532 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24974069 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7559729 ns 7210062.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 541885 ns 554056 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6312.5 ns 6583 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6666 ns 6937.5 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7875 ns 9208 ns 0.86
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7875 ns 6792 ns 1.16
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 155050 ns 160487 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5878771.5 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 786750 ns 869792 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 69765.5 ns 69890 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10187.5 ns 10000 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9958 ns 9854.5 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11645.5 ns 10292 ns 1.13
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9791.5 ns 10375 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 870241 ns 896806 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37856106 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5278479 ns 5937375 ns 0.89
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 388448.5 ns 398234 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4500 ns 4125 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5042 ns 5542 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6000 ns 6750 ns 0.89
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6729.5 ns 4750 ns 1.42
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 158522 ns 162556 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5543448 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 772000 ns 844750 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 62160 ns 62561 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 7500 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 7250 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8312.5 ns 7667 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7541 ns 7166 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 816695 ns 844691 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 38779263 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5543708.5 ns 5794250.5 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 392668.5 ns 401898.5 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14518750 ns 14528708 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10129624.5 ns 10144083 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10127042 ns 10119791 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27754541 ns 27783209 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 528414 ns 561716 ns 0.94
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 390163 ns 405538.5 ns 0.96
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46474417 ns 46624812 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33475041.5 ns 33411666.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33403292 ns 33562500 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85360084 ns 85401583 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2627902 ns 2800168 ns 0.94
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3282258.5 ns 3289235 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 610750 ns 66500 ns 9.18
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 374458 ns 68375 ns 5.48
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 440146 ns 68875 ns 6.39
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 974562.5 ns 67250 ns 14.49
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 136474.5 ns 138855.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3590577 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1546353.5 ns 1526666.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 228622.5 ns 238492 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 1080542 ns 444500.5 ns 2.43
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 751042 ns 442146 ns 1.70
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 968354 ns 441583 ns 2.19
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1376125 ns 493750 ns 2.79
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 789159.5 ns 807637.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26131425 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7922542 ns 7704542 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 799502 ns 803267.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 22333 ns 542 ns 41.20
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 23375 ns 625 ns 37.40
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 49958 ns 666 ns 75.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 18041 ns 500 ns 36.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32198 ns 33435 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1217074 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 290584 ns 422917 ns 0.69
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49801 ns 52200 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 45042 ns 9458.5 ns 4.76
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 40959 ns 10771 ns 3.80
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 43250 ns 10416.5 ns 4.15
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 41083.5 ns 9666.5 ns 4.25
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 294520 ns 303460.5 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21753846 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4830770.5 ns 5666958.5 ns 0.85
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 387483 ns 397994 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9834 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9834 ns 9916 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9875 ns 9792 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9875 ns 9792 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23374 ns 24011 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2073950.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 218625 ns 225541 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216122 ns 218962 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46334 ns 46000 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45833 ns 46167 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46458 ns 46416 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46166 ns 46334 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 301086 ns 315869 ns 0.95
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11347035 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 963208 ns 1098270.5 ns 0.88
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 625096 ns 628475.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 551625 ns 56250 ns 9.81
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 317875 ns 57208 ns 5.56
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 413916 ns 57167 ns 7.24
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 884791.5 ns 57833 ns 15.30
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28786 ns 29662 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1291553 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 647917 ns 616041 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206132 ns 218842 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 995541.5 ns 450333 ns 2.21
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 719479.5 ns 473958 ns 1.52
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 1170021 ns 468792 ns 2.50
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1244792 ns 442709 ns 2.81
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 252769 ns 260564.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33932766 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9473042 ns 9323750 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 845987 ns 849638 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 614625 ns 607437.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 650709 ns 677167 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 649833 ns 619062.5 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 642250 ns 645083.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 227111 ns 227369 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8674519 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1360563 ns 1393791.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 249152.5 ns 251853 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2246395.5 ns 2229542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2228000 ns 2242667 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2229937.5 ns 2238417 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2234687.5 ns 2233500 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1048698.5 ns 1055691 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48260078.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7344750 ns 7106083 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1381842 ns 1380353 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 501104.5 ns 20396 ns 24.57
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 291562.5 ns 20625 ns 14.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 528291.5 ns 21333.5 ns 24.76
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 607896 ns 23708 ns 25.64
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 127286.5 ns 128483 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3562244 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1528187.5 ns 1530250 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76185.5 ns 82281 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 802209 ns 219229.5 ns 3.66
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 495791.5 ns 223875 ns 2.21
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 717333.5 ns 221083 ns 3.24
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 1168708 ns 219917 ns 5.31
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 826637 ns 851484 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27600838 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7736667 ns 7710292 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 564375 ns 562290 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 23167 ns 500 ns 46.33
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 22709 ns 584 ns 38.89
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 49750 ns 625 ns 79.60
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 19792 ns 500 ns 39.58
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23038 ns 23568 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1182719 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 323125 ns 453729.5 ns 0.71
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 50411 ns 50170 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 46875 ns 10937.5 ns 4.29
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 44417 ns 10479 ns 4.24
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 46979 ns 11166 ns 4.21
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 44541.5 ns 10208 ns 4.36
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 273384 ns 278881.5 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 23625982 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5366166 ns 6153209 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 411123 ns 418644 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 19687.5 ns 10416.5 ns 1.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 18500 ns 9500 ns 1.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 23937 ns 11000 ns 2.18
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 21083.5 ns 8958 ns 2.35
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 134758.5 ns 137213 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3415558 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 867333 ns 886500 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 67810 ns 74561 ns 0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 13333 ns 7458 ns 1.79
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 12937.5 ns 7750 ns 1.67
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 14500 ns 8208 ns 1.77
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 13459 ns 7541 ns 1.78
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 538019 ns 553485 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 18934516.5 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 3936125 ns 4191417 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 335303 ns 340023 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1666.5 ns 1791.5 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1500 ns 1625 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 2083 ns 0.88
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1416.5 ns 1583 ns 0.89
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20929 ns 21340 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1201516 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 293750 ns 310625 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 192642 ns 192401.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3395.5 ns 3292 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3375 ns 3375 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3500 ns 3583 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3500 ns 3375 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 235765.5 ns 244685 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11074080 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1582208 ns 1830688 ns 0.86
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 596540 ns 598576 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148250 ns 148667 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 129854 ns 128833 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 128895.5 ns 129604 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225979.5 ns 225042 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24092.5 ns 24647 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1235643 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 292042 ns 278416 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36870 ns 37400 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 158833 ns 143709 ns 1.11
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 124333 ns 124625 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 124000 ns 110395.5 ns 1.12
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 250771 ns 287812.5 ns 0.87
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 236446 ns 242298 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 11052857 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2005854 ns 2059479 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 225482 ns 238587 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 429084 ns 7125 ns 60.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 237167 ns 6000 ns 39.53
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 498750 ns 6000 ns 83.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 488542 ns 10062.5 ns 48.55
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32663 ns 33200 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1216110 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 339854 ns 358750 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50830 ns 52880 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 700417 ns 220291 ns 3.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 436375 ns 231500 ns 1.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 888520.5 ns 229125 ns 3.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 954833 ns 245229.5 ns 3.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 267203 ns 272719 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29188160 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8201709 ns 8345291.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 526085 ns 536095 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15396 ns 15417 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15125 ns 15167 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16000 ns 17042 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15625 ns 15375 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 154544.5 ns 158597.5 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5785019 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 755084 ns 852042 ns 0.89
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 239652 ns 242502 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23708 ns 23937.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23833 ns 24666 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24334 ns 23874.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 22750 ns 23291.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 912012 ns 931616 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 40200102.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5378958 ns 5615896 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 693006 ns 698756 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 45792 ns 9958 ns 4.60
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 42792 ns 10166 ns 4.21
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 50250 ns 12333 ns 4.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 43833.5 ns 9083 ns 4.83
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 137439.5 ns 141537 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3438224.5 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 725875 ns 805292 ns 0.90
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 70921 ns 77251 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 57979.5 ns 14083 ns 4.12
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 54833 ns 14250 ns 3.85
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 58625 ns 14208.5 ns 4.13
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 55750 ns 13375 ns 4.17
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 746727 ns 768706 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 20973525 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 4583646 ns 5278042 ns 0.87
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 367203 ns 378343 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 46479 ns 10104.5 ns 4.60
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 44250 ns 10000 ns 4.42
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 50770.5 ns 11354 ns 4.47
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 44583 ns 9791.5 ns 4.55
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 136494.5 ns 139922.5 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3475417.5 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 874292 ns 897959 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 72190 ns 77161 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 52500 ns 12333 ns 4.26
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 49958 ns 12709 ns 3.93
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 53854.5 ns 13000 ns 4.14
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 50520.5 ns 12833.5 ns 3.94
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 610073.5 ns 626138 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19162614 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4104208 ns 4505687.5 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 343583 ns 350573 ns 0.98
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 31333 ns 27729 ns 1.13
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 33271 ns 35375 ns 0.94
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31167 ns 32291 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1979.5 ns 2041 ns 0.97
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16542 ns 16815 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 73361 ns 83101 ns 0.88
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5291 ns 5291.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5125 ns 5146 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5417 ns 5209 ns 1.04
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6541.5 ns 6229.5 ns 1.05
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 147388 ns 151130 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 371153 ns 372413 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4292 ns 250 ns 17.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4084 ns 292 ns 13.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4875 ns 375 ns 13.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4167 ns 250 ns 16.67
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25733 ns 26290 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1193675 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 340687 ns 357500 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48841 ns 48805.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 13625 ns 7354 ns 1.85
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13833 ns 7250 ns 1.91
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 15708 ns 8041 ns 1.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 14208.5 ns 6979 ns 2.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 194695.5 ns 200306 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 24326336 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5508000 ns 6097521 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 392403 ns 397569 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 31709 ns 1958 ns 16.19
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 31625 ns 2041 ns 15.49
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 32167 ns 2084 ns 15.44
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 31167 ns 2000 ns 15.58
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26408 ns 27273 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1195704 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 309499.5 ns 493416.5 ns 0.63
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 209172 ns 209702 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 54000 ns 17541 ns 3.08
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 51250 ns 18166.5 ns 2.82
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 54084 ns 17667 ns 3.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 52208 ns 17666.5 ns 2.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 279056 ns 285604 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25130314 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5457854.5 ns 6161167 ns 0.89
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 716901 ns 724677 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148062.5 ns 174417 ns 0.85
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 152625 ns 167583.5 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 156854 ns 151417 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 148833 ns 145583 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 220541 ns 225867 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7910330 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1433416.5 ns 1429395.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 176132 ns 227572 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1331604 ns 1321729 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1325375 ns 1323417 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1321771 ns 1328313 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1315145.5 ns 1325750 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 983917 ns 1001329 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47509928.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6671667 ns 6753917 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1112170 ns 1011639.5 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25750 ns 24896.5 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25333 ns 25250 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28395.5 ns 28000 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25729 ns 25542 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 265795.5 ns 271026.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7976824.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 967791.5 ns 986750 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 117731 ns 119521 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 172583 ns 117833 ns 1.46
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 118708 ns 120083 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 170583.5 ns 118375 ns 1.44
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117292 ns 176875 ns 0.66
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1197612.5 ns 1213900 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45677207 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6174459 ns 6376312.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 600295 ns 614965 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3458 ns 291 ns 11.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3334 ns 375 ns 8.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6792 ns 375 ns 18.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 2875 ns 292 ns 9.85
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22675 ns 23468 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1239187 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 290750 ns 446666 ns 0.65
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 49411 ns 49170 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 13917 ns 7562.5 ns 1.84
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13791.5 ns 7584 ns 1.82
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 16000 ns 8000 ns 2.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 14250 ns 6875 ns 2.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 200825 ns 206004 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24296939 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5238458 ns 5961166 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 396233 ns 407824 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6750 ns 5812 ns 1.16
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6042 ns 5937.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6792 ns 7333 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6500 ns 6854.5 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 165626 ns 167575 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5771286 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 502625 ns 672646 ns 0.75
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 238537 ns 240143 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9959 ns 9875 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9791.5 ns 9834 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10125 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9542 ns 9854 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 959124 ns 978859.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41918838 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5914646 ns 5692125.5 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 679656 ns 683826 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 666 ns 708 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23007 ns 23025 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2142862 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 222000 ns 214354 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 215712 ns 216952 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4667 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4709 ns 4667 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 4958 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4542 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 229000 ns 242032 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9639894 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1615667 ns 1648667 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 600836 ns 606251 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 17333 ns 8750 ns 1.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 17792 ns 8500 ns 2.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 22917 ns 9917 ns 2.31
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 19458 ns 8375 ns 2.32
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 134999.5 ns 139395 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3466882.5 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 756875 ns 800687.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 69155.5 ns 77821 ns 0.89
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 14583 ns 8625 ns 1.69
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 14542 ns 8625 ns 1.69
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 16375 ns 8979.5 ns 1.82
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 14896 ns 8479 ns 1.76
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 657528.5 ns 674531.5 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20992340 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4676208 ns 4665667 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 351823 ns 358963 ns 0.98
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 127208 ns 126000 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129354 ns 130375 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130583.5 ns 129416 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183354 ns 183687.5 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46414 ns 46315 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 96011 ns 97061 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 303500 ns 332208 ns 0.91
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 345209 ns 323917 ns 1.07
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 327375 ns 315709 ns 1.04
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 615750 ns 569000 ns 1.08
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 205613.5 ns 209770 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 489039 ns 517105 ns 0.95
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396875 ns 397958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287958 ns 288166 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287917 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 754667 ns 756041.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43605 ns 44247 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1378944 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 421709 ns 421167 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83791 ns 84151 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1454625 ns 1380646 ns 1.05
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1138125 ns 1132937.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1134958 ns 1131583.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2484042 ns 2441875 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 260159.5 ns 276054.5 ns 0.94
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 12659523.5 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1794291 ns 1744958 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 354928 ns 354794 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 666271 ns 655000 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 676375.5 ns 645458 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 653250 ns 606125 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 626667 ns 651333 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 219293.5 ns 211637.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8648715 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1345459 ns 1332417 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 244872 ns 234477 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2450104.5 ns 2442417 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2446958 ns 2443729 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2449083.5 ns 2460479.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2434250 ns 2466125 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1069321.5 ns 1084419 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 53497546 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7292437.5 ns 9616354 ns 0.76
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1446123 ns 1491474 ns 0.97
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 33792 ns 32917 ns 1.03
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 35375 ns 35833 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34583 ns 35333 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 895.5 ns 958 ns 0.93
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16123 ns 16181 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 72861 ns 81781 ns 0.89
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3000 ns 3083 ns 0.97
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3333 ns 3166 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3542 ns 3417 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3000 ns 3042 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 146404.5 ns 149907.5 ns 0.98
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 348953 ns 345503 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3811625 ns 406833.5 ns 9.37
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2040334 ns 408833 ns 4.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1910292 ns 409208.5 ns 4.67
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 6752062.5 ns 420333 ns 16.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42704 ns 44137 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1384773 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1162708 ns 1179333.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 243042 ns 242582 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 7177541 ns 3874541 ns 1.85
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4995166 ns 3981625 ns 1.25
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5603937.5 ns 3995271 ns 1.40
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 10187834 ns 3778020.5 ns 2.70
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 247421 ns 254416 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37946938 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11824125 ns 12000083 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1245751 ns 1240627 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 4000 ns 3917 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 4166 ns 3958 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33191 ns 35129 ns 0.94
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1226175 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 178041 ns 181625 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40780 ns 42720 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 16209 ns 15500 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 16000 ns 15708 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 16375 ns 16084 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15708 ns 15834 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 267073.5 ns 276415 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9646077.5 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 855042 ns 889271 ns 0.96
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 164241.5 ns 176511 ns 0.93
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404791 ns 404209 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295667 ns 295395.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 294792 ns 295625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760625 ns 760584 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113148 ns 113822.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1051645.5 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 413438 ns 409229 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89061 ns 92275.5 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1489875 ns 1418500 ns 1.05
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1163354 ns 1143416 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1163000 ns 1157042 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2466958 ns 2464062 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 262128 ns 252054 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 10492873 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1875708 ns 1932667 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 355908 ns 360264 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4042 ns 458 ns 8.83
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3875 ns 542 ns 7.15
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4584 ns 583 ns 7.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3959 ns 500 ns 7.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26052 ns 26614 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1182454 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 293750 ns 362145.5 ns 0.81
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 209722 ns 209492 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 15187.5 ns 8375 ns 1.81
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 15208 ns 8542 ns 1.78
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 17125 ns 9125 ns 1.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 15229 ns 8208 ns 1.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 212892 ns 219325 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25606131.5 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5648583 ns 6248208.5 ns 0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 694801.5 ns 707647 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 835562.5 ns 835021 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 620125 ns 618583 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 620687.5 ns 620791 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1544666.5 ns 1547209 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 132526.5 ns 131693 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 168801 ns 167721 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2686104.5 ns 2699249.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 2002979 ns 2010542 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2010625 ns 2008750 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4942145.5 ns 4923458 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 254138 ns 254591.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 866423 ns 880209 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3334 ns 333 ns 10.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3375 ns 375 ns 9.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6812.5 ns 375 ns 18.17
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2709 ns 291 ns 9.31
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32032 ns 32661 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1203251 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 272916 ns 283208 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 49530 ns 49561 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 13291.5 ns 7417 ns 1.79
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 13000 ns 7417 ns 1.75
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 14708 ns 7958 ns 1.85
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 13083 ns 7083 ns 1.85
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 224632 ns 230552 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22001406 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4935958 ns 5450104 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 370213 ns 374993 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2421500 ns 2388875 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2393375 ns 2390042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2375042 ns 2387625 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2390396 ns 2385041 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 219055 ns 222782 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8092339.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1468042 ns 1608854 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 359023 ns 336514 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4647833 ns 4653250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4651167 ns 4641333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4677124.5 ns 4667333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4648750 ns 4656333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 974219 ns 986732 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46807279 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6424625 ns 6571104 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1416527.5 ns 1423514 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 22500 ns 6875 ns 3.27
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7083 ns 7396 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7209 ns 7583 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6792 ns 6958.5 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23346 ns 24376 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1158059.5 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 260417 ns 275584 ns 0.94
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 38300 ns 34810 ns 1.10
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 49750 ns 33521 ns 1.48
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 49520.5 ns 33500 ns 1.48
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33000 ns 33583 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 46542 ns 32667 ns 1.42
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 235206 ns 243530.5 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10567375 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2058708 ns 2038145.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 238552 ns 242918 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 23375 ns 21625 ns 1.08
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 25562.5 ns 26250 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 23604 ns 25209 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5042 ns 5167 ns 0.98
batchedmm(2, Bsize=512)/forward/GPU/CUDA 17704 ns 18282 ns 0.97
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 85431 ns 86261 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12292 ns 11875 ns 1.04
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10542 ns 10417 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10708 ns 10833 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18000 ns 17792 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 246035 ns 249417 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 391284 ns 378534 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406625 ns 406250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297042 ns 297375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296584 ns 296750 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762208 ns 762958 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46469 ns 47260 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1389885 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 432958 ns 509104 ns 0.85
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 90831 ns 89561 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1487458 ns 1445500 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1169791 ns 1166562.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1167417 ns 1168167 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2472979.5 ns 2472542 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 303943 ns 314496 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 11208548 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2046584 ns 2114437.5 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 378743 ns 384754 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 4011667 ns 434833.5 ns 9.23
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2111583 ns 436583 ns 4.84
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2016833.5 ns 436750 ns 4.62
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 7086979.5 ns 447625 ns 15.83
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54688 ns 55692 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1025990 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1128125 ns 1118708.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 237392 ns 238023 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 7118333 ns 3881625 ns 1.83
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4967562.5 ns 4013979 ns 1.24
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5565146.5 ns 4029083 ns 1.38
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 9774125 ns 3805271 ns 2.57
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 267660 ns 274092 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30789701 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10301187.5 ns 10308938 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1242376 ns 1240392 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8792 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7833 ns 7666 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7834 ns 7709 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12416 ns 12417 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24327 ns 24383 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2162560 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 218521 ns 228000 ns 0.96
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 217622 ns 220382 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46041 ns 44708 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45542 ns 45250 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45667 ns 45208 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45042 ns 45209 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 357694 ns 367981 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12955258 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1668416.5 ns 1846667 ns 0.90
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 673576 ns 666456 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3128875 ns 83167 ns 37.62
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1714291 ns 83416 ns 20.55
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2883292 ns 83917 ns 34.36
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 4743542 ns 94583 ns 50.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190024.5 ns 190250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6261245 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2084333.5 ns 2072000 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204737 ns 172412 ns 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5745041.5 ns 1982729 ns 2.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3425542 ns 2023063 ns 1.69
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3393458.5 ns 2022000 ns 1.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 9080250 ns 2016958 ns 4.50
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 572657 ns 583620.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27146313 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9775416 ns 9865250 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1095715 ns 1098935.5 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal force-pushed the ap/ka_cpu branch 4 times, most recently from ca5d8df to f524b7e Compare August 22, 2024 16:58
[skip tests]