
Allow use of 32 element tuples without dynamic allocation #40468

Merged 1 commit into master on Apr 16, 2021

Conversation

@kshyatt (Contributor) commented Apr 13, 2021

Fixes JuliaGPU/GPUArrays.jl#333 and JuliaGPU/GPUArrays.jl#340, and restores behaviour compatible with #27398.

cc @GiggleLiu, @maleadt

@kshyatt added the "needs nanosoldier run" (This PR should have benchmarks run on it) label on Apr 13, 2021
@GiggleLiu (Contributor) commented

Thanks @kshyatt, your PR fixes the problem for me.

Julia 1.6

julia> using BenchmarkTools

julia> ci = CartesianIndices((fill(2, 20)...,));

julia> @benchmark $ci[35]
BenchmarkTools.Trial: 
  memory estimate:  1.86 KiB
  allocs estimate:  31
  --------------
  minimum time:     1.617 μs (0.00% GC)
  median time:      1.666 μs (0.00% GC)
  mean time:        1.807 μs (1.25% GC)
  maximum time:     60.471 μs (95.66% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> ci0 = CartesianIndices((fill(2, 10)...,));

julia> @benchmark $ci0[35]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.384 ns (0.00% GC)
  median time:      48.379 ns (0.00% GC)
  mean time:        49.057 ns (0.00% GC)
  maximum time:     74.505 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

This branch

julia> using BenchmarkTools

julia> ci = CartesianIndices((fill(2, 20)...,));

julia> @benchmark $ci[35]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     124.371 ns (0.00% GC)
  median time:      125.541 ns (0.00% GC)
  mean time:        127.415 ns (0.00% GC)
  maximum time:     245.769 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     898

julia> ci0 = CartesianIndices((fill(2, 10)...,));

julia> @benchmark $ci0[35]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.852 ns (0.00% GC)
  median time:      47.403 ns (0.00% GC)
  mean time:        48.413 ns (0.00% GC)
  maximum time:     111.554 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988
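
(Editor's sketch, not part of the original comment.) The allocation counts above can also be checked without BenchmarkTools, using Base's `@allocated` behind a small helper; `check_alloc` is a hypothetical name. On Julia builds that include this PR, the 20-dimensional case is expected to report zero bytes once compiled.

```julia
# Measure bytes allocated by a single CartesianIndices getindex,
# wrapped in a function so the argument type is concrete.
check_alloc(ci) = @allocated ci[35]

# Val(20) makes the 20-element tuple type statically known.
const ci20 = CartesianIndices(ntuple(_ -> 2, Val(20)))

check_alloc(ci20)            # warm up: compiles the method first
println(check_alloc(ci20))  # expected to be 0 on Julia >= 1.7
```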

@oscardssmith (Member) commented

Does this PR have notable TTFP implications?

@kshyatt (Contributor, Author) commented Apr 14, 2021

Honestly, I'm not sure. Hopefully the extension from 16 to 32 does not affect compilation too harshly, but we should test this or get confirmation that it's unlikely to.

@JeffBezanson (Sponsor Member) commented

Seems reasonable to me. 16 was pretty arbitrary anyway. I don't see any significant differences in build time, sysimg size, or TTFP.

@vtjnash (Sponsor Member) commented Apr 15, 2021

I thought I'd drawn a performance plot when adding this method in #39414, but it seems I must not have uploaded it. Could you run the original benchmark (#39175) for different sizes of n (say, from 2:256)? It should show some quadratic behavior, iirc, with breaks at a few different points (including this point). I think this PR should be fine, but that would help confirm it.
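
(Editor's sketch, not from the thread: one way such a sweep over n could be scripted using only Base timing; the actual benchmark in #39175 may differ, and `sweep_times`/`time_index` are hypothetical names. BenchmarkTools' `@belapsed` would give more stable numbers.)

```julia
# Function barrier: specializes on the concrete CartesianIndices type,
# so the timed loop is not dominated by dynamic dispatch.
function time_index(ci, reps)
    acc = 0
    acc += ci[1][1]            # warm up / compile getindex
    t = @elapsed for _ in 1:reps
        acc += ci[1][1]        # use the result so the call isn't elided
    end
    return t / reps * 1e6 + 0 * acc   # microseconds per index operation
end

# Time ci[1] for n-dimensional CartesianIndices (each dimension of size 2).
sweep_times(ns; reps = 100_000) =
    [time_index(CartesianIndices(ntuple(_ -> 2, n)), reps) for n in ns]

sweep_times(2:4)   # small demo range; the request above was n in 2:256
```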

@GiggleLiu (Contributor) commented Apr 15, 2021

Just to clarify the number 32 a bit: 32 is a good number because a double-precision tensor of rank 32, with each dimension of size 2, exactly fills a 32 GB GPU.

julia> 1<<32 * 8 / (1<<30)
32.0

This PR fixes the problem that high-order tensors cannot be indexed in GPU kernels. @kshyatt and I encountered this problem independently during research on tensor networks, and raising the limit to 32 solves the issue for us. Also, the benchmark above is meant to show the allocations rather than a speedup.
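
(Editor's note spelling out the arithmetic above, not part of the original comment.) A rank-32 tensor with every dimension of size 2 has 2^32 elements; at 8 bytes per Float64, that is exactly 32 GiB:

```julia
nelem = 2^32            # == 1 << 32 elements in a rank-32, size-2 tensor
bytes = nelem * 8       # 8 bytes per Float64 element
gib   = bytes / 2^30    # divide by bytes per GiB
println(gib)            # prints 32.0
```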

@vtjnash I am glad to run the benchmark, but I don't quite understand what you mean by the original benchmark, or what n refers to here. I just reproduced the first benchmark in the issue:

julia> # on julia 1.7 (this PR)
       @btime SymmetricGroup(20)
  981.900 ns (39 allocations: 3.72 KiB)

julia> # on julia 1.6.0
       @btime SymmetricGroup(20)
  1.065 μs (39 allocations: 3.72 KiB)

That doesn't seem to be what we are looking for?

@kshyatt (Contributor, Author) commented Apr 15, 2021

With this PR:

julia>  @time @eval begin using Plots; unicodeplots(); x=plot(sin); display(x) end
29.816768 seconds (35.54 M allocations: 2.121 GiB, 4.96% gc time, 0.03% compilation time)

on master:

julia>  @time @eval begin using Plots; unicodeplots(); x=plot(sin); display(x) end
29.365071 seconds (35.64 M allocations: 2.122 GiB, 6.11% gc time, 0.03% compilation time)

@kshyatt (Contributor, Author) commented Apr 16, 2021

@vtjnash I ran the benchmarks and here are the results (all times on the y axis are in microseconds):

[image: benchmark plot]

Sorry for the messed up labelling. The blue is master and orange is this PR. The x axis is n.

@vtjnash (Sponsor Member) commented Apr 16, 2021

@nanosoldier runbenchmarks(ALL, vs=":master")

@vtjnash removed the "needs nanosoldier run" (This PR should have benchmarks run on it) label on Apr 16, 2021
@nanosoldier (Collaborator) commented

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@kshyatt (Contributor, Author) commented Apr 16, 2021

What do we think? Are the nanosoldier results concerning or can we go forward?

@vtjnash (Sponsor Member) commented Apr 16, 2021

I don't think I see anything changed. @nanosoldier runbenchmarks("foldl", vs=":master")

@vtjnash added the "status:merge me" (PR is reviewed. Merge when all tests are passing) label on Apr 16, 2021
@nanosoldier (Collaborator) commented

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @christopher-dG

@kshyatt merged commit fbffe7c into master on Apr 16, 2021
@kshyatt deleted the ksh/tuple32 branch on Apr 16, 2021
@kshyatt removed the "status:merge me" (PR is reviewed. Merge when all tests are passing) label on Apr 16, 2021
@JeffBezanson (Sponsor Member) commented

It looks like if nanosoldier is run twice on the same PR, the old result gets overwritten? I don't think it used to do that, did it?

@maleadt (Member) commented Apr 16, 2021

Yeah, that's JuliaCI/Nanosoldier.jl#83. It used to link to a git hash, but that prevented rewriting the NanosoldierReports history (which may need to happen once in a while because the repo just grows huge).

Development

Successfully merging this pull request may close these issues.

permutedims errors on high dimensional tensors
7 participants