
Allow use of 32 element tuples without dynamic allocation #40468

Merged 1 commit into master on Apr 16, 2021

Conversation

@kshyatt (Contributor) commented Apr 13, 2021

Fixes JuliaGPU/GPUArrays.jl#333 and JuliaGPU/GPUArrays.jl#340, and restores behaviour compatible with #27398.

cc @GiggleLiu, @maleadt

@kshyatt added the "needs nanosoldier run" (This PR should have benchmarks run on it) label on Apr 13, 2021
@GiggleLiu (Contributor) commented

Thanks @kshyatt, your PR fixes the problem for me.

Julia 1.6

julia> using BenchmarkTools

julia> ci = CartesianIndices((fill(2, 20)...,));

julia> @benchmark $ci[35]
BenchmarkTools.Trial: 
  memory estimate:  1.86 KiB
  allocs estimate:  31
  --------------
  minimum time:     1.617 μs (0.00% GC)
  median time:      1.666 μs (0.00% GC)
  mean time:        1.807 μs (1.25% GC)
  maximum time:     60.471 μs (95.66% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> ci0 = CartesianIndices((fill(2, 10)...,));

julia> @benchmark $ci0[35]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.384 ns (0.00% GC)
  median time:      48.379 ns (0.00% GC)
  mean time:        49.057 ns (0.00% GC)
  maximum time:     74.505 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

This branch

julia> using BenchmarkTools

julia> ci = CartesianIndices((fill(2, 20)...,));

julia> @benchmark $ci[35]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     124.371 ns (0.00% GC)
  median time:      125.541 ns (0.00% GC)
  mean time:        127.415 ns (0.00% GC)
  maximum time:     245.769 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     898

julia> ci0 = CartesianIndices((fill(2, 10)...,));

julia> @benchmark $ci0[35]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.852 ns (0.00% GC)
  median time:      47.403 ns (0.00% GC)
  mean time:        48.413 ns (0.00% GC)
  maximum time:     111.554 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988
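
(Editor's sketch, not part of the original comment.) The allocation counts above can also be checked without BenchmarkTools, using Base's `@allocated` behind a small helper; `check_alloc` is a hypothetical name. On Julia builds that include this PR, the 20-dimensional case is expected to report zero bytes once compiled.

```julia
# Measure bytes allocated by a single CartesianIndices getindex,
# wrapped in a function so the argument type is concrete.
check_alloc(ci) = @allocated ci[35]

# Val(20) makes the 20-element tuple type statically known.
const ci20 = CartesianIndices(ntuple(_ -> 2, Val(20)))

check_alloc(ci20)            # warm up: compiles the method first
println(check_alloc(ci20))  # expected to be 0 on Julia >= 1.7
```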

@oscardssmith (Member) commented

Does this PR have notable TTFP implications?

@kshyatt (Contributor, Author) commented Apr 14, 2021

Honestly, I'm not sure. Hopefully the extension from 16 to 32 does not affect compilation too harshly, but we should test this or get confirmation that it's unlikely to.

@JeffBezanson (Sponsor Member) commented

Seems reasonable to me. 16 was pretty arbitrary anyway. I don't see any significant differences in build time, sysimg size, or TTFP.

@vtjnash (Sponsor Member) commented Apr 15, 2021

I thought I'd drawn a performance plot when adding this method in #39414, but it seems I must not have uploaded it. Could you run the original benchmark (#39175) for different sizes of n (say, from 2:256)? It should show some quadratic behavior, iirc, with breaks at a few different points (including this point). I think this PR should be fine, but that would help confirm it.
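
(Editor's sketch, not from the thread: one way such a sweep over n could be scripted using only Base timing; the actual benchmark in #39175 may differ, and `sweep_times`/`time_index` are hypothetical names. BenchmarkTools' `@belapsed` would give more stable numbers.)

```julia
# Function barrier: specializes on the concrete CartesianIndices type,
# so the timed loop is not dominated by dynamic dispatch.
function time_index(ci, reps)
    acc = 0
    acc += ci[1][1]            # warm up / compile getindex
    t = @elapsed for _ in 1:reps
        acc += ci[1][1]        # use the result so the call isn't elided
    end
    return t / reps * 1e6 + 0 * acc   # microseconds per index operation
end

# Time ci[1] for n-dimensional CartesianIndices (each dimension of size 2).
sweep_times(ns; reps = 100_000) =
    [time_index(CartesianIndices(ntuple(_ -> 2, n)), reps) for n in ns]

sweep_times(2:4)   # small demo range; the request above was n in 2:256
```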

@GiggleLiu (Contributor) commented Apr 15, 2021

Just to clarify the number 32 a bit: 32 is a good number because a double-precision tensor of rank 32, with each dimension of size 2, exactly fills a 32 GB GPU.

julia> 1<<32 * 8 / (1<<30)
32.0

This PR fixes the problem that high-order tensors cannot be indexed in GPU kernels. @kshyatt and I encountered this problem independently during research on tensor networks, and raising the limit to 32 solves the issue for us. Also, the benchmark above is meant to show the allocations rather than a speedup.
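
(Editor's note spelling out the arithmetic above, not part of the original comment.) A rank-32 tensor with every dimension of size 2 has 2^32 elements; at 8 bytes per Float64, that is exactly 32 GiB:

```julia
nelem = 2^32            # == 1 << 32 elements in a rank-32, size-2 tensor
bytes = nelem * 8       # 8 bytes per Float64 element
gib   = bytes / 2^30    # divide by bytes per GiB
println(gib)            # prints 32.0
```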

@vtjnash I am glad to run the benchmark, but I don't quite understand what you mean by the original benchmark, or what n refers to here. I just reproduced the first benchmark in the issue:

julia> # on julia 1.7 (this PR)
       @btime SymmetricGroup(20)
  981.900 ns (39 allocations: 3.72 KiB)

julia> # on julia 1.6.0
       @btime SymmetricGroup(20)
  1.065 μs (39 allocations: 3.72 KiB)

That doesn't seem to be what we are looking for?

@kshyatt (Contributor, Author) commented Apr 15, 2021

With this PR:

julia>  @time @eval begin using Plots; unicodeplots(); x=plot(sin); display(x) end
29.816768 seconds (35.54 M allocations: 2.121 GiB, 4.96% gc time, 0.03% compilation time)

on master:

julia>  @time @eval begin using Plots; unicodeplots(); x=plot(sin); display(x) end
29.365071 seconds (35.64 M allocations: 2.122 GiB, 6.11% gc time, 0.03% compilation time)

@kshyatt (Contributor, Author) commented Apr 16, 2021

@vtjnash I ran the benchmarks and here are the results (all times on the y axis are in microseconds):

[image: benchmark plot]

Sorry for the messed up labelling. The blue is master and orange is this PR. The x axis is n.

@vtjnash (Sponsor Member) commented Apr 16, 2021

@nanosoldier runbenchmarks(ALL, vs=":master")

@vtjnash removed the "needs nanosoldier run" (This PR should have benchmarks run on it) label on Apr 16, 2021
@nanosoldier (Collaborator) commented

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@kshyatt (Contributor, Author) commented Apr 16, 2021

What do we think? Are the nanosoldier results concerning or can we go forward?

@vtjnash (Sponsor Member) commented Apr 16, 2021

I don't think I see anything changed. @nanosoldier runbenchmarks("foldl", vs=":master")

@vtjnash added the "status:merge me" (PR is reviewed. Merge when all tests are passing) label on Apr 16, 2021
@nanosoldier (Collaborator) commented

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @christopher-dG

@kshyatt merged commit fbffe7c into master on Apr 16, 2021
@kshyatt deleted the ksh/tuple32 branch on Apr 16, 2021
@kshyatt removed the "status:merge me" (PR is reviewed. Merge when all tests are passing) label on Apr 16, 2021
@JeffBezanson (Sponsor Member) commented

It looks like if nanosoldier is run twice on the same PR, the old result gets overwritten? I don't think it used to do that, did it?

@maleadt (Member) commented Apr 16, 2021

Yeah, that's JuliaCI/Nanosoldier.jl#83. It used to link to a git hash, but that prevented rewriting the NanosoldierReports history (which may need to happen once in a while because the repo just grows huge).

Development

Successfully merging this pull request may close these issues.

permutedims errors on high dimensional tensors
7 participants