Native libraries use mimalloc as global allocator #1249

idavis · 2024-03-10T01:36:05Z

This PR moves the global allocator to MS's mimalloc. The performance gains are significant, and it adds 98KB to the Python wheel and 215KB to the native dylib used in the python wheel. This dependency adds 2.5s to the build time.

We can investigate wasm usage in the future, but this would likely require having a wasi-libc installation available for the build process to give mimalloc a wasi compatible libc to link against.

Benchmarks on Apple M1 Max 64GB

Teleport evaluation     time:   [44.024 µs 44.043 µs 44.063 µs]
                        change: [-21.789% -21.595% -21.420%] (p = 0.00 < 0.05)
                        Performance has improved.

Deutsch-Jozsa evaluation
                        time:   [3.0302 ms 3.0336 ms 3.0375 ms]
                        change: [-23.878% -23.671% -23.493%] (p = 0.00 < 0.05)
                        Performance has improved.

Large file parity evaluation
                        time:   [28.956 ms 28.972 ms 28.989 ms]
                        change: [-1.9178% -1.8288% -1.7431%] (p = 0.00 < 0.05)
                        Performance has improved.

Array append evaluation time:   [306.58 µs 306.78 µs 307.02 µs]
                        change: [-14.312% -14.068% -13.881%] (p = 0.00 < 0.05)
                        Performance has improved.

Array update evaluation time:   [304.71 µs 304.84 µs 305.00 µs]
                        change: [-10.777% -10.651% -10.537%] (p = 0.00 < 0.05)
                        Performance has improved.

Array literal evaluation
                        time:   [144.03 µs 144.32 µs 144.68 µs]
                        change: [-39.800% -39.455% -39.171%] (p = 0.00 < 0.05)
                        Performance has improved.

Large nested iteration  time:   [30.418 ms 30.446 ms 30.476 ms]
                        change: [-11.368% -11.252% -11.131%] (p = 0.00 < 0.05)
                        Performance has improved.

Large input file        time:   [15.869 ms 15.883 ms 15.900 ms]
                        change: [-30.391% -30.267% -30.147%] (p = 0.00 < 0.05)
                        Performance has improved.

Standard library        time:   [9.5701 ms 9.5801 ms 9.5919 ms]
                        change: [-28.527% -28.371% -28.224%] (p = 0.00 < 0.05)
                        Performance has improved.

github-actions · 2024-03-10T01:44:49Z

Benchmark for `01a59b2`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	757.5±5.36µs	740.1±4.34µs	-2.30%
Array literal evaluation	563.7±45.41µs	291.8±2.25µs	-48.23%
Array update evaluation	752.7±2.70µs	748.3±3.94µs	-0.58%
Deutsch-Jozsa evaluation	7.0±0.06ms	6.2±0.05ms	-11.43%
Large file parity evaluation	35.1±0.08ms	35.0±0.44ms	-0.28%
Large input file	41.2±1.00ms	31.5±0.85ms	-23.54%
Large nested iteration	75.1±3.08ms	76.8±0.91ms	+2.26%
Standard library	23.2±0.65ms	17.1±0.47ms	-26.29%
Teleport evaluation	95.9±1.55µs	89.8±1.97µs	-6.36%

swernli

I love it!

allocator/mimalloc-sys/build.rs

github-actions · 2024-03-12T18:19:23Z

Benchmark for `634f970`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	758.1±4.86µs	749.9±3.88µs	-1.08%
Array literal evaluation	539.2±29.12µs	332.8±2.60µs	-38.28%
Array update evaluation	753.5±6.02µs	756.4±3.21µs	+0.38%
Deutsch-Jozsa evaluation	7.0±0.05ms	6.6±0.10ms	-5.71%
Large file parity evaluation	35.3±0.73ms	35.4±0.73ms	+0.28%
Large input file	39.9±0.74ms	30.1±0.71ms	-24.56%
Large nested iteration	76.7±1.21ms	77.0±0.89ms	+0.39%
Standard library	22.8±0.56ms	16.9±0.36ms	-25.88%
Teleport evaluation	96.2±1.50µs	95.1±1.68µs	-1.14%

github-actions · 2024-03-12T18:43:11Z

Benchmark for `4acde1f`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	760.0±6.16µs	741.8±3.87µs	-2.39%
Array literal evaluation	513.6±9.50µs	303.5±10.63µs	-40.91%
Array update evaluation	764.8±23.59µs	754.0±4.49µs	-1.41%
Deutsch-Jozsa evaluation	7.0±0.15ms	6.4±0.05ms	-8.57%
Large file parity evaluation	35.5±0.18ms	35.6±0.65ms	+0.28%
Large input file	41.7±1.11ms	34.0±1.42ms	-18.47%
Large nested iteration	75.4±1.21ms	77.8±0.83ms	+3.18%
Standard library	23.4±0.60ms	17.6±0.41ms	-24.79%
Teleport evaluation	95.8±2.51µs	94.6±1.44µs	-1.25%

billti · 2024-03-12T18:45:03Z

Does this mean we have a dependency on a C compiler in our build now also if users want to set up a Q# development environment? (Which I guess we kind of implicitly do anyway as Rust on Windows requires MSVC or C++ Build Tools are installed, Linux generally has a C compiler already... but I'm not sure if that's already a requirement on macOS).

idavis · 2024-03-12T19:12:58Z

Does this mean we have a dependency on a C compiler in our build now also if users want to set up a Q# development environment? (Which I guess we kind of implicitly do anyway as Rust on Windows requires MSVC or C++ Build Tools are installed, Linux generally has a C compiler already... but I'm not sure if that's already a requirement on macOS).

Yes, you need to have a C compiler handy. I removed the dep on Ninja so that you don't need the generator. You also need cmake on your path. We can get rid of the cmake dep, but we'd need a submodule.

github-actions · 2024-03-12T19:19:54Z

Benchmark for `52c76e8`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	758.1±11.58µs	742.7±7.77µs	-2.03%
Array literal evaluation	505.8±5.76µs	300.2±3.20µs	-40.65%
Array update evaluation	754.0±16.55µs	751.7±4.36µs	-0.31%
Deutsch-Jozsa evaluation	7.0±0.05ms	6.5±0.06ms	-7.14%
Large file parity evaluation	35.1±0.07ms	35.3±0.30ms	+0.57%
Large input file	40.4±0.93ms	31.4±0.64ms	-22.28%
Large nested iteration	77.5±1.14ms	77.1±1.68ms	-0.52%
Standard library	22.8±0.58ms	17.1±0.57ms	-25.00%
Teleport evaluation	95.7±2.38µs	94.2±1.63µs	-1.57%

allocator/mimalloc-sys/build.rs

github-actions · 2024-03-12T19:38:29Z

Benchmark for `5c3e644`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	755.6±4.55µs	758.4±5.45µs	+0.37%
Array literal evaluation	513.0±25.59µs	330.5±3.80µs	-35.58%
Array update evaluation	751.4±4.77µs	757.2±2.81µs	+0.77%
Deutsch-Jozsa evaluation	7.0±0.06ms	6.5±0.07ms	-7.14%
Large file parity evaluation	35.2±0.06ms	35.2±0.06ms	0.00%
Large input file	42.6±1.08ms	38.5±0.69ms	-9.62%
Large nested iteration	76.1±0.39ms	74.9±0.55ms	-1.58%
Standard library	23.6±0.58ms	19.7±0.55ms	-16.53%
Teleport evaluation	96.2±3.41µs	94.1±1.76µs	-2.18%

github-actions · 2024-03-12T20:21:07Z

Benchmark for `72e6dac`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	756.3±8.54µs	743.0±2.91µs	-1.76%
Array literal evaluation	508.3±12.07µs	304.7±10.57µs	-40.06%
Array update evaluation	757.5±4.23µs	757.3±3.92µs	-0.03%
Deutsch-Jozsa evaluation	7.1±0.07ms	6.5±0.10ms	-8.45%
Large file parity evaluation	35.8±0.16ms	35.5±0.42ms	-0.84%
Large input file	41.5±1.48ms	30.2±0.98ms	-27.23%
Large nested iteration	76.3±0.42ms	78.1±0.57ms	+2.36%
Standard library	23.3±0.50ms	16.8±0.10ms	-27.90%
Teleport evaluation	95.8±1.72µs	95.6±2.39µs	-0.21%

github-actions · 2024-03-12T22:53:23Z

Benchmark for `1f33ca3`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	756.6±4.03µs	1066.5±6.96µs	+40.96%
Array literal evaluation	512.1±11.12µs	1260.6±13.24µs	+146.16%
Array update evaluation	761.3±45.57µs	891.5±4.91µs	+17.10%
Deutsch-Jozsa evaluation	7.2±0.05ms	13.6±0.40ms	+88.89%
Large file parity evaluation	35.2±0.13ms	39.2±0.16ms	+11.36%
Large input file	39.7±0.69ms	74.6±0.54ms	+87.91%
Large nested iteration	77.1±0.54ms	94.1±3.85ms	+22.05%
Standard library	22.6±0.55ms	40.3±0.47ms	+78.32%
Teleport evaluation	96.9±1.91µs	184.7±4.99µs	+90.61%

github-actions · 2024-03-12T23:58:14Z

Benchmark for `fc1ff57`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	759.3±4.50µs	750.8±2.99µs	-1.12%
Array literal evaluation	520.0±21.55µs	326.6±2.10µs	-37.19%
Array update evaluation	755.6±8.69µs	754.1±2.46µs	-0.20%
Deutsch-Jozsa evaluation	7.0±0.08ms	6.3±0.07ms	-10.00%
Large file parity evaluation	35.2±0.14ms	35.3±0.78ms	+0.28%
Large input file	40.3±0.89ms	33.4±0.72ms	-17.12%
Large nested iteration	77.6±1.17ms	77.8±0.92ms	+0.26%
Standard library	23.1±0.63ms	18.9±0.95ms	-18.18%
Teleport evaluation	95.8±1.65µs	91.3±1.51µs	-4.70%

github-actions · 2024-03-13T17:24:58Z

Benchmark for `0579203`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	755.5±2.93µs	751.1±11.03µs	-0.58%
Array literal evaluation	511.4±15.14µs	296.0±3.26µs	-42.12%
Array update evaluation	751.6±5.11µs	757.3±13.93µs	+0.76%
Deutsch-Jozsa evaluation	7.0±0.11ms	6.4±0.33ms	-8.57%
Large file parity evaluation	35.2±0.09ms	35.1±0.14ms	-0.28%
Large input file	40.8±1.41ms	30.2±1.59ms	-25.98%
Large nested iteration	76.2±0.61ms	77.8±1.08ms	+2.10%
Standard library	22.8±0.50ms	16.7±0.62ms	-26.75%
Teleport evaluation	95.9±3.34µs	92.6±1.58µs	-3.44%

idavis · 2024-03-13T17:32:17Z

Build warnings on macos when xcode isn't installed will be fixed by rust-lang/cc-rs#1007

github-actions · 2024-03-13T17:46:53Z

Benchmark for `f40f724`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	756.7±6.54µs	765.3±3.10µs	+1.14%
Array literal evaluation	521.8±34.67µs	297.5±5.11µs	-42.99%
Array update evaluation	756.8±23.61µs	771.0±3.46µs	+1.88%
Deutsch-Jozsa evaluation	7.0±0.06ms	6.3±0.09ms	-10.00%
Large file parity evaluation	35.3±0.15ms	35.3±1.08ms	0.00%
Large input file	42.9±1.46ms	33.3±0.64ms	-22.38%
Large nested iteration	74.8±0.64ms	81.6±3.24ms	+9.09%
Standard library	24.1±0.73ms	18.5±0.90ms	-23.24%
Teleport evaluation	96.0±2.89µs	93.8±1.88µs	-2.29%

build.py

Cargo.toml

allocator/mimalloc-sys/CMakeLists.txt

github-actions · 2024-03-13T22:44:07Z

Benchmark for `7cba09a`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	757.1±4.59µs	750.8±7.21µs	-0.83%
Array literal evaluation	539.4±16.03µs	309.7±2.05µs	-42.58%
Array update evaluation	751.6±2.71µs	758.6±13.36µs	+0.93%
Deutsch-Jozsa evaluation	7.0±0.23ms	6.3±0.05ms	-10.00%
Large file parity evaluation	35.0±0.11ms	34.9±0.14ms	-0.29%
Large input file	39.7±1.02ms	30.5±2.21ms	-23.17%
Large nested iteration	77.1±0.67ms	78.1±1.20ms	+1.30%
Standard library	23.0±0.52ms	16.4±0.29ms	-28.70%
Teleport evaluation	96.7±1.97µs	92.1±2.46µs	-4.76%

github-actions · 2024-03-13T23:26:17Z

Benchmark for `c2294fb`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	755.2±4.93µs	752.6±47.80µs	-0.34%
Array literal evaluation	517.6±24.20µs	297.9±1.94µs	-42.45%
Array update evaluation	752.9±3.65µs	750.4±4.78µs	-0.33%
Deutsch-Jozsa evaluation	7.0±0.05ms	6.2±0.03ms	-11.43%
Large file parity evaluation	35.0±0.21ms	34.8±0.58ms	-0.57%
Large input file	39.5±1.05ms	33.6±1.76ms	-14.94%
Large nested iteration	76.3±0.57ms	77.8±1.89ms	+1.97%
Standard library	22.8±0.46ms	18.7±1.59ms	-17.98%
Teleport evaluation	95.7±1.77µs	91.4±2.38µs	-4.49%

allocator/mimalloc-sys/Cargo.toml

README.md

allocator/Cargo.toml

allocator/mimalloc-sys/build.rs

minestarks

I'm frankly on the fence about taking this. The perf gains are undeniable, but I have several reservations about the change overall.

WASM is one our two main use cases and this change wouldn't improve it.
I'd rather reassess impact after we look into optimizations that reduce allocations across the board, such as Optimize evaluator perf via Control Flow Graph #1261 . I still think we have a lot of low hanging fruit here.

If we do decide it's worth it, these would make the change more palatable for me:

An easy way to opt out of the CMake/C dependency for local builds. We have external contributors and we love them! One more prereq before you can build at all can be a barrier to engagement.
Can we use an external implementation via crates.io? This code is a perfect candidate for just using OSS, it's a very general implementation, and exactly the kind of code I wouldn't want to be stuck maintaining.

github-actions · 2024-03-14T18:23:56Z

Benchmark for `c29e736`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	753.7±7.46µs	757.7±4.38µs	+0.53%
Array literal evaluation	507.8±21.43µs	328.0±1.76µs	-35.41%
Array update evaluation	750.6±3.38µs	753.8±5.50µs	+0.43%
Deutsch-Jozsa evaluation	7.0±0.04ms	6.3±0.05ms	-10.00%
Large file parity evaluation	35.0±0.12ms	35.1±0.66ms	+0.29%
Large input file	40.8±2.42ms	29.7±0.95ms	-27.21%
Large nested iteration	74.5±0.88ms	75.4±0.72ms	+1.21%
Standard library	22.8±0.53ms	16.4±0.25ms	-28.07%
Teleport evaluation	95.4±1.80µs	91.4±2.16µs	-4.19%

github-actions · 2024-03-14T18:38:00Z

Benchmark for `8eefe59`

Click to view benchmark

Test	Base	PR	%
Array append evaluation	753.3±5.40µs	741.6±3.01µs	-1.55%
Array literal evaluation	516.7±40.50µs	300.1±6.07µs	-41.92%
Array update evaluation	754.3±36.92µs	747.6±2.25µs	-0.89%
Deutsch-Jozsa evaluation	7.0±0.06ms	6.2±0.05ms	-11.43%
Large file parity evaluation	35.0±0.19ms	34.8±0.07ms	-0.57%
Large input file	39.9±0.85ms	28.7±0.49ms	-28.07%
Large nested iteration	78.6±5.23ms	77.5±2.09ms	-1.40%
Standard library	22.8±0.45ms	16.4±0.21ms	-28.07%
Teleport evaluation	96.0±1.50µs	92.2±3.99µs	-3.96%

minestarks

Thanks for addressing the concerns, excited about the perf improvement!

This reverts commit 93f01d0.

idavis self-assigned this Mar 10, 2024

idavis requested review from minestarks, swernli and billti as code owners March 10, 2024 01:36

swernli approved these changes Mar 11, 2024

View reviewed changes

github-advanced-security bot found potential problems Mar 12, 2024

View reviewed changes

allocator/mimalloc-sys/build.rs Fixed Show fixed Hide fixed

github-advanced-security bot found potential problems Mar 12, 2024

View reviewed changes

allocator/mimalloc-sys/build.rs Fixed Show fixed Hide fixed

idavis requested a review from swernli March 13, 2024 00:00

idavis added 9 commits March 13, 2024 10:16

Native libraries use mimalloc as global allocator

8b09156

Implementing mimalloc-sys crate and dynamic mimalloc compilation

9d669d6

Remove ninja dep

8aba7d2

Static linking crt

10933e4

Adding some verbosity and clean path

9618cda

Seeing if windows crash is due to 2.x

b0d9542

Trying more windows flags.

86facfe

Simplifying compilation.

21e1eb5

Optimizing perf and setting up windows crt

db58fff

idavis force-pushed the iadavis/mimalloc branch from ed83a4a to db58fff Compare March 13, 2024 17:16

Updating building instructions

a8a4dcd

billti reviewed Mar 13, 2024

View reviewed changes

build.py Outdated Show resolved Hide resolved

billti reviewed Mar 13, 2024

View reviewed changes

Cargo.toml Show resolved Hide resolved

billti reviewed Mar 13, 2024

View reviewed changes

allocator/mimalloc-sys/CMakeLists.txt Show resolved Hide resolved

billti approved these changes Mar 13, 2024

View reviewed changes

Cleaning up

3f46b50

Move allocator asignement into macro.

516b407

swernli mentioned this pull request Mar 14, 2024

Optimize evaluator perf via Control Flow Graph #1261

Merged