speed up LCE-A kernel #1910

Closed
wants to merge 1 commit

Conversation

sdaulton
Contributor

Summary:
The current implementation is very slow. This is particularly problematic when the number of contexts is large.

This diff yields large speed-ups (>3 orders of magnitude) by removing the nested for loops in `forward` and using batched computation instead. It also caches `context_covar` in eval mode, with the additional requirement that the parameters for each context are contiguous and in the same order (see the sketches after the benchmarks below).

  • 14x speed up with 8 contexts and 1,154x speed up with 128 contexts (CPU)

  • 22x speed up with 8 contexts and 3,370x speed up with 128 contexts (CUDA)

  • Without this diff, it takes 5 seconds on CPU and 12 seconds with CUDA for a single forward pass with 128 contexts.

Current implementation:

  • 8 contexts:
    • Forward pass: 20ms (CPU), 45.2ms (CUDA)
    • Roundtrip: 39.1ms (CPU), 99.7ms (CUDA)
  • 128 contexts:
    • Forward pass: 5.08s (CPU), 12s (CUDA)
    • Roundtrip: 14.2s (CPU), 26.7s (CUDA)

New Implementation:

  • 8 contexts:
    • Forward pass: 1.44ms (CPU), 2.05ms (CUDA)
    • Roundtrip: 2.22ms (CPU), 4.65ms (CUDA)
  • 128 contexts:
    • Forward pass: 4.4ms (CPU), 3.56ms (CUDA)
    • Roundtrip: 6.97ms (CPU), 5.34ms (CUDA)
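
To make the batched-computation idea concrete, here is a minimal, hypothetical sketch (not the actual `LCEAKernel` code; the RBF form, embedding sizes, and function names are illustrative assumptions) contrasting a nested-loop construction of a context covariance matrix with an equivalent batched construction:

```python
import torch


def context_covar_loop(embeddings: torch.Tensor, lengthscale: torch.Tensor) -> torch.Tensor:
    # Naive construction: one Python-level kernel evaluation per (i, j) pair.
    n = embeddings.shape[0]
    covar = torch.empty(n, n, dtype=embeddings.dtype)
    for i in range(n):
        for j in range(n):
            diff = (embeddings[i] - embeddings[j]) / lengthscale
            covar[i, j] = torch.exp(-0.5 * diff.pow(2).sum())
    return covar


def context_covar_batched(embeddings: torch.Tensor, lengthscale: torch.Tensor) -> torch.Tensor:
    # Batched construction: a single cdist call replaces both loops.
    scaled = embeddings / lengthscale
    return torch.exp(-0.5 * torch.cdist(scaled, scaled).pow(2))


if __name__ == "__main__":
    torch.manual_seed(0)
    emb = torch.randn(128, 4, dtype=torch.double)  # 128 contexts, 4-dim embeddings (illustrative)
    ls = torch.rand(4, dtype=torch.double) + 0.5
    assert torch.allclose(context_covar_loop(emb, ls), context_covar_batched(emb, ls))
```

The batched version performs the same arithmetic in a handful of vectorized tensor ops instead of n² Python iterations, which is where gains of this magnitude, especially on CUDA, would come from.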

Differential Revision: D47118335
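
The eval-mode caching mentioned above can likewise be sketched with a hypothetical helper module (an illustration of the pattern, not the BoTorch class itself): rebuild the covariance on every call while training, but compute it once and reuse it once the module is in eval mode and its parameters are fixed.

```python
import torch
from torch import nn


class CachedContextCovar(nn.Module):
    """Hypothetical module illustrating the eval-mode caching pattern."""

    def __init__(self, embeddings: torch.Tensor, lengthscale: torch.Tensor) -> None:
        super().__init__()
        self.embeddings = nn.Parameter(embeddings)
        self.lengthscale = nn.Parameter(lengthscale)
        self._cached_covar = None  # populated lazily in eval mode

    def train(self, mode: bool = True) -> "CachedContextCovar":
        # Invalidate the cache whenever train/eval mode flips.
        self._cached_covar = None
        return super().train(mode)

    def _compute_covar(self) -> torch.Tensor:
        scaled = self.embeddings / self.lengthscale
        return torch.exp(-0.5 * torch.cdist(scaled, scaled).pow(2))

    def context_covar(self) -> torch.Tensor:
        if self.training:
            # Parameters are still being optimized: recompute on every call.
            return self._compute_covar()
        if self._cached_covar is None:
            # Parameters are fixed in eval mode: compute once, reuse afterwards.
            self._cached_covar = self._compute_covar()
        return self._cached_covar


if __name__ == "__main__":
    module = CachedContextCovar(torch.randn(8, 2), torch.rand(2) + 0.5).eval()
    assert module.context_covar() is module.context_covar()  # cached tensor is reused
```

Clearing the cache in `train()` keeps the cached matrix from going stale when the module is switched back to training and its parameters change.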

@facebook-github-bot added the CLA Signed and fb-exported labels Jun 29, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47118335

@codecov

codecov bot commented Jun 29, 2023

Codecov Report

Merging #1910 (7df50d4) into main (28b1b2b) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 7df50d4 differs from the pull request's most recent head ed2ecf1. Consider uploading reports for commit ed2ecf1 to get more accurate results.

@@            Coverage Diff            @@
##              main     #1910   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          173       173           
  Lines        15232     15264   +32     
=========================================
+ Hits         15232     15264   +32     
| Impacted Files                             | Coverage Δ            |
|--------------------------------------------|-----------------------|
| botorch/models/contextual.py               | 100.00% <ø> (ø)       |
| botorch/models/kernels/contextual_lcea.py  | 100.00% <100.00%> (ø) |


sdaulton added a commit to sdaulton/botorch that referenced this pull request Jun 30, 2023

sdaulton added a commit to sdaulton/botorch that referenced this pull request Jun 30, 2023

sdaulton added a commit to sdaulton/botorch that referenced this pull request Jun 30, 2023

sdaulton added a commit to sdaulton/botorch that referenced this pull request Jul 3, 2023

sdaulton added a commit to sdaulton/botorch that referenced this pull request Jul 3, 2023

sdaulton added a commit to sdaulton/botorch that referenced this pull request Jul 3, 2023

Reviewed By: Balandat

@facebook-github-bot
Contributor

This pull request has been merged in 7eb847a.

Labels
CLA Signed, fb-exported, Merged

2 participants