updated README with Hessian kernel and BarnesHut examples
SebastianAment committed May 10, 2022
1 parent 9d2294e commit 3f09e3a
Showing 1 changed file (README.md) with 102 additions and 10 deletions.
# CovarianceFunctions.jl
[![CI](https://github.com/SebastianAment/CovarianceFunctions.jl/actions/workflows/CI.yml/badge.svg)](https://github.com/SebastianAment/CovarianceFunctions.jl/actions/workflows/CI.yml)
[![codecov](https://codecov.io/gh/SebastianAment/CovarianceFunctions.jl/branch/main/graph/badge.svg?token=04NSNJC9H1)](https://codecov.io/gh/SebastianAment/CovarianceFunctions.jl)

CovarianceFunctions.jl's primary goal is efficient computation with kernel matrices, also called `Gramian`s.
To this end, the package implements a *lazy* `Gramian` type, which
can be used to solve linear algebraic problems arising in
kernel methods efficiently without running out of memory.
Further, the code automatically recognizes when certain linear algebraic structures are present and exploits them for computational efficiency, in particular for fast matrix-vector multiplications (MVMs).

The package offers
- lazy representations of kernel matrices with `O(1)` memory footprint and efficient, parallelized MVMs.
- `O(n²d)` exact MVMs with kernel matrices arising from **gradient observations**
in contrast to the naïve `O(n²d²)` complexity.
- `O(n²d²)` exact MVMs with kernel matrices arising from **Hessian observations** in contrast to the naïve `O(n²d⁴)` complexity.
- `O(n⋅log(n))` exact MVMs and `O(n²)` linear solves with isotropic kernel matrices on a uniform grid, automatically taking advantage of **Toeplitz structure** (see the sketch after this list).
- `O(n⋅log(n))` approximate MVMs with isotropic kernel matrices using a type of **Barnes-Hut algorithm**.
- `O(n⋅log(n))` **sparsification** of isotropic kernel matrices with subsequent `O(k)` MVMs, where `k` is the number of non-zeros.
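
For instance, the Toeplitz fast path can be exercised on a one-dimensional uniform grid. This is a minimal sketch, assuming the automatic structure detection covers `range` inputs; the remaining features are demonstrated in the sections below.
```julia
using CovarianceFunctions
using LinearAlgebra
k = CovarianceFunctions.EQ();     # isotropic kernel
x = range(-1, 1, length = 16384); # uniform grid
K = gramian(k, x);                # lazy kernel matrix; Toeplitz on this grid
a = randn(length(x));
b = K * a;                        # fast MVM if the Toeplitz path is detected
```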

## Basic Usage
This example shows how to construct a kernel matrix with the `gramian` function, highlights the small memory footprint of the lazy representation, and demonstrates matrix-vector multiplication with `mul!`.
```julia
using CovarianceFunctions
using LinearAlgebra
k = CovarianceFunctions.EQ(); # exponentiated quadratic (RBF) kernel
n = 16384;
x = randn(n);
K = gramian(k, x); # lazy kernel matrix
size(K)
(16384, 16384)

a = randn(n);
b = zero(a);
@time mul!(b, K, a); # multiplying with K allocates little memory
0.584813 seconds (51 allocations: 4.875 KiB)
```
## Gradient Kernels
When conditioning Gaussian processes on gradient information,
it is necessary to work with `d × d` *matrix-valued* gradient kernels,
where `d` is the dimension of the input.
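Concretely, the `(i, j)` block of a gradient kernel matrix is the `d × d` matrix of mixed partial derivatives of the kernel, `∂ₓ∂ᵧ k(xᵢ, xⱼ)`. As an independent check of what a single block looks like, here is a sketch using ForwardDiff.jl (not part of this package), with the EQ kernel written out by hand:
```julia
using ForwardDiff
k(x, y) = exp(-sum(abs2, x - y) / 2) # EQ kernel, written out by hand
d = 3; x, y = randn(d), randn(d);
# d × d block: Jacobian w.r.t. y of the gradient w.r.t. x
block = ForwardDiff.jacobian(z -> ForwardDiff.gradient(w -> k(w, z), x), y)
size(block) # (3, 3)
```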
[Roos et al.](https://proceedings.mlr.press/v139/de-roos21a/de-roos21a.pdf) first noted that isotropic and dot product kernels give rise to gradient kernel matrices with a data-sparse structure and proposed a direct method with an `O(n²d + n⁶)` complexity for the low-data regime, where the number of observations `n` is small.

CovarianceFunctions.jl implements an automatic structure derivation engine for a large range of kernel functions, including complex combinations of kernels and the neural network kernel, permitting a matrix-vector product in `O(n²d)` operations with gradient kernel matrices.
It also contains a generic fallback with the regular `O(n²d²)` complexity for cases where no special structure is present or currently implemented.
For example,
```julia
using CovarianceFunctions
using LinearAlgebra
k = CovarianceFunctions.EQ(); # exponentiated quadratic (RBF) kernel
g = CovarianceFunctions.GradientKernel(k);
d, n = 1024, 1024;
x = [randn(d) for _ in 1:n]; # data is vector of vectors
G = gramian(g, x); # lazy gradient kernel Gramian matrix
size(G) # G is n*d by n*d
(1048576, 1048576)
a = randn(n*d);
b = zero(a);
@time mul!(b, G, a); # multiplying with G allocates little memory
0.394388 seconds (67 allocations: 86.516 KiB)
```
Note that the last multiplication was with a **million by million matrix**,
which would be impossible without CovarianceFunctions.jl's lazy and structured representation of the gradient kernel matrix.

Furthermore, CovarianceFunctions.jl can compute structured representations of more complex and composite kernels, maintaining the fast `O(d)` matrix-vector multiply.
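As a sketch of what such a composition could look like, assuming the package's kernel algebra supports sums and scalar scaling (the combination below is hypothetical and for illustration only):
```julia
using CovarianceFunctions
using LinearAlgebra
k = CovarianceFunctions.EQ() + 2 * CovarianceFunctions.EQ(); # assumed kernel algebra
g = CovarianceFunctions.GradientKernel(k);
d, n = 64, 256;
x = [randn(d) for _ in 1:n];
G = gramian(g, x);             # structure is derived for the composite kernel
a = randn(n * d); b = zero(a);
mul!(b, G, a);                 # still a fast structured multiply
```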
Basic input traits amenable to specializations are `IsotropicInput` and `DotProductInput`, among others.
Further transformations and combinations of kernels are also supported, as well as efficient `O(d²)` operations with certain Hessian kernels, in contrast to the naïve `O(d⁴)` complexity.
The main files containing the implementation are `src/gradient.jl`, `src/gradient_algebra.jl`, and `src/hessian.jl`.

## Hessian Kernels
In addition to the special structure for gradient kernels,
CovarianceFunctions.jl implements a similar structure for kernels arising from Hessian observations, reducing the complexity of multiplying with Hessian kernel matrices from `O(n²d⁴)` to `O(n²d²)`.
The support for Hessian kernels is less comprehensive than for gradient kernels, but includes isotropic and dot product kernels.
For example,
```julia
using CovarianceFunctions
using LinearAlgebra
k = CovarianceFunctions.EQ(); # exponentiated quadratic (RBF) kernel
h = CovarianceFunctions.HessianKernel(k);
d, n = 16, 128; # keeping dimension moderate
x = [randn(d) for _ in 1:n]; # data is vector of vectors
@time G = gramian(h, x); # instantiating lazy Hessian kernel Gramian matrix
0.000026 seconds (59 allocations: 54.172 KiB)
size(G) # G is n*d² by n*d²
(32768, 32768)
a = randn(n*d^2);
b = zero(a);
@time mul!(b, G, a); # multiplying with G is fast
0.076744 seconds (12.81 M allocations: 499.100 MiB, 34.14% gc time)
```
This is in contrast with the naïve approach, since even instantiating the Hessian kernel matrix in memory takes orders of magnitude more time and memory:
```julia
@time M = Matrix(G);
5.617745 seconds (1.18 M allocations: 64.722 GiB, 33.33% gc time)
```

Notably, the multiplication with the Hessian kernel matrix scales linearly in `d²`, the amount of information gathered per Hessian observation.
While the implementation scales well, the constants and memory allocations are at this point not as optimized as for the gradient kernel matrices.
Feel free to reach out if you'd benefit from a further improved implementation of the Hessian kernel, and I'll fast-track it.

## Sparsification
Depending on the data distribution, kernel matrices associated with exponentially-decaying kernels can be approximately sparse.
This is particularly likely in higher dimensions, since the average distance of uniformly distributed points in, e.g., the unit hyper-cube grows with the dimension.
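The distance growth is easy to check empirically; the following standalone sketch (independent of the package) estimates the mean pairwise distance of uniform points in the unit hyper-cube:
```julia
using LinearAlgebra
using Statistics
function mean_distance(d; n = 512)
    X = rand(d, n) # uniform samples in the d-dimensional unit hyper-cube
    mean(norm(X[:, i] - X[:, j]) for i in 1:n for j in i+1:n)
end
mean_distance.((2, 8, 32, 128)) # grows roughly like sqrt(d/6)
```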
CovarianceFunctions.jl contains a sparsification algorithm that extends SparseArrays.jl's `sparse` function,
guaranteeing a user-defined element-wise accuracy.
```julia
using CovarianceFunctions
using SparseArrays
using LinearAlgebra
k = CovarianceFunctions.EQ(); # exponentiated quadratic (RBF) kernel
d, n = 32, 16384;
X = randn(d, n);
G = gramian(k, X);

@time S = sparse(G, 1e-6); # sparsification with 1e-6 tolerance
7.208343 seconds (407.54 k allocations: 73.094 MiB)
```
Looking at the number of non-zeros in `S`, we see it is highly sparse:
```julia
nnz(S) # number of non-zeros in S
594480
nnz(S) / n^2 # only ~0.2% of entries are non-zero
0.0022146105766296387
```

Subsequent matrix-vector multiplications with `S` are very fast:
```julia
a = randn(n)
b = zero(a)
@time mul!(b, S, a); # sparse multiply
0.000451 seconds
```
while the lazy dense multiply takes three orders of magnitude longer:
```julia
@time mul!(b, G, a); # lazy dense multiply
0.949835 seconds (61 allocations: 5.594 KiB)
```
The bottleneck in the computation of `S` is NearestNeighbors.jl's `inrange` function.
While in principle the nearest-neighbor search is an `O(n⋅log(n))` operation,
the complexity grows exponentially with the intrinsic dimension of the data.
We believe that it is possible to further [accelerate this computation through parallelization](https://github.com/KristofferC/NearestNeighbors.jl/pull/131).
Note that the sparsification in this case takes as much time as 7 lazy dense multiplications, and would therefore break even quickly if an iterative method is used.
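For instance, a conjugate-gradient solve performs one multiply per iteration, so a few dozen iterations with `S` cost far less than the sparsification itself. A sketch, assuming IterativeSolvers.jl (not a dependency shown above) and a small diagonal shift for numerical positive definiteness:
```julia
using IterativeSolvers # assumed available
using LinearAlgebra
rhs = randn(n);
x = cg(S + 1e-8 * I, rhs; maxiter = 100); # each iteration costs one O(k) multiply
```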

## Barnes-Hut
The Barnes-Hut algorithm has its origins in accelerating gravity simulations, but a variant of it can be applied to kernel matrices arising from more general kernels,
allowing for an approximate matrix-vector multiply in `O(n⋅log(n)⋅d)` operations.
CovarianceFunctions.jl provides the `BarnesHutFactorization` type; its constructor can be called on a `Gramian` matrix, and the resulting factorization applies the fast transform whenever `*` or `mul!` is called.

```julia
using CovarianceFunctions
using LinearAlgebra
k = CovarianceFunctions.EQ();

d, n = 2, 65536;
X = randn(d, n);
G = gramian(k, X);
@time F = CovarianceFunctions.BarnesHutFactorization(G, θ = 1/2);
0.076532 seconds (720.93 k allocations: 40.189 MiB, 8.65% gc time)
```
Notably, constructing `F` computes a Ball tree structure over the data `X`, which allocates memory, but this is an `O(n⋅log(n)⋅d)` operation.
The `θ ∈ [0, 1]` parameter trades off accuracy with speed.
For `θ = 0`, multiplying with `F` is exact but `O(n²)`.
Subsequent multiplications with `F` are fast:
```julia
a = randn(n);
b = zero(a);
@time mul!(b, F, a);
0.083348 seconds (209.54 k allocations: 15.712 MiB)
```
whereas for `n` of this order of magnitude, the lazy dense multiply takes much longer:
```julia
@time mul!(b, G, a);
4.938038 seconds (64 allocations: 5.688 KiB)
```
For `θ = 1/2`, we get around 2 digits of accuracy:
```julia
norm(G*a-F*a) / norm(G*a)
0.011654351196539835
```
By reducing `θ`, we can increase the accuracy at the expense of runtime:
```julia
F = CovarianceFunctions.BarnesHutFactorization(G, θ = 1/4); # setting θ = 1/4 instead of 1/2 as above
@time mul!(b, F, a);
0.223038 seconds (209.67 k allocations: 15.716 MiB)
norm(G*a-F*a)/norm(G*a)
0.004286540849728051
```

<!-- * more covariance matrix factorizations (HODLR), or approximations (SKI, Nystrom, Random Kitchen Sinks, Fast Food)
* GPLVMs
