Skip to content
Paul R. C. Kent edited this page Jan 12, 2018 · 38 revisions

Welcome to the miniAFQMC wiki!

Introduction

This miniapp contains a simplified but computationally accurate implementation of the auxiliary field quantum Monte Carlo (AFQMC) algorithms implemented in the full production QMCPACK application. The miniapp is designed to enable tests of different programming methodologies, optimizations and algorithms.

How-To Guides

Prerequisites

Build miniAFQMC

cd build
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx ..
make 

CMake provides a number of optional variables that can be set to control the build and configure steps. When passed to CMake, these variables will take precedent over the environmental and default variables. To set them add -D FLAG=VALUE to the configure line between the CMake command and the path to the source directory.

  • General build options
CMAKE_C_COMPILER    Set the C compiler
CMAKE_CXX_COMPILER  Set the C++ compiler
CMAKE_BUILD_TYPE    A variable which controls the type of build (defaults to Release).
                    Possible values are:
                    None (Do not set debug/optmize flags, use CMAKE_C_FLAGS or CMAKE_CXX_FLAGS)
                    Debug (create a debug build)
                    Release (create a release/optimized build)
                    RelWithDebInfo (create a release/optimized build with debug info)
                    MinSizeRel (create an executable optimized for size)
CMAKE_C_FLAGS       Set the C flags.  Note: to prevent default debug/release flags
                    from being used, set the CMAKE_BUILD_TYPE=None
                    Also supported: CMAKE_C_FLAGS_DEBUG, CMAKE_C_FLAGS_RELEASE,
                                    CMAKE_C_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS     Set the C++ flags.  Note: to prevent default debug/release flags
                    from being used, set the CMAKE_BUILD_TYPE=None
                    Also supported: CMAKE_CXX_FLAGS_DEBUG, CMAKE_CXX_FLAGS_RELEASE,
                                    CMAKE_CXX_FLAGS_RELWITHDEBINFO

Executable

The main executable is created in ./bin folder.

miniafqmc              # runs a fake AFQMC and report the time spent in each component.

The miniafqmc executable requires a hdf5 input file and accepts various command line options. The input file, whose default name is afqmc.h5, contains various data structures associated with the calculation including trial wave-functions, Hamiltonian matrix elements, problem dimensions (# electrons, # orbitals, ...), etc. Due to the complicated process required to generate this input, it is not possible to execute the miniapp without this file. Several sample files are supplied with the miniapp in the examples directory. For additional examples covering different parameter space, contact the QMCPACK developers. Command line options are available to control several parameters in the calculation, e.g. # walkers, # steps, etc, all with reasonable default values. If more controls is needed, query by -h to print out available options.

Output

This is an example of the output produced by miniafqmc executed on example/N32_M64/afqmc.h5 with default options. The miniapp reports i) basic calculation parameters, ii) the average energy at each step, 3) timing information for the main sections of the calculation.

***********************************************************
                         Summary                           
***********************************************************

  AFQMC info: 
    name: miniAFQMC
    # of molecular orbitals: 64
    # of up electrons: 32
    # of down electrons: 32

  Execution details: 
    nsteps: 10
    nsubsteps: 10
    nwalk: 16
    northo: 10
    verbose: false
    # Chol Vectors: 1031
    transposed Spvn: true
    Chol. Matrix Sparsity: 0.0912627
    Hamiltonian Sparsity: 0.0585666

***********************************************************
                     Beginning Steps                       
***********************************************************

# Step   Energy   
0   24.5761
1   24.4027
2   24.3019
3   24.2015
4   24.1474
5   24.1166
6   24.0866
7   24.0929
8   23.9847
9   23.9405

***********************************************************
                   Finished Calculation                    
***********************************************************

Stack timer profile
Timer                   Inclusive_time  Exclusive_time  Calls       Time_per_call
Total                              3.0442     0.0002              1       3.044162035
  Bias Potential                   0.7084     0.7084            100       0.007084172
  H-S Potential                    0.5992     0.5992            100       0.005991509
  Local Energy                     0.1485     0.1485             10       0.014847875
  Orthgonalization                 0.0452     0.0452              9       0.005027586
  Other                            0.0004     0.0004            200       0.000001969
  Overlap                          0.1252     0.1252            109       0.001148493
  Propagation                      1.0463     1.0463            100       0.010463102
  Sigma                            0.0806     0.0806            100       0.000805798
  compact Mixed Density Matrix     0.2902     0.2902            100       0.002901709

Basics of Auxiliary-Field Quantum Monte Carlo

The Auxiliary-Field quantum Monte Carlo (AFQMC) method is an orbital-based quantum Monte Carlo (QMC) method designed for the study of interacting quantum many-body systems. For a complete description of the method and the algorithm, we refer the reader to one of the review articles on the subject; for example see Chapter 15 "Auxiliary-Field Quantum Monte Carlo for Correlated Electron Systems" of the online book "Emergent Phenomena in Correlated Matter" (https://www.cond-mat.de/events/correl13/manuscripts/) and references therein. Here we give a very brief description of the main elements of the algorithm, with emphasis on the steps of the imaginary-time propagation process implemented in the miniAFQMC miniapp. This contains most of the computationally intensive parts of a production level AFQMC calculation.

The algorithm is defined in terms of a given single particle basis, {f1}. From the point of view of the AFQMC algorithm, the single particle basis is arbitrary with the only requirement being the absence of linear dependencies within the states (the overlap matrix must be full rank). For simplicity, we restrict our implementation to orthogonal single particle basis sets. Examples of typical basis sets are: plane waves, Gaussian-based atom centered expansions (for both finite and periodic systems), PAW, etc. Within the given basis set, the second quantized Hamiltonian becomes:

f2,

where the 1- and 2-electron matrix elements are defined by:

f3,

f4.

The quantum state is represented by an ensemble of walkers:

f5,

where each walker represents a Slater determinant,

f6,

defined by the Slater matrix of the walker, W. The AFQMC method consists of an imaginary-time propagator of the quantum state towards a target eigenstate of the many-body system, typically the ground state. The Hubbard-Stratonovich transformation is used to express the propagator of the system (a 2 body operator) as an integral over non-interacting propagators (1 body operator) in a fluctuating external field:

f7,

where H1 is the part of the Hamiltonian that contains one-body operators and vHS is a factorization of the 2-electron matrix elements satisfying:

f8.

Using the fact that the application of a 1-body propagator on a Slater determinant leads to another Slater determinant,

f9,

it is straightforward to implement the imaginary-time propagator of the ensemble of walkers as a random walk in the non-orthogonal space of Slater determinants. In order to improve the sampling efficiency of the random walk and facilitate the implementation of phaseless constrains to remove the sign problem, a trial wave-function is typically used for importance sampling and force bias during the simulations. We omit the derivation of these approaches here and refer the reader to the literature for details. The resulting propagation algorithm can described with the pseudo-code below.

# loop over time steps
for tstep = 1:Nstep
  1. Calculate bias potential for all walkers 
  2. for each walker: wi  
    2a. Apply (half step) 1 body propagator 
    2b. Sample random fields from Gaussian distribution 
    2c. Apply H-S propagator (term with vHS)
    2d. Apply (half step) 1 body propagator
  3. Update walker overlaps (and possibly local energies)
  4. Update walker weights
  (optional) 5. Measure quantities
  (optional) 6. Orthogonalize walker's Slater matrix
  (optional) 7. Reconfigure walker population and redistribute walkers among processors.

Implementation Details

Here we provide a brief description of the implementation of the algorithm in miniAFQMC. We use pseudo-code to describe the mathematical operations when possible, actual implementations (e.g. memory layout, parallelization strategy, etc) depends on the version of the miniapp. The main data structures in the code are described below.

* N: number of electrons with spin up or down. The total number of electrons is 2*N.
* M: number of single particle states. (M > N).
* W(n,s,i,a): Array of Slater matrices of all walkers in a node. 
  - n={0:Nw}: walker index
  - s={0:1}: spin index
  - i={0:M}: basis index
  - a={0:N}: orbital index
* At(s,i,a): Slater matrix of the trial wave-function
  - s={0:1}: spin index
  - i={0:M}: basis index
  - a={0:N}: orbital index
* P1(i,j): 1-body propagator (pre-calculated and read from input file) 
* Spvn(ik,m): Matrix containing the factorization of the 2-electron integrals. Denoted here as the Cholesky matrix. Notice the combined index ik which represents a linearization of the 2-D index (i,k). ik=i*M+k. **Stored as a sparse matrix**.
  - ik={0:M*M}: index of basis pairs
  - m={0:nChol}: index of Cholesky vectors.
* Vakbl(ik,bl): Half-rotated 2-electron integrals. See the definition below. **Stored as a sparse matrix**.
  - ik/bl={0:M*M}: index of basis pairs
* vbias(m,n): bias potential used in force bias algorithm.
  - m={0:nChol}: index of Cholesky vectors.
  - n={0:Nw}: walker index.

Using the structures defined above, here is a list of important quantities along with their definition. There is an implicit spin dependence in all expressions below, which we suppress for clarity.

1. Green's function of a walker: Central quantity needed to evaluate expectation values of a walker. Notice that for efficiency purposes, we store all the walker's Green functions as a matrix, where the rows store the linearized index, ij=i*M+j, and the columns store the walker index.

f10

2. compact Green's function: Alternative expression for a walker's Green function used in actual evaulation of certain quantities. Same structure as the full Green function, but with reduced dimensions.

f11

3. Overlap of a walker and the trial Slater determinants.

f12

4. Local energy of a walker.

f13

The cost to evaluate the second term in this expression scales as O[M^4]. It is possible to reduce the cost to order O[N^2M^2] by defining the half-rotated 2-electron integral matrix, Vakbl, and using the following expression.

f14,

where Vakbl is defined as:

f15.

The expression above is the one used to evaluate the local energy in the code. By storing all the walker's Green functions as a matrix, it is possible to evaluate the above expression using a matrix-matrix multiplication followed by a vector dot product. Since the Vabkl matrix is highly sparse (typically only 1%-5% of terms are non-zero) and its size can be very large (M^4), we always use a sparse representation of this matrix and associated sparse matrix-dense matrix operations.

5. vias potential: Central quantity in the force-bias step of the algorithm. Essential for good sampling efficiency.

f16.

Similar to the case of the local energy, the Cholesky matrix, Spvn, is stored as a sparse matrix and appropriate sparse linear algebra operations are used. Even though this expression is not efficient, in miniAFQMC it is possible to use the expression defined above directly using the command line option "-t no". The default expression, which is more efficient and requires less operations (at the expense of additional memory) is:

f17,

where the half-transformed Cholesky matrix SpvnT is defined below:

f18.

miniAFQMC versions

The miniAFQMC repository contains several implementations of the algorithm. The goal is to use the miniapp to explore different programming models in the search for an performance portable, scalable and optimal implementation. Each implementation will be store in a different git branch. The master branch contains a base implementation based on boost multi_array. This implementation is serial and does not support explicit parallelization (parallelization only through threaded versions of blas). It is meant to be a reference implementation that defines the expected functionality and gives a baseline serial performance. The current branches in the repository are listed below.

master      # Base implementation. Serial execution with no explicit threading support.
mpi3_shm    # Distributed implementation based on MPI for internode communication and MPI3-Shared Memory for intranode communication. Operations local to a node are distributed among various core groups using an mpi-only framework based on shared-memory.
cuda        # Implementation for GPU architectures based on CUDA.
kokkos      # Portable implementation based on Kokkos.  

Future versions of the miniapp (not yet available) include a distributed implementation based on Scalapack and a distributed implementation based on MPI+OpenMP, similar in spirit to mpi3_shm, but using OpenMP for intranode concurrency.

Base miniAFQMC implementation details

Most mathematical objects in AFQMC are expressed as arrays: matrices, vectors of collections of them, either dense or sparse. These objects are represented by contiguous memory or regular subset inside contiguous memory blocks, due to performance requirements in general and to the need of using existing libraries in particular. Dimensionality of these arrays ranges from 1 to 4. Dense arrays and submatrices in particular need are represented as standard C-order (when possible) strides memory layouts.

Dense Arrays, representations and operations

Boost.MultiArray is utilized to abstract away the concept of strided, dense arrays or regular subviews of arbitrary dimension. Elements and submatrix views easily accessible while layout information can be extracted when needed.

At the moment, Boost.MultiArray is used in a non-intrusive mode, by simply representing arrays in memory that is preallocated in a controlled way. All dense arrays in contiguous are represented by boost::(const_)multi_array_ref and subviews types are generated via the overloads of ::operator[]. The high level code utilizes the indexing interface and leaves the details of memory layout and striding to the underlying numerical library. It is only at the lowest level that the ::strides() and ::data() interface (which exposes the memory layout) is used in the BLAS and Lapack low level calls. The library is utilized across dimensions in a uniform way.

More importantly the library helps define the concept of MultiArray that is utilized to pass any object that represents a (dense) arrays that is either contiguous in memory or refers to regular layouts inside a contiguous block of memory. This concept-based interface helps to abstract away all the code as actual types are forwarded and dispatched to the low level calls. Most functions involving arrays are templated and C++ perfect-forwarding is heavy utilized for output array arguements. Dimensionality mismatchs are catched at compile time. This allows for a natural syntax that will work on arrays or subarrays.

Ultimately, product(A, B, C), meaning generically A*B -> C is transformed (at compile-time) to a call to Xgem or Xaxpy depending on the dimensionality of A and B (plus a compile-time check that C has the correct dimensionality). Strides information is only extracted before the call to the BLAS or Lapack routine.

The syntax is extended to other cases like product(transposed(A), B, C) or product(transposed(A), transposed(B), C), meaning T(A)*B -> C and T(A)*T(B) -> C respectively. In the processes transforming high-level syntax into low-level calls, the utilization of if conditional branched is minimized as most of the decisions are taken at compile-time.

Since Boost.MultiArray does runtime bounds check by default it is important to compile in release mode before performance test or at the least #define BOOST_DISABLE_ASSERTS.

The MultiArray framework allows for general dimensionality and all sorts of (regular) memory layouts. However, linear algebra operations are restricted to C-ordering, which the only ordering utilized in base code so far. When ordering is incompatible with the BLAS/Lapack call an assert (active in debug mode) is raised (no heroic attempts to rearrange data is attempted).

Another current limitation is that once the memory layout is chosen (C-ordering at the moment for contiguous arrays) the order of loops accessing elements has to be consistent to maintain performance. The Kokkos version will abstract away these loops. Since Kokkos is in some way a generalization of Boost.MultiArray we expect the transition from MultiArray to be smooth while in the meantime benefiting from the syntax and performance that MultiArray allows.

Sparse Matrix, representations and operations

Bidimensional sparse arrays are represented in CRS (Compressed Row Storage) format. We utilize a syntax similar to Boost.MulitArray when possible and when it doesn't interfere with performance, in order to have a uniform code. Therefore operations like product(spA, B, C) or product(transposed(spA), B, C) are dispatched into the appropriated SparseBLAS calls, for example crsmm (present in MKL BLAS)

Sparse matrices are restricted to two dimensions and only certain operations are defined (e.g. only the first argument of product can be sparse). Subviews (partial sparse matrices) are very restricted at moment.