BuildXL needs to support non-deterministic build systems with weakly specified dependencies. In order to do this, BuildXL has a two-phase caching algorithm. This page walks through the behavior of this algorithm.
For more information on the technical implementation of this design, see this documentation of the BuildXL caching system implementation.
For precision, here are the terms used in describing the design.
First, the terms that apply to fully deterministic BuildXL graphs with fully specified dependencies:
- Pip: a single build step (for purposes of this discussion, a single process execution)
- Declared input / output files: files statically declared at BuildXL graph construction time as being inputs or outputs of the pip
- Fingerprint: the hashes of the command line, declared input file paths and contents, and declared output file paths for a pip
In a fully deterministic build with fully specified dependencies, the fingerprint of each pip can be computed entirely from the statically known information about the pip. So in this setting, there is only one kind of fingerprint, which can ignore everything the pip actually did -- since we know upfront everything the pip will do.
To work with legacy build systems, BuildXL introduces several new concepts.
- Declared input / output directories: Pips now can specify directories that they write to (and read from), without precisely specifying what files they will actually write (or read).
- Shared opaque directories: Pips can write to output directories shared with other pips, such that the directory contents are dynamically populated, and not known at build graph construction time.
- Observed input / output file accesses: When writing to opaque directories, BuildXL must now track exactly which files were read or written. These sets of accesses are recorded by pathsets.
- Pathsets: Pathsets record sets of observed file accesses (reads, probes, directory enumerations) made by a specific pip.
- Weak fingerprint: a fingerprint over only the statically known inputs and outputs of the pip.
- Strong fingerprint: a combination of the hash of the pip's weak fingerprint and the hashes of the files in the pathset that pip observed when it executed.
We now have the terminology to describe the two-phase cache lookup algorithm itself.
Consider the following running example of a (hypothetical) C++ project in a legacy source tree we want to run under BuildXL.
Here's `burger.mk#1` (`burger.mk#1` denotes version 1 of `burger.mk`):

```
INPUTS =
    bread.cpp
    patty.cpp
    sauce.cpp
INCLUDEDIRS =
    /proteins
OUTPUT =
    /dinner/burger.exe
```
Here is the directory tree when we execute `burger.mk`'s pip:
- burger/
  - burger.mk#1
  - bread.cpp#1
  - patty.cpp#1
  - sauce.cpp#1
- proteins/
  - grnd_beef.h#1
  - tofu.h#1
When constructing the build graph, we examine `burger.mk` and learn what we can about the files it reads and writes. We use this information to construct the weak fingerprint for `burger.mk`'s pip.
The weak fingerprint hashes the statically known input/output information for the pip. In this case, it consists of this state, where hashes of files are always hashes of specific file versions:
WF1:
- Hash(burger/bread.cpp#1)
- Hash(burger/patty.cpp#1)
- Hash(burger/sauce.cpp#1)
- Hash("CommandLine:cl.exe /option1 /option2")
- Hash("InputDirectory:/proteins")
- Hash("Output:/dinner/burger.exe")
`WF1` is the label we'll use for this weak fingerprint.
This is called the weak fingerprint because it is only a partial specification of the pip. In particular, it does not specify all the files that might get read when making the burger.
All of this content is hashed to produce the weak fingerprint hash for the pip. The weak fingerprint hash changes only when the statically known inputs/outputs of the pip change.
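As a sketch, a weak fingerprint like `WF1` can be thought of as one hash folded over exactly this statically known state. The helper below is illustrative only -- the function name, argument shapes, and use of SHA-256 are assumptions, not BuildXL's actual fingerprinting code:

```python
import hashlib

def weak_fingerprint(input_file_hashes, command_line, input_dirs, output_paths):
    """Fold only statically known pip state into one hash: declared input
    file content hashes, the command line, input directory paths (paths
    only, not contents), and declared output paths. Illustrative sketch."""
    m = hashlib.sha256()
    for part in (*input_file_hashes,
                 "CommandLine:" + command_line,
                 *("InputDirectory:" + d for d in input_dirs),
                 *("Output:" + o for o in output_paths)):
        m.update(part.encode())
        m.update(b"\0")  # separator so adjacent parts can't run together
    return m.hexdigest()

# WF1: the statically known state of burger.mk#1's pip.
wf1 = weak_fingerprint(
    ["hash(bread.cpp#1)", "hash(patty.cpp#1)", "hash(sauce.cpp#1)"],
    "cl.exe /option1 /option2",
    ["/proteins"],
    ["/dinner/burger.exe"])
```

Note that an input directory contributes only its path to the weak fingerprint; the files inside it that the pip actually reads show up later, via the pathset.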
Suppose that `patty.cpp` contains the following (and suppose that no other .cpp includes anything):

```cpp
#include <grnd_beef.h>
```
When `burger.mk` is executed, `patty.cpp` will be read by the C++ compiler, and `proteins/grnd_beef.h` will be dynamically read. This then becomes a pathset recorded for this pip, indexed by the pip's weak fingerprint:
PS1:
- Hit(proteins/grnd_beef.h)
This is the additional information needed to make a more precise -- i.e. stronger -- fingerprint of this pip. Note that pathsets list only pathname hits (or misses, as we'll see shortly), not file hashes or even file versions.
The strong fingerprint for this execution of the `burger.mk` pip consists of both the hash of the weak fingerprint and the hashes of all file versions read in the associated pathset:
SF1:
- Hash(WF1)
- Hash(proteins/grnd_beef.h#1)
This fully represents the actual, concrete inputs to the executed pip. Hence, this information is usable to perform cache lookups.
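A strong fingerprint such as `SF1` can be sketched the same way: hash the weak fingerprint's hash together with the content hashes of the files the pathset observed. Again, the helper and hash choice are illustrative assumptions, not BuildXL's implementation:

```python
import hashlib

def strong_fingerprint(weak_fp_hash, pathset_file_hashes):
    """Hash(WF) combined with the hashes of the concrete file versions
    read via the pathset -- the fully concrete inputs of the executed pip."""
    m = hashlib.sha256()
    m.update(weak_fp_hash.encode())
    for file_hash in pathset_file_hashes:
        m.update(file_hash.encode())
    return m.hexdigest()

# SF1 changes when grnd_beef.h's content changes, even though WF1 doesn't.
sf1 = strong_fingerprint("hash(WF1)", ["hash(proteins/grnd_beef.h#1)"])
```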
So, for this first-ever compilation of `burger.mk`, BuildXL does the following:

- Process the build graph to compute `WF1`.
- Look up all known pathsets for `WF1`.
  - Determine that there are none.
- Execute the pip, recording `PS1`.
- Compute a new strong fingerprint for the pip, `SF1`.
- Store the pip's computed output, `dinner/burger.exe`, as an output of `SF1`.
If we go to re-execute the build, BuildXL will do the following:

- Process the build graph to compute `WF1`. (This is always deterministic, so it gets the same answer on the same input.)
- Look up all known pathsets for `WF1`.
  - Find `PS1`.
- Determine whether that pathset matches the current known inputs (e.g. does `proteins/grnd_beef.h` exist in the current build?).
  - In this case, it still exists, so the pathset is a match.
- Look up strong fingerprints based on the hashes of `WF1` and the matching pathset `PS1`.
  - Match `SF1`, because `proteins/grnd_beef.h#1` is still the version of that header file in this build.
- Look up `dinner/burger.exe` as an output of the cached `SF1` pip.
This is the essence of the two-phase cache lookup:
- Compute weak fingerprint from static information.
- Find pathsets associated with that weak fingerprint, which match the payload of the current build.
- Find strong fingerprints that match the weak fingerprint and the file versions in the matching pathsets.
- Perform cache lookups for actual cached outputs, based on the strong fingerprints.
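The four steps above can be sketched end to end with a toy in-memory cache and filesystem. Every name here is hypothetical (BuildXL's actual data structures and hashing differ); the point is only to show the two phases working together:

```python
import hashlib

def h(*parts):
    """Stable hash over ordered string parts (illustrative only)."""
    m = hashlib.sha256()
    for p in parts:
        m.update(p.encode())
        m.update(b"\0")
    return m.hexdigest()

# Toy filesystem: path -> contents. An absent key means the path doesn't exist.
fs = {"burger/patty.cpp": "#include <grnd_beef.h>",
      "proteins/grnd_beef.h": "beef v1"}

pathsets = {}  # weak fingerprint -> list of recorded pathsets
cached = {}    # strong fingerprint -> cached outputs

def pathset_matches(pathset):
    """Every Hit path must still exist; every Miss path must still be absent."""
    return all((path in fs) == was_hit for path, was_hit in pathset)

def strong_fp(wf, pathset):
    """Hash of the weak fingerprint plus hashes of every file hit in the pathset."""
    return h(wf, *(h(fs[path]) for path, was_hit in pathset if was_hit))

def lookup_or_run(wf, run_pip):
    # Phase 1: find a recorded pathset under wf that matches the current disk.
    for ps in pathsets.get(wf, []):
        if pathset_matches(ps):
            # Phase 2: look up outputs under the resulting strong fingerprint.
            sf = strong_fp(wf, ps)
            if sf in cached:
                return cached[sf], "hit"
    # Cache miss: execute the pip, record its observed pathset, publish outputs.
    out, observed = run_pip()
    sf = strong_fp(wf, observed)
    if observed not in pathsets.setdefault(wf, []):
        pathsets[wf].append(observed)
    cached[sf] = out
    return out, "miss"
```

Running the burger example through this sketch: the first build misses and populates the cache, the second hits, and editing `grnd_beef.h` forces a miss again even though the weak fingerprint is unchanged.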
It's now useful to consider how this design handles various other scenarios:

- What if a new header file directory is added?
- What if a conflicting copy of `grnd_beef.h` gets added in that new directory?
- What if the contents of `grnd_beef.h` are modified to include yet another file?

The design handles these cases, as we'll see.
Suppose we add a new header directory `/organic` in `burger.mk#2`:

```
INPUTS =
    bread.cpp
    patty.cpp
    sauce.cpp
INCLUDEDIRS =
    /organic
    /proteins
OUTPUT =
    /dinner/burger.exe
```
We prefer organic ingredients so the new "organic" directory comes first. And we add it in the source tree:
- burger/
  - burger.mk#2
  - bread.cpp#1
  - patty.cpp#1
  - sauce.cpp#1
- organic/
  - tofu.h#1
- proteins/
  - grnd_beef.h#1
  - tofu.h#1
We don't have any organic ground beef yet.
This results in a new weak fingerprint, since there's a new input directory in `burger.mk#2`:
WF2:
- Hash(burger/bread.cpp#1)
- Hash(burger/patty.cpp#1)
- Hash(burger/sauce.cpp#1)
- Hash("CommandLine:cl.exe /option1 /option2")
- Hash("InputDirectory:/organic")
- Hash("InputDirectory:/proteins")
- Hash("Output:/dinner/burger.exe")
When building, the C++ compiler will probe the `organic` directory looking for `grnd_beef.h` and will not find it. The pathset computed in this case will be:
PS2:
- Miss(organic/grnd_beef.h)
- Hit(proteins/grnd_beef.h)
The "miss" entry records that no file existed at that attempted path lookup.
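A pathset with a miss entry only matches as long as that path is still absent. A sketch of that check, assuming a hypothetical representation where each entry is a `(path, was_hit)` pair:

```python
def pathset_matches(pathset, fs):
    """A Hit path must still exist and a Miss path must still be absent
    for a recorded pathset to match the current build (fs: path -> contents)."""
    return all((path in fs) == was_hit for path, was_hit in pathset)

# PS2 as recorded: organic/grnd_beef.h was probed and missed,
# proteins/grnd_beef.h was found.
ps2 = [("organic/grnd_beef.h", False),
       ("proteins/grnd_beef.h", True)]
```

As soon as a file appears at `organic/grnd_beef.h`, `PS2` stops matching -- which is exactly what forces re-execution once organic ground beef shows up in the source tree.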
Now when BuildXL builds this pip, the following occurs:

- Since there is a new include directory, BuildXL computes `WF2`.
- No pathsets exist for this weak fingerprint, so BuildXL executes the pip and computes `PS2`.
- BuildXL computes `SF2` from `WF2` and `PS2` and associates the output with it:
SF2:
- Hash(WF2)
- Hash(proteins/grnd_beef.h#1)
We get some organic ground beef in stock, and the source tree becomes:
- burger/
  - burger.mk#2
  - bread.cpp#1
  - patty.cpp#1
  - sauce.cpp#1
- organic/
  - grnd_beef.h#1
  - tofu.h#1
- proteins/
  - grnd_beef.h#1
  - tofu.h#1
Now when we next do a build, BuildXL does the following:
- Construct the weak fingerprint for the pip; since `burger.mk` didn't change, this yields `WF2` again.
- Look for pathsets matching `WF2`.
  - Find `PS2`.
  - Determine that `PS2` is not a match, since `organic/grnd_beef.h` is no longer missing.
- Execute `burger.mk`, determining pathset `PS3`:
PS3:
- Hit(organic/grnd_beef.h)
- Hit(proteins/grnd_beef.h)
(The C++ compiler sometimes probes for more files than it actually reads when compiling, so in this case it will hit `proteins/grnd_beef.h` even though it's not actually included by the compilation.)
- Compute strong fingerprint `SF3` based on `WF2` and `PS3`, and associate the resulting `burger.exe` with it:
SF3:
- Hash(WF2)
- Hash(organic/grnd_beef.h#1)
- Hash(proteins/grnd_beef.h#1)
In general, changes to makefiles and static build information result in new weak fingerprints, and changes to dynamic file accesses result in new pathsets. Since both the weak fingerprint and a pathset have to match before a strong fingerprint can match, both static and dynamic changes get tracked through the cache.
Finally, suppose the developer modifies `organic/grnd_beef.h` to include another file:

```cpp
#include "beef.h"
```
And this gets added in the source tree:
- burger/
  - burger.mk#2
  - bread.cpp#1
  - patty.cpp#1
  - sauce.cpp#1
- organic/
  - beef.h#1
  - grnd_beef.h#2
  - tofu.h#1
- proteins/
  - grnd_beef.h#1
  - tofu.h#1
When we compile, BuildXL will now do the following:
- Compute weak fingerprint from `burger.mk#2`, yielding `WF2` (since we still didn't change the makefile).
- Look up pathsets associated with `WF2`.
  - Find pathsets `PS2` and `PS3`.
  - `PS2` is not a match because `organic/grnd_beef.h` now exists.
  - `PS3`, however, is a match.
- Look up strong fingerprints associated with `WF2` and `PS3`.
  - Find `SF3`.
  - Determine that `SF3` is not a match, because `organic/grnd_beef.h` changed (from #1 to #2).
- Ultimately re-execute the pip because of a cache miss.
- Compute a new path set:
PS4:
- Hit(organic/grnd_beef.h)
- Hit(organic/beef.h)
- Hit(proteins/grnd_beef.h)
- Compute a new strong fingerprint:
SF4:
- Hash(WF2)
- Hash(organic/grnd_beef.h#2)
- Hash(organic/beef.h#1)
- Hash(proteins/grnd_beef.h#1)
- Associate `dinner/burger.exe` with `SF4`.
This algorithm works well to handle the multiple sources of potential change:
- Changes to the statically known build graph cause weak fingerprint misses.
- Changes to patterns of dynamic file access cause pathset lookup misses.
- Changes to dynamically accessed file contents cause strong fingerprint misses.
In production, the primary risk is pathset lookup: it is possible for a weak fingerprint to be too weak. If a large number of pips have only a small amount of statically known input and tend to read many of the same files at runtime, the result can be pathset explosion: the number of potentially matching pathsets grows to the point that it degrades performance.
BuildXL implements a heuristic to deal with the case where a weak fingerprint is too weak and produces a large number of unique pathsets. In scenarios where the weak fingerprint is known to be not strong enough, this behavior is enabled by default; this includes the whole family of JavaScript resolvers, as well as the CMake/Ninja and MSBuild resolvers. For DScript, the behavior can be enabled at the pip level by setting `enforceWeakFingerprintAugmentation: true` as part of the transformer execute arguments.
A high-level description of the heuristic follows:
- Let's say BuildXL performs a cache lookup with a given weak fingerprint `fp`. Assume the lookup turns out to be a miss and the pip is executed. If the lookup involved more than `n` downloaded unique path sets, BuildXL won't push the result of the execution to the cache under the given `fp`; it will use an augmented weak fingerprint `aug_fp` for it instead.
- How is this augmented weak fingerprint computed? BuildXL extracts common paths from all the downloaded path sets and constructs from them a new *augmented* path set. A strong fingerprint is then computed based on that augmented path set, and the result is used as the new augmented weak fingerprint `aug_fp`. The result of executing the pip is stored under the augmented fingerprint, not the original one, effectively branching the candidates that would have been stored under the original weak fingerprint off under the augmented one.
- When an augmented path set is produced, BuildXL also pushes a special entry to the cache that represents the `fp -> augmented path set` mapping. The next time a cache lookup against `fp` happens, this special entry is retrieved as well. That makes BuildXL query the cache again, but now for all candidates with weak fingerprint `aug_fp = strong_fingerprint(augmented path set)`.
Observe that since the augmented weak fingerprint is locally computed from an augmented path set, the result will depend on the local state of the disk, and therefore only the candidates pushed under this particular weak fingerprint will be retrieved. This effectively reduces the number of candidates that are retrieved overall compared with not using the augmented fingerprint heuristic.
Two knobs that affect this heuristic can be configured:

- `/pathSetThreshold`: The maximum number of visited path sets allowed before switching to an 'augmented' weak fingerprint computed from common dynamically accessed paths (`n` in the description above). Default is `5`.
- `/augmentingPathSetCommonalityFactor`: Used to compute the number of times a path must appear among the observed path sets in order to be included in the common path set. The value must be in (0, 1]. Default is `0.4`.
This means that, by default, after downloading `5` different path sets, an augmented one will be computed. If a path is present in at least `2` (`5 * 0.4`) of the `5` different path sets, it will be included in the augmented one.