files cache improvements #8389

Merged
45 changes: 12 additions & 33 deletions docs/faq.rst
@@ -837,50 +837,29 @@ already used.
By default, ctime (change time) is used for the timestamps to have a rather
safe change detection (see also the --files-cache option).

Furthermore, pathnames recorded in files cache are always absolute, even if you
specify source directories with relative pathname. If relative pathnames are
stable, but absolute are not (for example if you mount a filesystem without
stable mount points for each backup or if you are running the backup from a
filesystem snapshot whose name is not stable), borg will assume that files are
different and will report them as 'added', even though no new chunks will be
actually recorded for them. To avoid this, you could bind mount your source
directory in a directory with the stable path.
Furthermore, pathnames used as keys into the files cache are **as archived**,
so make sure these are always the same (see ``borg list``).

.. _always_chunking:

It always chunks all my files, even unchanged ones!
---------------------------------------------------

Borg maintains a files cache where it remembers the timestamp, size and
Borg maintains a files cache where it remembers the timestamps, size and
inode of files. When Borg does a new backup and starts processing a
file, it first looks whether the file has changed (compared to the values
stored in the files cache). If the values are the same, the file is assumed
unchanged and thus its contents won't get chunked (again).
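
As a minimal sketch of that check (hypothetical helper with illustrative
field names, not borg's actual code), assuming the cache entry stores the
inode number, size and ctime seen at the previous backup::

    import os

    def file_unchanged(cache_entry, path):
        # compare cached stat values against a fresh lstat() of the file
        st = os.stat(path, follow_symlinks=False)
        return (
            cache_entry["inode"] == st.st_ino
            and cache_entry["size"] == st.st_size
            and cache_entry["ctime_ns"] == st.st_ctime_ns
        )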

Borg can't keep an infinite history of files of course, thus entries
in the files cache have a "maximum time to live" which is set via the
environment variable BORG_FILES_CACHE_TTL (and defaults to 20).
Every time you do a backup (on the same machine, using the same user), the
cache entries' ttl values of files that were not "seen" are incremented by 1
and if they reach BORG_FILES_CACHE_TTL, the entry is removed from the cache.

So, for example, if you do daily backups of 26 different data sets A, B,
C, ..., Z on one machine (using the default TTL), the files from A will be
already forgotten when you repeat the same backups on the next day and it
will be slow because it would chunk all the files each time. If you set
BORG_FILES_CACHE_TTL to at least 26 (or maybe even a small multiple of that),
it would be much faster.
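
As a sketch of the aging mechanism described above (hypothetical helper, not
borg's code; the TTL default used here is only for illustration)::

    import os

    FILES_CACHE_TTL = int(os.environ.get("BORG_FILES_CACHE_TTL", 2))

    def age_files_cache(files_cache, seen_keys):
        # entries "seen" in this backup are reset to age 0; all others age
        # by one and are evicted once they reach the TTL
        for key in list(files_cache):
            if key in seen_keys:
                files_cache[key]["age"] = 0
            else:
                files_cache[key]["age"] += 1
                if files_cache[key]["age"] >= FILES_CACHE_TTL:
                    del files_cache[key]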

Besides using a higher BORG_FILES_CACHE_TTL (which also increases memory usage),
there is also BORG_FILES_CACHE_SUFFIX which can be used to have separate (smaller)
files caches for each backup set instead of the default one (big) unified files cache.

Another possible reason is that files don't always have the same path, for
example if you mount a filesystem without stable mount points for each backup
or if you are running the backup from a filesystem snapshot whose name is not
stable. If the directory where you mount a filesystem is different every time,
Borg assumes they are different files. This is true even if you back up these
files with relative pathnames - borg uses full pathnames in files cache regardless.
The files cache is stored separately (using a different filename suffix) per
archive series, thus always using the same name for the archive is strongly
recommended. The "rebuild files cache from previous archive in repo" feature
also depends on that.
Alternatively, there is also BORG_FILES_CACHE_SUFFIX which can be used to
manually set a custom suffix (if you can't just use the same archive name).
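
As an illustration of the idea only (borg's actual suffix derivation may
differ), the cache filename could be picked roughly like this: an explicit
``BORG_FILES_CACHE_SUFFIX`` wins, otherwise a short hash of the archive
(series) name is used::

    import hashlib
    import os

    def files_cache_name(archive_name):
        suffix = os.environ.get("BORG_FILES_CACHE_SUFFIX")
        if not suffix:
            # derive a stable suffix from the archive (series) name
            suffix = hashlib.sha256(archive_name.encode("utf-8")).hexdigest()[:16]
        return "files." + suffix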

Another possible reason is that files don't always have the same path -
borg keys the files cache by the path as it appears in the archive (see ``borg list``).

It is possible for some filesystems, such as ``mergerfs`` or network filesystems,
to return inconsistent inode numbers across runs, causing borg to consider them changed.
14 changes: 8 additions & 6 deletions docs/internals/data-structures.rst
@@ -474,18 +474,20 @@ guess what files you have based on a specific set of chunk sizes).
The cache
---------

The **files cache** is stored in ``cache/files`` and is used at backup time to
quickly determine whether a given file is unchanged and we have all its chunks.
The **files cache** is stored in ``cache/files.<SUFFIX>`` and is used at backup
time to quickly determine whether a given file is unchanged and we have all its
chunks.

In memory, the files cache is a key -> value mapping (a Python *dict*) and contains:

* key: id_hash of the encoded, absolute file path
* key: id_hash of the encoded path (same path as seen in archive)
* value:

- age (0 [newest], ..., BORG_FILES_CACHE_TTL - 1)
- file inode number
- file size
- file ctime_ns (or mtime_ns)
- age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
- file ctime_ns
- file mtime_ns
- list of chunk (id, size) tuples representing the file's contents
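
As a sketch (illustrative names, not borg's actual classes), one entry could
be modelled like this::

    from typing import List, NamedTuple, Tuple

    class FilesCacheEntry(NamedTuple):
        age: int                         # 0 (newest) .. BORG_FILES_CACHE_TTL - 1
        inode: int                       # st_ino
        size: int                        # st_size
        ctime_ns: int                    # st_ctime_ns
        mtime_ns: int                    # st_mtime_ns
        chunks: List[Tuple[bytes, int]]  # (chunk id, size) per chunk

    # key: id_hash of the encoded path as stored in the archive item
    files_cache: dict = {}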

To determine whether a file has not changed, cached values are looked up via
@@ -514,7 +516,7 @@ be told to ignore the inode number in the check via --files-cache.
The age value is used for cache management. If a file is "seen" in a backup
run, its age is reset to 0, otherwise its age is incremented by one.
If a file was not seen in BORG_FILES_CACHE_TTL backups, its cache entry is
removed. See also: :ref:`always_chunking` and :ref:`a_status_oddity`
removed.

The files cache is a python dictionary, storing python objects, which
generates a lot of overhead.
12 changes: 1 addition & 11 deletions docs/usage/general/environment.rst.inc
@@ -66,8 +66,7 @@ General:
cache entries for backup sources other than the current sources.
BORG_FILES_CACHE_TTL
When set to a numeric value, this determines the maximum "time to live" for the files cache
entries (default: 20). The files cache is used to determine quickly whether a file is unchanged.
The FAQ explains this more detailed in: :ref:`always_chunking`
entries (default: 2). The files cache is used to determine quickly whether a file is unchanged.
BORG_USE_CHUNKS_ARCHIVE
When set to no (default: yes), the ``chunks.archive.d`` folder will not be used. This reduces
disk space usage but slows down cache resyncs.
@@ -85,15 +84,6 @@ General:
- ``pyfuse3``: only try to load pyfuse3
- ``llfuse``: only try to load llfuse
- ``none``: do not try to load an implementation
BORG_CACHE_IMPL
Choose the implementation for the clientside cache, choose one of:

- ``adhoc``: builds a non-persistent chunks cache by querying the repo. Chunks cache contents
are somewhat sloppy for already existing chunks, concerning their refcount ("infinite") and
size (0). No files cache (slow, will chunk all input files). DEPRECATED.
- ``adhocwithfiles``: Like ``adhoc``, but with a persistent files cache. Default implementation.
- ``cli``: Determine the cache implementation from cli options. Without special options, will
usually end up with the ``local`` implementation.
BORG_SELFTEST
This can be used to influence borg's builtin self-tests. The default is to execute the tests
at the beginning of each borg command invocation.
23 changes: 18 additions & 5 deletions src/borg/archive.py
@@ -1345,7 +1345,7 @@
item.chunks.append(chunk_entry)
else: # normal case, no "2nd+" hardlink
if not is_special_file:
hashed_path = safe_encode(os.path.join(self.cwd, path))
hashed_path = safe_encode(item.path) # path as in archive item!
started_hashing = time.monotonic()
path_hash = self.key.id_hash(hashed_path)
self.stats.hashing_time += time.monotonic() - started_hashing
@@ -1376,6 +1376,7 @@
# Only chunkify the file if needed
changed_while_backup = False
if "chunks" not in item:
start_reading = time.time_ns()
with backup_io("read"):
self.process_file_chunks(
item,
@@ -1385,13 +1386,25 @@
backup_io_iter(self.chunker.chunkify(None, fd)),
)
self.stats.chunking_time = self.chunker.chunking_time
end_reading = time.time_ns()
if not is_win32: # TODO for win32
with backup_io("fstat2"):
st2 = os.fstat(fd)
# special files:
# - fifos change naturally, because they are fed from the other side. no problem.
# - blk/chr devices don't change ctime anyway.
changed_while_backup = not is_special_file and st.st_ctime_ns != st2.st_ctime_ns
if is_special_file:
# special files:
# - fifos change naturally, because they are fed from the other side. no problem.
# - blk/chr devices don't change ctime anyway.
pass
elif st.st_ctime_ns != st2.st_ctime_ns:
# ctime was changed, this is either a metadata or a data change.
changed_while_backup = True

# Codecov: added line src/borg/archive.py#L1400 was not covered by tests
elif start_reading - TIME_DIFFERS1_NS < st2.st_ctime_ns < end_reading + TIME_DIFFERS1_NS:
# this is to treat a very special race condition, see #3536.
# - file was changed right before st.ctime was determined.
# - then, shortly afterwards, but already while we read the file, the
# file was changed again, but st2.ctime is the same due to ctime granularity.
# when comparing file ctime to local clock, widen interval by TIME_DIFFERS1_NS.
changed_while_backup = True

# Codecov: added line src/borg/archive.py#L1407 was not covered by tests
if changed_while_backup:
# regular file changed while we backed it up, might be inconsistent/corrupt!
if last_try:
1 change: 1 addition & 0 deletions src/borg/archiver/__init__.py
@@ -125,6 +125,7 @@ class Archiver(
def __init__(self, lock_wait=None, prog=None):
self.lock_wait = lock_wait
self.prog = prog
self.start_backup = None

def print_warning(self, msg, *args, **kw):
warning_code = kw.get("wc", EXIT_WARNING) # note: wc=None can be used to not influence exit code
7 changes: 3 additions & 4 deletions src/borg/archiver/_common.py
@@ -161,14 +161,14 @@ def wrapper(self, args, **kwargs):
if "compression" in args:
manifest_.repo_objs.compressor = args.compression.compressor
if secure:
assert_secure(repository, manifest_, self.lock_wait)
assert_secure(repository, manifest_)
if cache:
with Cache(
repository,
manifest_,
progress=getattr(args, "progress", False),
lock_wait=self.lock_wait,
cache_mode=getattr(args, "files_cache_mode", FILES_CACHE_MODE_DISABLED),
start_backup=getattr(self, "start_backup", None),
iec=getattr(args, "iec", False),
) as cache_:
return method(self, args, repository=repository, cache=cache_, **kwargs)
@@ -230,15 +230,14 @@ def wrapper(self, args, **kwargs):
manifest_ = Manifest.load(
repository, compatibility, ro_cls=RepoObj if repository.version > 1 else RepoObj1
)
assert_secure(repository, manifest_, self.lock_wait)
assert_secure(repository, manifest_)
if manifest:
kwargs["other_manifest"] = manifest_
if cache:
with Cache(
repository,
manifest_,
progress=False,
lock_wait=self.lock_wait,
cache_mode=getattr(args, "files_cache_mode", FILES_CACHE_MODE_DISABLED),
iec=getattr(args, "iec", False),
) as cache_:
10 changes: 2 additions & 8 deletions src/borg/archiver/create_cmd.py
@@ -214,6 +214,7 @@ def create_inner(archive, cache, fso):
self.noxattrs = args.noxattrs
self.exclude_nodump = args.exclude_nodump
dry_run = args.dry_run
self.start_backup = time.time_ns()
t0 = archive_ts_now()
t0_monotonic = time.monotonic()
logger.info('Creating archive at "%s"' % args.location.processed)
@@ -222,10 +223,9 @@
repository,
manifest,
progress=args.progress,
lock_wait=self.lock_wait,
prefer_adhoc_cache=args.prefer_adhoc_cache,
cache_mode=args.files_cache_mode,
iec=args.iec,
archive_name=args.name,
) as cache:
archive = Archive(
manifest,
@@ -787,12 +787,6 @@ def build_parser_create(self, subparsers, common_parser, mid_common_parser):
help="only display items with the given status characters (see description)",
)
subparser.add_argument("--json", action="store_true", help="output stats as JSON. Implies ``--stats``.")
subparser.add_argument(
"--prefer-adhoc-cache",
dest="prefer_adhoc_cache",
action="store_true",
help="experimental: prefer AdHocCache (w/o files cache) over AdHocWithFilesCache (with files cache).",
)
subparser.add_argument(
"--stdin-name",
metavar="NAME",
2 changes: 1 addition & 1 deletion src/borg/archiver/list_cmd.py
@@ -37,7 +37,7 @@ def _list_inner(cache):

# Only load the cache if it will be used
if ItemFormatter.format_needs_cache(format):
with Cache(repository, manifest, lock_wait=self.lock_wait) as cache:
with Cache(repository, manifest) as cache:
_list_inner(cache)
else:
_list_inner(cache=None)
2 changes: 1 addition & 1 deletion src/borg/archiver/prune_cmd.py
@@ -111,7 +111,7 @@ def do_prune(self, args, repository, manifest):
keep += prune_split(archives, rule, num, kept_because)

to_delete = set(archives) - set(keep)
with Cache(repository, manifest, lock_wait=self.lock_wait, iec=args.iec) as cache:
with Cache(repository, manifest, iec=args.iec) as cache:
list_logger = logging.getLogger("borg.output.list")
# set up counters for the progress display
to_delete_len = len(to_delete)