Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adds support for detached commits #3028

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

westonpace
Copy link
Contributor

@westonpace westonpace commented Oct 21, 2024

A detached commit is a commit that is not part of the regular dataset lineage. It will never show up as the latest commit and is completely separate from the linear history of the dataset.

This can be useful for:

  • Making temporary changes to a dataset, like adding a column of values while computing an index without ever truly materializing the column.
  • Situations where the dataset's lineage is managed remotely and all commits are detached commits without any linear history (blob storage will use this)

Closes #2889

@github-actions github-actions bot added enhancement New feature or request python labels Oct 21, 2024
@westonpace
Copy link
Contributor Author

My original approach was much more complicated and used UUIDs as the version. However, if we keep the version as a u64 but borrow the most significant bit to flag detached vs. normal then we end up with much fewer changes and less overall complexity.

@codecov-commenter
Copy link

codecov-commenter commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 30.97826% with 127 lines in your changes missing coverage. Please review.

Project coverage is 78.23%. Comparing base (f9024ce) to head (bb46d90).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/io/commit.rs 2.98% 65 Missing ⚠️
rust/lance/src/dataset.rs 44.82% 44 Missing and 4 partials ⚠️
rust/lance-table/src/io/commit.rs 48.14% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3028      +/-   ##
==========================================
+ Coverage   78.19%   78.23%   +0.03%     
==========================================
  Files         239      240       +1     
  Lines       76782    77227     +445     
  Branches    76782    77227     +445     
==========================================
+ Hits        60043    60419     +376     
- Misses      13669    13705      +36     
- Partials     3070     3103      +33     
Flag Coverage Δ
unittests 78.23% <30.97%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable. Is there any special considerations we need to make for clean up?

Comment on lines +80 to +83
// Detached versions should never show up first in a list operation which
// means it needs to come lexicographically after all attached manifest
// files and so we add the prefix `d`. There is no need to invert the
// version number since detached versions are not part of the version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@westonpace
Copy link
Contributor Author

This seems reasonable. Is there any special considerations we need to make for clean up?

@wjones127

I'm planning on tackling cleanup later. My thinking is that cleanup will be something along the lines of:

cleanup_old_versions(
  since: ...,
  delete_unverified: bool,
  versions_to_retain: &[u64],
)

For the "detached versions are just temporary versions" case then versions_to_retain will be empty and the temporary versions will only be deleted if they are older than 7 days or delete_unverified (assuming the caller setting delete_unverified is confirming there are no ongoing temporary operations)

For the "this is a secondary database and everything is detached" case then cleanup will be triggered by a cleanup of the primary database. After the cleanup of the primary database we will scan all remaining versions (in the primary database) and collect which secondary versions are still referenced. These will be passed in as versions_to_retain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for a "staged commit"
3 participants