Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various improvements to the GCS VFS. #4008

Closed
wants to merge 22 commits into from

Conversation

teo-tsirpanis
Copy link
Member

See each commit message for more details. This PR includes a fix for SC-26982


TYPE: IMPROVEMENT
DESC: Various improvements to the Google Cloud Storage VFS

@shortcut-integration
Copy link


return Status::Ok();
return {std::move(bucket_name), std::move(object_path)};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, does RVO kick in here or are these moves necessary?

@teo-tsirpanis teo-tsirpanis marked this pull request as draft April 4, 2023 19:03
@teo-tsirpanis
Copy link
Member Author

CI fails because because of composing failures. Will investigate.

@teo-tsirpanis teo-tsirpanis marked this pull request as ready for review April 19, 2023 10:46
@Shelnutt2
Copy link
Member

Are we sure we want Remove polling for bucket/object propagation/deletion.? For multi-part uploads we've learned with GCS that they don't guarantee read-after-write, as such I think we need the wait for object to propagate?

@teo-tsirpanis
Copy link
Member Author

Are you referring to this sentence of the docs? We are not using this API, but good question, it its not explicitly mentioned if object composition is strongly consistent. I submitted feedback on the docs.

@eric-hughes-tiledb
Copy link
Contributor

eric-hughes-tiledb commented Apr 19, 2023

a fix for SC-26982

That issue does not report a defect. What's there to fix? A putative performance improvement is not a defect.

The GCS docs are incomplete and ambiguous about the relationship between multipart uploads and operations that are strongly consistent. The documentation about strong consistency refers only to a subset of operations, and it does not address issues of mutual consistency between these classes of operations. Therefore, we must assume that multipart uploads and (say) listing are not mutual consistent. In absence of a clear and definitive statement that states the relationship, we must not assume that the system has the properties we might wish it to have.

The mere existence of CompleteMultipartUpload as a GCS request is a strong signal that multipart uploads do not have the strong consistency required. So removing polling without replacing it with a different mechanism for determining completion is a Very Bad Idea. It's going to lead to occasional loss of data. Integrity of data is paramount. We may not jeopardize it.

Recommend closing this PR and not returning to until we can measure the performance improvement that might be gained for what looks to be a rather large risk.

Copy link
Member

@ihnorton ihnorton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed: we need to wait on this for 2.16. Please also put back the ports files.

@ihnorton
Copy link
Member

The mere existence of CompleteMultipartUpload as a GCS request is a strong signal that multipart uploads do not have the strong consistency required.

CompleteMultipartUpload is necessary because the multipart API does not require pre-specifying the total size of the final assembled object (it allows appending up to 10,000 parts before finalizing). See: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html

@ihnorton
Copy link
Member

ihnorton commented Apr 19, 2023

Let's keep the changes separate, so we can ship the first two:

  1. update GCS SDK version in Update the GCS SDK to its latest version. #4031 after 2.16
    1a. or 2. use of the new InsertObject signature (can be part of 1) after bumping to GCS 2.9

...
3. Possibly remove polling as in this PR, if we can fully substantiate a safe performance improvement

@teo-tsirpanis
Copy link
Member Author

multipart uploads do not have the strong consistency required.

@eric-hughes-tiledb we are not using the S3-compatible multi-part uploads in GCS. In fact they are not even available in the C++ SDK. What we use is the Compose JSON API, which is officially confirmed to be strongly consistent.

@teo-tsirpanis
Copy link
Member Author

Marking as draft again, let's do #4031 first.

@teo-tsirpanis teo-tsirpanis marked this pull request as draft April 20, 2023 12:14
@teo-tsirpanis teo-tsirpanis marked this pull request as ready for review November 13, 2023 18:44
Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new algorithm for something as basic as a new VFS implementation needs better validation than is present in this PR.

This won't be ready for 2.19.

@KiterLuc
Copy link
Contributor

Closing this for now. The change is a good one but we don't have time to properly test this at the moment. We also have a tracking story for when we are ready to do this work.

@KiterLuc KiterLuc closed this Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants