
Creating better file content streaming RPC methods #799

Draft · wants to merge 8 commits into base: staging

Conversation

@aryanjassal (Contributor) commented Aug 30, 2024

Description

Currently, VaultsSecretsGet and VaultsSecretsEdit are UnaryHandlers, which places many restrictions on their functionality and especially impacts speed and performance. As such, I will be replacing them with a trimmed-down version of fileTree.serializerStreamFactory and fileTree.parserTransformStreamFactory (#785 (comment), #785 (comment)).

This PR will remove the two aforementioned RPC methods and introduce two new methods, VaultsSecretsGet and VaultsSecretsSet, which do the same thing but use RawHandlers instead (MatrixAI/Polykey-CLI#267 (comment)).

This is important for many UNIX commands like secrets edit, secrets cat, etc.
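As a rough sketch of the direction (the real RawHandler signature lives in js-rpc and should be checked there; the helper and `FileReader` type below are purely illustrative, and WHATWG stream globals from Node >= 18 are assumed), the server side would return a raw ReadableStream of file bytes rather than a single JSON message:

```ts
// Illustrative sketch only: streams the contents of the requested secrets, in
// order, as one raw binary stream. `FileReader` is a hypothetical stand-in for
// whatever vault read capability the handler actually receives.
type FileReader = (path: string) => AsyncIterable<Uint8Array>;

function fileContentStream(
  readFile: FileReader,
  paths: Array<string>,
): ReadableStream<Uint8Array> {
  async function* generate() {
    for (const path of paths) {
      // Each file's chunks are forwarded as-is; no per-call JSON envelope.
      yield* readFile(path);
    }
  }
  const iterator = generate();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const result = await iterator.next();
      if (result.done) controller.close();
      else controller.enqueue(result.value);
    },
    async cancel() {
      // Make sure the generator's cleanup runs if the client stops reading.
      await iterator.return(undefined);
    },
  });
}
```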

Issues Fixed

Tasks

  • 1. Trim functionality of parserTransformStreamFactory and serializerStreamFactory
  • 2. Update VaultsSecretsGet
  • 3. Update VaultsSecretsEdit to VaultsSecretsSet

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@aryanjassal self-assigned this Aug 30, 2024

linear bot commented Aug 30, 2024

@CMCDragonkai (Member) commented Aug 30, 2024 via email

@CMCDragonkai (Member) commented Aug 30, 2024 via email

@aryanjassal (Contributor, Author)

> If we aren't going to use the old methods anymore, then the API isn't guaranteed now anyway; you can just remove and replace. However, I believe we have a naming convention for streaming methods. You should check.

I asked @tegefaulkes and we settled on calling the methods VaultsSecretsGet and VaultsSecretsSet, as that is concise and makes the most sense.

If you don't like these names, an alternative naming scheme could be VaultsSecretsFetch and VaultsSecretsUpdate.

@aryanjassal (Contributor, Author)

The behaviour of serializerStreamFactory is to stream the contents of multiple files in the order of the files. This will send data as a stream to the client.

But how do we handle incoming data? Do we also expect the client to stream secret contents back to us in the same multi-file format? Or do we expect a single file only? Would it even make sense for the client to stream multiple files back? When would that even happen? The same question goes for streaming file contents to the client: when would we even need the functionality of streaming multiple files concurrently? AFAIK, no UNIX command we will be implementing would need that (globbing might require it, but we are not globbing anyway).

I just need some clarity on these topics @tegefaulkes.

@aryanjassal marked this pull request as draft September 2, 2024 06:43
@aryanjassal (Contributor, Author)

On that note, the CONTENT header can be simplified further by removing the iNode field. This is because the files returned will be in the same order that they were sent in. Thus, we don't need to keep track of which file is being returned and in what order.

Can't we also simplify it further by removing the type field, as it will always and only return CONTENT objects? Then, the header would be simplified to only the dataSize. At this point, can't we just make a terminator sequence and break/split the streams when we hit that sequence instead of dealing with serializing headers?
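As a sketch of that simplified framing (the field names `type`, `dataSize`, and `iNode` come from the discussion above; the layout itself is illustrative and not the current fileTree code), a length-prefixed header also avoids the escaping problem a terminator sequence would have, since arbitrary binary content could contain the terminator:

```ts
// Illustrative framing sketch: each file's content is preceded by a small JSON
// header carrying only `type` and `dataSize`.
type ContentHeader = { type: 'CONTENT'; dataSize: number };

function frameContent(data: Uint8Array): Uint8Array {
  const header: ContentHeader = { type: 'CONTENT', dataSize: data.byteLength };
  const headerBytes = new TextEncoder().encode(JSON.stringify(header) + '\n');
  // Header followed immediately by exactly `dataSize` bytes of content.
  const framed = new Uint8Array(headerBytes.byteLength + data.byteLength);
  framed.set(headerBytes, 0);
  framed.set(data, headerBytes.byteLength);
  return framed;
}
```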

@tegefaulkes (Contributor)

Even without globbing, we still need to send entire trees for the mv and cp commands. In those cases we need to know about directories and their contents, and we need to send multiple files at once.

@aryanjassal (Contributor, Author)

> Even without globbing, we still need to send entire trees for the mv and cp commands. In those cases we need to know about directories and their contents, and we need to send multiple files at once.

Now that I have removed the functionality of sending over file trees from the serializer and parser, how do we plan on doing this? Maybe sending a JSON object as a response containing an array of the file paths in the order of serialisation might work, but I'm not sure how I would handle doing that for the mv and cp commands.

> Can't we also simplify it further by removing the type field, as it will always and only return CONTENT objects? Then, the header would be simplified to only the dataSize. At this point, can't we just make a terminator sequence and break/split the streams when we hit that sequence instead of dealing with serializing headers?

We cannot do this, as we will rely on sending over TreeNodes about file data for future iterations of the secrets ls command. So I think we should retain the type field for the time being and only remove the iNode field.
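A hypothetical sketch of what keeping type buys us (the TreeNode shape here is a placeholder, not the real one defined in fileTree.ts):

```ts
// Placeholder TreeNode shape for illustration; the real definition lives in
// fileTree.ts.
type TreeNode = { path: string; isDirectory: boolean };

// Keeping `type` means the same stream can later carry TREE entries for
// `secrets ls` without changing the framing, while CONTENT entries only need
// their dataSize once iNode is dropped.
type SerializedHeader =
  | { type: 'CONTENT'; dataSize: number }
  | { type: 'TREE'; node: TreeNode };
```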

@CMCDragonkai (Member)

I haven't reviewed this in detail, but just some advice: remember not to make the API too complex, as we may expect other clients to integrate in the future. My style is to always start with simple functions and then combine them into more complex functionality using combinators, in the functional programming style. In the context of RPC we don't actually have procedure composition like in Cap'n Proto, so I always start with a simple f and then create complex variants like f' and f''. The suffix names are what indicate the variants. I want to maintain this conceptual integrity.

@CMCDragonkai (Member)


Which means, by the way, that the underlying JS functions get composed together for each handler, but the handlers themselves are rather more atomic and uncomposable (they can technically be composed on the client side, but it's not as efficient as composing the procedure on the server side, as it involves back-and-forth data transfer).
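A loose illustration of that style (all names here are hypothetical): the simple function does one thing, and the richer variant is composed from it on the server side, inside a single handler, rather than by chaining RPC calls from the client:

```ts
// Hypothetical sketch of the composition style described above.
type ReadSecret = (path: string) => Promise<Uint8Array>;

// Simple f: read one secret.
const readSecret: ReadSecret = async (path) => {
  // ... underlying vault read would go here ...
  return new Uint8Array();
};

// Composed variant f': read many secrets by reusing f, still a single handler.
async function readSecrets(paths: Array<string>): Promise<Array<Uint8Array>> {
  return Promise.all(paths.map((path) => readSecret(path)));
}
```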

@aryanjassal (Contributor, Author)

Apparently the RawHandler will not, by default, serialize errors. If an error is thrown, then it would propagate to the transport layer, and cause a read: 0 at the client. To handle this, the serializer needs a way to serialize errors, and the parser needs a way to watch for and parse thrown errors. The code for doing this can be taken from js-rpc and how it is currently handled in other RPC handlers like UnaryHandler, etc.

This, among the other implementation requirements, makes this PR take too long, and as such it won't be part of the first iteration. The first iteration of the secrets edit command will be a Polykey CLI change only, creating a file if it doesn't already exist. Using RawHandlers for better streaming is out of scope for that feature and will be worked on in future iterations. The code in this PR will be cleaned up and worked on later.
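For reference, the rough idea is to carry the error in-band as its own frame so the client parser can recognise and rethrow it instead of seeing the stream die with read: 0. The actual mechanism should be lifted from how js-rpc serializes errors for its other handlers, so the sketch below is illustrative only:

```ts
// Illustrative only: an error becomes a frame the parser can recognise and
// rethrow. The frame shape is a guess; the real format should follow js-rpc's
// existing error serialization.
type ErrorFrame = { type: 'ERROR'; message: string; data?: unknown };

function serializeError(e: Error): Uint8Array {
  const frame: ErrorFrame = { type: 'ERROR', message: e.message };
  return new TextEncoder().encode(JSON.stringify(frame) + '\n');
}
```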

@CMCDragonkai (Member)

@tegefaulkes Is the primary purpose of this PR mostly to improve the performance of secret editing commands? As in, this is due to file content transfer between the PK client and the PK agent? I can see how raw content might help for larger files, especially if you need to stream edits or block edits; a single unary call wouldn't get you any sort of progress update.

However, I fear that doing this across all of our calls is going to make integration for other clients much more difficult in the future, necessitating streaming operations for them. Have you considered the trade-offs here?

Comment on lines +613 to +614
streamToAsyncGenerator,
asyncGeneratorToStream,
Review comment (Member):

I feel like these should really be in their own utility module.

Also, don't we already have IxJS combinators available? Doesn't that already do this? Usually that library has these features.
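For reference, these conversions can be written against plain WHATWG stream APIs (globals in Node >= 18); a minimal sketch of what such utilities might look like, which may differ from the ones in this PR:

```ts
// Minimal sketches only; the actual utilities in this PR may differ.
async function* streamToAsyncGenerator<T>(
  stream: ReadableStream<T>,
): AsyncGenerator<T> {
  const reader = stream.getReader();
  try {
    while (true) {
      const result = await reader.read();
      if (result.done) return;
      yield result.value;
    }
  } finally {
    reader.releaseLock();
  }
}

function asyncGeneratorToStream<T>(gen: AsyncGenerator<T>): ReadableStream<T> {
  return new ReadableStream<T>({
    async pull(controller) {
      const result = await gen.next();
      if (result.done) controller.close();
      else controller.enqueue(result.value);
    },
    async cancel() {
      // Let the generator run its finally blocks when the consumer goes away.
      await gen.return(undefined);
    },
  });
}
```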

Review comment (Member):

See the ix library 5.0.0. I feel like there would already be some combinator for converting async iterators to web streams.

Review comment (Member):

A file called fileTree.ts under src/vaults? Is this right? What is this file for? Usually utilities go under utils/; otherwise this should be a class-based module. What's the meaning behind this?

Review comment (Member):

Please start writing module comments in TypeDoc style, as well as function doc comments.
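For example (the function name and signature below are made up purely to show the comment style, not the real fileTree API):

```ts
/**
 * Utilities for serializing vault file contents into binary streams and
 * parsing them back out. (Example module comment; wording is illustrative.)
 * @module
 */

/**
 * Serializes the given secret paths into a single framed binary stream,
 * preserving the order of `paths`.
 * @param paths - Vault-relative paths of the secrets to serialize
 * @returns A stream of framed file contents
 */
function serializerStreamFactory(
  paths: Array<string>,
): ReadableStream<Uint8Array> {
  // Body elided; this snippet only demonstrates TypeDoc-style doc comments.
  throw new Error('Not implemented');
}
```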

@tegefaulkes (Contributor)

When I was creating the utilities, I was assuming that serialisation as a raw stream was going to be a necessity for some of the commands, especially some of the more complicated ones such as mv and cp, since they need to operate over a whole tree.

We may have to have a discussion on this to clarify the constraints and get on the same page @CMCDragonkai.

@CMCDragonkai (Member)

When moving within a vault or between vaults, that wouldn't involve any serialisation.

When moving into and out of the vault, that may involve serialisation or deserialisation of the tree structure.

Instead of inventing a new serialisation, why not use a standardized container format? Like zip or tar.

@tegefaulkes (Contributor)

That is one option. I'll need to look into it once I get some time to go over it.

@CMCDragonkai (Member)

A standardized format would be easier for other implementations to do. Plus this means we can introduce this: https://github.com/matrixai/js-virtualtar. I never finished the implementation, but it would be cool to do.

Plus, as a tar format, it doesn't compress, but that's unnecessary since we expect the transport layer to introduce compression if necessary; no need for double compression.

As a streaming container format, it can be stream-deserialized into file contents; basically, it's perfect for what we want to do.

It's also suitable for binary streams.

So I vote for virtualtar over a custom format, and not zip, because zip compresses things and we don't really need that, plus we don't need random access when streaming over the network.

Note that if we want to eventually support resumable uploads/downloads, one would also need to implement deterministic tar. But that's down the line.

Plus, by using tar, this actually enables another command, secrets tar, which would allow users to easily stream out an archive of a secret's contents as an onramp/offramp, and even pipe such a command (using - as the option) to gzip, gnupg, or any other UNIX command. It's perfect for this!
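As a rough illustration of that streaming behaviour using the third-party tar-stream package (js-virtualtar isn't finished, so this is only a stand-in for the same idea), entries arrive one at a time and can be handled without buffering the whole archive:

```ts
import * as tar from 'tar-stream';

// Illustration only: deserialize a tar stream entry by entry as it arrives.
const extract = tar.extract();
extract.on('entry', (header, stream, next) => {
  // header.name / header.type / header.size describe the entry (file, dir, ...).
  console.log(`entry: ${header.type} ${header.name} (${header.size} bytes)`);
  stream.on('end', next); // advance once this entry's bytes are drained
  stream.resume(); // discarded here; a real handler would pipe into the vault
});
// e.g. pipe the incoming network stream (stdin here) into the extractor.
process.stdin.pipe(extract);
```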

@tegefaulkes (Contributor)

It will definitely simplify things if it's easy to integrate. I do like the idea, but it will need to work with a dependency-injected fs and be able to work with the EFS directly. If possible, we really want to avoid writing to disk just to construct the tar file. It's not the end of the world, but the less the secrets touch the disk, the better.
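On the packing side, under the same assumption of a tar-stream-like API, entries can be fed straight from a read stream (for example an EFS file read stream) without staging anything on disk; getSecretStream below is a hypothetical accessor, not a real EFS method:

```ts
import * as tar from 'tar-stream';
import { pipeline } from 'stream/promises';
import type { Readable } from 'stream';

// Sketch only: stream one secret's bytes directly into a tar entry, so the
// archive is built in flight rather than written to disk first.
async function packSecret(
  pack: ReturnType<typeof tar.pack>,
  name: string,
  size: number,
  getSecretStream: (name: string) => Readable, // hypothetical EFS read stream
): Promise<void> {
  const entry = pack.entry({ name, size });
  await pipeline(getSecretStream(name), entry);
}
```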
