Proposal: info-dict extension to the File-Transfer Protocol #39

Open
afontenot opened this issue Apr 25, 2023 · 14 comments

Comments

@afontenot

The following is a draft of a protocol extension.

Versions:

  • 2023-04-25: initial draft

File-Transfer Protocol (info-dict extension)

The Magic Wormhole File-Transfer Protocol involves two stages. In the first, a Wormhole connection is mediated between the sender and the receiver by a third party Rendezvous Server. The connection is established by a PAKE which results in encrypted communications not readable by a third party, including the server.

In the first stage, at present, the sender provides an offer to the receiver. This offer is currently one of three types:

  • message for a text message
  • file for sending a single file
  • directory for sending a directory of files compressed into a single archive file

If the receiver accepts the offer, the protocol moves into the second stage. The transit protocol involves the transfer and validation of the message, file, or archived directory (as appropriate) over a different connection, which is created using connection hints sent over the Wormhole. Once this transit connection is created, the Wormhole is typically closed.

This extension to the File-Transfer protocol involves two components:

  • The addition of an info key (and associated values) to the offer message
  • A specification for transit messages for managing transfers related to info offers

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The info-dict offer

The info dictionary offer is based on the info dictionary from the BitTorrent protocol, and it is intended that info dictionaries from compliant BitTorrent v1 torrents also be valid under this specification, when encoded as JSON.

An info-dict offer enables three features that are not available in other modes:

  • Sending whole directories directly between sender and receiver without requiring archiving or compression
  • Partial downloads for the case where the receiver only wants some of the files that the sender is offering
  • Trivial resumption of partially complete downloads by comparing existing piece hashes to those made available by the sender
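Resumption can be sketched roughly as follows. This is a minimal illustration, not part of the draft: it assumes the receiver's partial data is held in memory as one byte string covering the offer's files in order, that the piece hashes are available as a list of raw digests, and that SHA-256 is the hash in use. The function name `pieces_to_request` is hypothetical.

```python
import hashlib

def pieces_to_request(local_data, expected_digests, piece_length,
                      new_hash=hashlib.sha256):
    """Return the indexes of pieces that are absent or fail their hash check.

    piece_length is the base 2 logarithm of the piece size, as in the offer.
    """
    piece_size = 2 ** piece_length
    missing = []
    for index, expected in enumerate(expected_digests):
        chunk = local_data[index * piece_size:(index + 1) * piece_size]
        # An empty, truncated, or corrupted chunk will not match the digest.
        if not chunk or new_hash(chunk).digest() != expected:
            missing.append(index)
    return missing

# Offer data is b"abcdefghij" with 4-byte pieces: b"abcd", b"efgh", b"ij".
offered = [hashlib.sha256(p).digest() for p in (b"abcd", b"efgh", b"ij")]
# The receiver has the first piece intact, a damaged second piece, no third.
to_fetch = pieces_to_request(b"abcdXXgh", offered, piece_length=2)
```

Only the pieces in `to_fetch` need to be transferred again; a real client would read and hash from disk rather than from memory.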

To offer an info-dict for download, the sender SHALL include an info key in the offer dictionary. This info key SHALL include the following top level keys:

  • name is the suggested file name (if the offer contains a single file) or directory name (if the offer contains multiple files).
  • piece length is an integer and is the base 2 logarithm of the piece size (as defined below) in bytes.
  • pieces is a string containing the concatenated hashes of every piece in the offer. The order of the hashes is the order of the files (if the offer contains multiple files), and then in-order through each file from beginning to end. The pieces are the offered file set split into chunks (which can span across files) of size equal to the piece size, which is 2 ^ piece length.

The info key SHALL also contain either but not both of the following keys:

  • length is the exact number of bytes in the file, if the offer contains only one file (with the name given by the name key).
  • files specifies a list of files provided by the offer, if the offer contains multiple files (with the top level directory given by the name key). The ordering of the files is significant as it specifies the relationship between the pieces and the files. Each item in the list is a dictionary that SHALL contain the following two keys:
    • length is the number of bytes in the file
    • path is a list of strings providing the path to the file under the top level directory. In this list, the last string is the suggested name of the file. Each string before this final string (if any) is the name of a directory contained by the directory immediately preceding it, and the first directory in the list (if any) is contained by the top level directory.

Receivers SHALL either ignore or replace characters in the path that are invalid on their operating system or file system, or reject the offer if it contains these characters. In addition, receivers SHALL NOT interpret any path component that would cause directory traversal (such as a ".." component on some systems) or placing files outside the top level directory.
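As an illustration of the path rules above, a receiver that takes the "reject the offer" option might validate each `path` list along these lines. This is a sketch under assumptions: the helper name `safe_relative_path` is hypothetical, and the set of rejected characters is a conservative guess rather than anything the draft prescribes.

```python
from pathlib import PurePosixPath

def safe_relative_path(components):
    """Return a relative path built from offer path components.

    Raises ValueError for anything that could escape the top level directory.
    """
    if not components:
        raise ValueError("empty path")
    for part in components:
        # "." and ".." enable traversal; an empty component is meaningless.
        if part in ("", ".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
        # Separators or NUL inside a single component are never legitimate.
        if "/" in part or "\\" in part or "\x00" in part:
            raise ValueError(f"invalid character in component: {part!r}")
    return PurePosixPath(*components)

ok = safe_relative_path(["a", "hello.py"])
```

A client that prefers to replace or ignore invalid characters instead of rejecting would substitute them here, but should still refuse the traversal components.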

The info key SHOULD also contain the following key:

  • hashtype specifies the hash that is used for the piece hashes provided in the pieces string.

Receivers compliant with this specification SHALL support the following values for hashtype: "sha256", "blake", "blake160", and "sha1". Receivers MAY support other hash types. The "sha256", "blake", and "sha1" values indicate that the corresponding hashes are standard SHA-256, BLAKE2b, and SHA-1 hashes (respectively) with the default digest size. The value "blake160" is the BLAKE2b hash function with a digest size of 20 bytes. This was chosen to correspond to the hash size of the SHA-1 function (which is used by the BitTorrent protocol), while retaining excellent resistance to collision attacks.

An info-dict offer with no provided hashtype SHALL be interpreted to have a hash type of SHA-1 for historical compatibility. However, senders SHOULD provide a hashtype value and SHOULD NOT use the SHA-1 hash.
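The four required hash types map directly onto Python's standard `hashlib` module. The following is a minimal sketch; the names `HASH_FACTORIES` and `new_hash` are hypothetical, and the fallback to SHA-1 reflects the rule above for offers without a hashtype.

```python
import hashlib
from functools import partial

# Factories for the hash types that receivers are required to support.
HASH_FACTORIES = {
    "sha256": hashlib.sha256,
    "blake": hashlib.blake2b,                            # default 64-byte digest
    "blake160": partial(hashlib.blake2b, digest_size=20),
    "sha1": hashlib.sha1,
}

def new_hash(hashtype=None):
    """Return a fresh hash object for an offer's hashtype (SHA-1 if absent)."""
    if hashtype is None:
        hashtype = "sha1"
    try:
        return HASH_FACTORIES[hashtype]()
    except KeyError:
        raise ValueError(f"unsupported hashtype: {hashtype!r}")

digest_sizes = {name: new_hash(name).digest_size for name in HASH_FACTORIES}
```

The digest size matters because it determines how the pieces string is split into individual hashes.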

Senders and receivers SHOULD NOT limit the piece size beyond the expected limitations of the hardware they run on. It is RECOMMENDED that senders default to a 64 MiB piece size (2^26 bytes). Where the final piece of the last offered file does not coincide with the exact piece size boundary, the hash for the piece SHALL be the hash of the actual data, with no padding.
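To make the piece rules concrete, here is a small sketch of how a sender might compute the piece hashes for a set of files. It assumes the file contents are supplied in offer order as in-memory byte strings and uses SHA-256; the function name `piece_hashes` is hypothetical. The draft does not say how the concatenated hashes are encoded inside the JSON offer, so the sketch returns raw digests and leaves encoding aside.

```python
import hashlib

def piece_hashes(file_contents, piece_length, new_hash=hashlib.sha256):
    """Hash the concatenation of the files in chunks of 2 ** piece_length bytes.

    Pieces may span file boundaries; the final piece is hashed unpadded.
    """
    piece_size = 2 ** piece_length
    digests = []
    buffer = bytearray()
    for data in file_contents:
        buffer.extend(data)
        # Emit every complete piece accumulated so far.
        while len(buffer) >= piece_size:
            digests.append(new_hash(bytes(buffer[:piece_size])).digest())
            del buffer[:piece_size]
    if buffer:
        # Short final piece: hash only the actual data, with no padding.
        digests.append(new_hash(bytes(buffer)).digest())
    return digests

# Two files, 6 and 4 bytes, with 4-byte pieces: b"abcd", b"efgh", b"ij".
hashes = piece_hashes([b"abcdef", b"ghij"], piece_length=2)
```

A real client would stream from disk; with the recommended 64 MiB piece size, `piece_length` would be 26.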

The receiver SHALL indicate acceptance of an info-dict offer in the same way as for other offers under the File-Transfer Protocol.

Transit protocol extensions for info-dict support

This specification is intentionally opaque about the nature of the transit protocol. The only requirement is that the protocol support both transfer of binary data as well as JSON-encoded control messages, and that both the sender and receiver be able to distinguish the two.

In particular, it is not specified whether the connection happens directly or through a relay server, whether the connection is TCP or UDP, or whether a single stream or multiple simultaneous streams are used.

However, typical connections will be established using connection hints as specified in the File-Transfer Protocol specification, and they are expected to be encrypted using secrets exchanged through the rendezvous connection. See the Wormhole Transit Protocol specification for more information.

Clients compatible with this specification add support for several message types over the transit protocol.

Receiver size hints

Immediately upon establishing a transit connection, a receiver SHOULD send a message containing a wants key. If provided, this key MUST contain a value indicating the exact number of bytes from the offer that the receiver expects to request. A client on the sending side SHOULD use this information to provide an accurate indication of progress, if the client provides progress indicators.

If for any reason (except for checksum validation errors) the number of bytes the receiver expects to download changes, the receiver SHOULD send an updated wants message. These messages MUST contain the total number of bytes the receiver expects from the entire transfer, including from pieces already downloaded. Receivers that send this message MUST NOT double-count bytes from pieces that fail checksum validation or are otherwise downloaded multiple times.
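A receiver could compute the value for a wants message roughly as shown below. The draft does not give an example of this message, so the `{"wants": N}` shape is an assumption, as are the helper names. The sketch also assumes that pieces are always requested whole, so the byte count is the sum of the sizes of the distinct wanted pieces.

```python
def piece_sizes(total_length, piece_length):
    """Sizes in bytes of every piece in the offer; the last may be short."""
    piece_size = 2 ** piece_length
    full, remainder = divmod(total_length, piece_size)
    return [piece_size] * full + ([remainder] if remainder else [])

def wants_message(total_length, piece_length, wanted_pieces):
    """Build a wants message covering each wanted piece exactly once."""
    sizes = piece_sizes(total_length, piece_length)
    # A set prevents double-counting pieces that are requested more than once.
    return {"wants": sum(sizes[i] for i in set(wanted_pieces))}

# A 10-byte offer with 4-byte pieces has pieces of 4, 4, and 2 bytes.
message = wants_message(10, 2, wanted_pieces=[0, 2, 2])
```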

Receiver requested pieces

Pieces are sent by the sender only when they are requested by the receiver. Receivers queue up pieces to be sent with a request message. Receivers SHOULD keep enough requests queued up that they are not left waiting for data between downloading pieces. Request messages SHALL take the following form:

{
    "req": [
        0,
        1
    ]
}

Here, the numbers indicate the (zero-indexed) offset to the pieces provided in the offer. Note that the sender and receiver can determine both the byte offset (in the set of offer files) and the hash offset (in the pieces string), because both the piece size and hash digest size are defined in the offer.
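The two offsets can be derived as in the following sketch. The function names are hypothetical, and the pieces string is assumed to be available as raw bytes (the draft does not specify its encoding within JSON).

```python
def piece_byte_range(index, piece_length, total_length):
    """Return (start, end) byte offsets of a piece within the offered file set."""
    piece_size = 2 ** piece_length
    start = index * piece_size
    if index < 0 or start >= total_length:
        raise IndexError(f"no such piece: {index}")
    # The final piece may stop short of a full piece size.
    return start, min(start + piece_size, total_length)

def piece_digest(index, pieces, digest_size):
    """Slice the expected digest for a piece out of the concatenated hashes."""
    start = index * digest_size
    return pieces[start:start + digest_size]

byte_range = piece_byte_range(2, piece_length=2, total_length=10)
digest = piece_digest(1, pieces=b"aabbcc", digest_size=2)
```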

Receivers SHOULD always request the pieces they want in numerical order. Requesting data sequentially through the files allows for more efficient, predictable i/o on many systems.
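For a partial download, the receiver has to translate the files it wants into piece indexes. A minimal sketch, assuming the file lengths are taken from the offer's files list in order; the function name `pieces_for_files` is hypothetical.

```python
def pieces_for_files(file_lengths, wanted_file_indexes, piece_length):
    """Return the sorted piece indexes that cover the wanted files."""
    piece_size = 2 ** piece_length
    wanted = set(wanted_file_indexes)
    needed = set()
    offset = 0  # byte offset of the current file within the whole offer
    for file_index, length in enumerate(file_lengths):
        if file_index in wanted and length > 0:
            first = offset // piece_size
            last = (offset + length - 1) // piece_size
            needed.update(range(first, last + 1))
        offset += length
    # Sorted output keeps the request in numerical order.
    return sorted(needed)

# Files of 6, 4, and 5 bytes with 4-byte pieces; only the second file is wanted.
request = {"req": pieces_for_files([6, 4, 5], [1], piece_length=2)}
```

Because pieces can span files, the receiver may download a few bytes belonging to neighbouring files; it can discard them after validating the piece.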

Upon receiving a request for a piece, the sender SHALL send it through the transit protocol in the appropriate manner for binary data.

Accepting / rejecting / re-sending pieces

The receiver SHALL check each piece against the hash digest given in the offer as it comes in, and SHALL reject any piece that does not match, unless strong mitigating circumstances prevail. Examples of such circumstances include that the sender has an incorrect or incomplete copy of the file, and the user / operator of the receiver has actively requested to accept data that fails checksum validation. If such circumstances are expected to occur, receiver software MAY choose to implement support for ignoring checksum failures, with an appropriate warning.

When a piece fails a check, a receiver MAY choose to request the same piece again. Senders are RECOMMENDED to provide a piece again if requested. Either side MAY choose to hang up the connection if a request repeatedly fails.

Acknowledgements

When a piece succeeds, the receiver SHALL send an acknowledgement in the following form:

{
    "ack": [
        0,
        1
    ]
}

Note that the receiver MAY send individual acknowledgements for each piece separately, but if multiple pieces enter the finished state before it sends an acknowledgement, it MAY acknowledge all of them at once as shown above.
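Validation and acknowledgement on the receiving side could be combined along these lines. This is a sketch only: received pieces are assumed to be held in a dict keyed by piece index, the expected digests are a list of raw bytes, SHA-256 is assumed, and the function name `sort_received_pieces` is hypothetical.

```python
import hashlib
import hmac

def sort_received_pieces(received, expected_digests, new_hash=hashlib.sha256):
    """Split received pieces into (accepted, rejected) lists of piece indexes."""
    accepted, rejected = [], []
    for index in sorted(received):
        actual = new_hash(received[index]).digest()
        if hmac.compare_digest(actual, expected_digests[index]):
            accepted.append(index)
        else:
            rejected.append(index)
    return accepted, rejected

expected = [hashlib.sha256(b"abcd").digest(), hashlib.sha256(b"efgh").digest()]
accepted, rejected = sort_received_pieces({0: b"abcd", 1: b"efgX"}, expected)

ack_message = {"ack": accepted}    # pieces that passed validation
retry_message = {"req": rejected}  # pieces the receiver may request again
```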

Ending the connection

At any point, the receiver MAY hang up the connection with a success indication by sending

{
    "ack": "ok"
}

FAQ

  1. Why include support for legacy hashes like SHA-1?

    The intention is to make adding support for this protocol extension as easy as possible for implementers. A large quantity of software already exists for creating and handling BitTorrent format info dictionaries, and using this software is likely to be the quickest way to implement support in many cases. Furthermore, collision resistance is rarely relevant to file transfer cases. Preimage resistance is far more important, and SHA-1 retains this. Other than the blake160 hash with a non-standard digest length, SHA-1 also has the shortest digest of any hash with REQUIRED support in this specification. Shorter hashes make for more efficient info dictionaries.

  2. Is this implementing BitTorrent support for Magic Wormhole?

    No. This protocol extension provides a new offer format that allows sending a set of files between a single sender and (usually) one receiver, where the metadata provided for the offer is compatible with that used by the BitTorrent info-dict specification, but the protocols are otherwise unrelated.

  3. What does this achieve that Magic Wormhole cannot achieve without it? Is using BitTorrent a better choice for this use case?

    As mentioned above, this allows sending multiple files without involving the overhead of an archive format, as well as partial downloads and updates to previously shared data. BitTorrent is not a plausible alternative for this use case. In particular, with this extension, Magic Wormhole implements:

    • a highly secure connection between a sender and receiver. BitTorrent does not support modern, secure forms of encryption between clients

    • an efficient transport mechanism for one-to-one and one-to-many transfers, thanks to the opacity of the file transfer protocol to the underlying transit protocol. BitTorrent is optimized for the many-to-many case, and only creates a single TCP or UDP connection between pairs of peers.

    • a conversation establishing mechanism for two peers who want to talk to each other, and no one else, via the mailbox protocol. BitTorrent would require two peers to know each other's IP addresses and does not provide any mechanism for authentication.

  4. Why emphasize the opacity of the transit protocol so much?

    This feature gives Wormhole clients a lot of flexibility and potential for speed. The author of this extension specification is also working on a transit protocol extension that would allow two clients to keep open multiple transit connections between them and use them simultaneously when exchanging binary data (e.g. the pieces in this specification). Parallel transfers frequently offer an enormous speedup over sequential ones. Hopefully, with both extensions in place, Wormhole clients will be capable of multiple-Gbps transfers on commodity hardware.

@afontenot
Author

I would be thrilled to hear comments or suggestions on this proposal. Thanks for reading!

@meejah
Member

meejah commented Apr 25, 2023

This sounds like it has a lot of overlap with "Dilation" (and "Dilated File Transfer" or "transfer v2").

Have you seen those?

@afontenot
Author

This sounds like it has a lot of overlap with "Dilation" (and "Dilated File Transfer" or "transfer v2").

Have you seen those?

I've read through the dilation spec multiple times. Brief comments:

  • The spec is much less clear than I would like about what exactly it is supposed to do. From the point of view of file transfers, if I'm reading it correctly, it's still supposed to be pretty low-level. It's also got some clarity issues IMO, for example it refers to a "generation" repeatedly but doesn't clearly define this concept.

  • Provided that the dilation spec is sufficiently low level, this file transfer protocol extension would presumably run on top of it.

  • My biggest concern is that whatever problems the dilation protocol is intended to solve (again, this is unclear to me) it doesn't seem interested in solving what I take to be the biggest problem, which is that the current transit protocol specifies that when a single successful TCP stream is opened, all communication happens over that stream and all others are closed. In fact it seems to explicitly keep this feature:

    Each connection must go through an encrypted handshake process before it is considered viable. Viable connections are then submitted to a selection process (on the Leader side), which chooses exactly one to use, and drops the others.

    At all times, the wormhole will have exactly zero or one L3 connection.

I had not read the "dilated file transfer" spec before now. Unfortunately it was a bit hidden in a pull request, so I didn't know it existed. I'm reading through it now, so here are a few (less well considered!) thoughts:

  • It looks like one important difference is that my proposal is a simple extension to transfer v1, without requiring the implementation of dilation (which I think only the Python client supports at present?).
  • The closest thing to what this proposal offers is the DirectoryOffer. This offer has a few requirements that are arbitrary (IMO):
    • It looks like the only option for a receiver who wants a file at the end of a DirectoryOffer is either to download the entire offer or abort. This is a huge limitation, IMO. One thing this proposal is trying to achieve is that the receiver gets a say in what is transferred, meaning that you get transfer resumption and partial transfers for free. You can even resume single files, because the files are chunked with each chunk receiving a checksum.
    • The requirements on stat information are somewhat perplexing. Presumably each file is going to be individually checksummed, and as long as the file received matches what the sender originally intended to send, I'm not sure what the problem is. Obviously, the receiver will have no way to check this data, so they are reliant on the sender to send appropriate data. This strikes me as a specification for preferred client behavior, rather than a specification for the protocol.
    • Data messages are not allowed to be bigger than 65536 bytes. I'm not sure why this restriction exists - my proposal is to send 64 MiB chunks for large files. (Obviously this will get chunked into much smaller packets by a lower level protocol e.g. TCP.)
  • The FileOffer is ambiguous in the DirectoryOffer context. A DirectoryOffer could contain more than one file with the same name (e.g. a/hello.py, b/hello.py). The spec should make clear how the FileOffers are supposed to disambiguate these.
  • The spec should probably specify what the receiver is to do to avoid paths with invalid characters or path traversal.

I'm glad to see someone else is working on file transfer improvements! I'd be delighted to try to merge my work with this, provided that it can accomplish some of the basic things I'm trying to do here:

  • partial downloads
  • download resumption from a cold-start (completely dead) connection
  • simultaneous parallel data transfer streams for a single Offer

If it helps clarify anything, my end goal here is extremely high speed secure transfers for very large datasets (multiple TB at multiple Gbps), for clients that are both behind NAT or firewalls. AFAIK the Rust implementation of Magic Wormhole is the only version that supports hole punching, so it's the biggest target for me.

@meejah
Member

meejah commented Apr 26, 2023

So, to explain some context: "Dilation" is definitely a low-ish level protocol change, as you've identified. So only together with "transfer v2" (or "Dilated File Transfer") does it address the sort of "end-user" features you're wanting here (i.e. transferring more than 1 file / directory per wormhole setup).

However, putting in a separate "layer" also makes it more widely useful for other wormhole protocols. At a high level Dilation gives you:

  • durable streams: bytes are delivered, period. Over network changes, laptop sleeps, etc
  • persistent connection: as above, it re-establishes the connection in a wide variety of network conditions and events (as long as both sides keep their mailbox + spake keys)
  • multiplexed subchannels: keep things logically separate, while still using a single TCP connection at a time (between two peers).

So, this (should!) mean that re-tries are redundant: you don't need to re-do a transfer, you just keep the application alive (or, if it correctly serializes state, re-start it) until the transfer concludes. Note that there is a privacy concern with features like "do you already have X?", especially if it operates w/o human involvement: one end could use this to confirm if the peer has particular files.

On top of the above is where the Dilated File Transfer protocol runs. So it has all the benefits of Dilation and some additional features beyond the rather simplistic existing file-transfer (multiple offers / answers, bi-directional transfers, etc).

The stuff about "viable connections" and so on is because you can have multiple connection hints in play (just as with the existing system) and need a way to decide which one to use now. Also, "which one is viable" may change as your local network conditions change. It may be that you can add more options, too (e.g. if you got a public IP address later). So for example, you may have a "direct" hint, two different transit helpers and a Tor Onion hint in play -- and it's possible that more than one will successfully connect.


For Dilated File Transfer itself:

  • the 65k thing is about Noise (i.e. Noise has this limit) and we are in the process of fixing the spec here so that Dilation (and higher) protocols don't have to think about it (they will continue to have a 4GiB limit).
  • the discussion about stat stuff you're referring to is, I believe, about the fact that on-disk files can change between when you make the offer and when the file is ultimately uploaded/transferred

For the "granularity of transfers" part, we did discuss this quite a bit with @warner and others, and the thinking is that we wanted Offers to correspond to "things the user drops", conceptually. That is: a user drops "a Directory" and so that is accepted (or not). A particular client (or user) could make each file an individual Offer if they preferred (all offers can be in parallel) -- so the idea is that for whatever reason "a directory" is a cohesive collection of files. That said, maybe we've gotten this wrong -- I'd love to see some UX research indicating which choice is best. (Also maybe the names are just bad: the important point is that there are two sorts of offers available: "a single file" and "a collection of files"). But, yes, you're correct that the protocol only offers a "y/n" on an entire Offer.

I also believe we've got enough versioning information that we can expand the protocol in the future in case more offer-types are desired. For example, perhaps a third type which is "a collection of files, but you might not want them all" (whatever that is named ;) ).


Least Authority is currently executing a grant to add Dilation and Dilated File Transfer to the Rust implementation (and work on limitations in the specification, and appropriate changes to the Python reference implementation).

If you're interested in collaborating further to ensure the specification either already meets your goals or at least can be extended to do so, I'm happy to schedule a video-call or "meeting" on IRC or similar?


I would be very interested to see any testing data you have on transfer-speeds etc. for "parallel" connections. Obviously, the Dilation specification is (currently) just multiplexing on top of a single TCP stream -- although if it's significantly faster to use more connections, I don't see any reason we couldn't introduce a future revision that multiplexed over multiple streams. Definitely in protocols like BitTorrent there are obvious advantages since the endpoints are different network elements (but here, they are not).

I see a lot of advantages to keeping concerns separate: keeping Dilation at a different layer lets other higher-level Wormhole protocols take advantage. (For example, if it was made parallel then any protocol including file-transfer could immediately use any speedups).

Thanks for the writeup!

@meejah
Member

meejah commented Apr 26, 2023

p.s. "resumption" (or I guess "partial file transfer"?) could be a very interesting feature for sure if we can figure out the privacy aspects properly. So I don't want any of the above to imply I'm dismissing that feature, but we left it out of this first revision of Dilated File Transfer because it's complex, hard to get privacy angles correct and may not be needed very often in practice due to the underlying re-try/durability of Dilation itself -- more experience + research needed :)

@afontenot
Author

Note that there is a privacy concern with features like "do you already have X?", especially if it operates w/o human involvement: one end could use this to confirm if the peer has particular files.

The situation you're thinking about is that the sender and receiver have a long-lived connection, and the receiver is set to automatically download everything that the sender offers except if the receiver already has the files in question in the target directory? I guess that's a possible issue, though probably far from a common one. At any rate, I think it would be okay to put this behind a flag like it is in wget (-c | --continue).

On the other hand I think benefits go beyond resumption - if I have a ~20 GB directory of log files and I use a long-lived wormhole to sync them to another server, it would be nice not to have to send the full 20 GB every time the transfer happens.

persistent connection: as above, it re-establishes the connection in a wide variety of network conditions and events

This is the sort of thing that's potentially just as privacy impacting as the "do you already have X" thing. E.g. if I'm using a VPN to hide my IP address from the receiver, and the VPN connection drops, I might expect the connection to fail. But if it is automatically re-established and my routes revert to pushing data back over eth0...

the discussion about stat stuff you're referring to is, I believe, about the fact that on-disk files can change between when you make the offer and when the file is ultimately uploaded/transferred

Right - I don't see the point of making this a protocol requirement, because there's no way to enforce it. It's best effort anyway, because a (malicious?) program could absolutely make changes to a file without changing its size or modification time in between when the Wormhole client does its initial directory walk and when it actually sends the files. The only way to do this 100% safely is to checksum a file in advance, send the checksum to the receiver in the initial offer, and then you can check it again (and the receiver can check it) while you're streaming it in at some point in the future.

(Maybe discussion of the dilated protocol needs to happen elsewhere. This is all nitpicking on my part, which could be a helpful contribution in some contexts but probably not here.)

the thinking is that we wanted Offers to correspond to "things the user drops", conceptually. That is: a user drops "a Directory" and so that is accepted (or not). A particular client (or user) could make each file an individual Offer if they preferred (all offers can be in parallel) -- so the idea is that for whatever reason "a directory" is a cohesive collection of files.

I can only make a partial case for each of the features I've proposed, but my view was that because they go together so well, to have support for one of them is to get the others for free. (Or at least, easy support for them in the protocol.) Partial transfers, resumption from cold start, checksums (and retransmissions) for individual failed pieces, etc etc, all come together in one nice package if you allow the receiver to specify which files they want from an offer. So as my proposal implements it, the receiver "requests" pieces from an offer (after accepting it), rather than accepting and then receiving it passively.

I would be very interested to see any testing data you have on transfer-speeds etc. for "parallel" connections.

A very trivial test I performed right now: on my home Internet connection, an iperf3 to a nearby server gives me 695 Mbps over a single TCP stream. With 2 parallel TCP streams that goes up to 941 Mbps, roughly the maximum the hardware supports until I get around to upgrading it.

Or more to the point, the last time I had a major use case for Wormhole was assisting some scientists trying to transfer a large amount (~1TB) of data between computer nodes across the country. Unfortunately, both nodes were behind firewalls that they had no control over. You could do sftp over a VPN connection but it was horribly slow. I ended up solving the problem by figuring out how to expose the files on an http server. One download thread gave us ~30 MB/sec. With four parallel threads we got it up to 110 MB/sec, and only stopped there because we were concerned about load. (Fortunately, the data was sufficiently public that exposing it on the server was not an issue, but this is not always the case.)

More generally, the speed of a TCP stream is limited by the amount of data that can be in flight: the maximum throughput is the size of the TCP buffer for the socket divided by the latency (round trip time) of the connection. E.g. the default Linux buffer is something like 3 MB, so over a 100 ms connection (what we were dealing with in the case above), that's a limit of 30 MB/sec. That's why you can often see a speed up proportional to the number of TCP streams. (Some systems, e.g. the BSDs I believe, have difficulty handling high bitrates on a single TCP stream because the processing is single threaded. It's been a while since I looked at this, but back when I used an OPNsense router, I couldn't route more than 300 Mbps in a single stream on low power hardware.)
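The arithmetic can be written out as a quick check. This is only a rough upper bound that ignores congestion control, packet loss, and buffer autotuning; "3 MB" is taken as 3,000,000 bytes, and the function names are made up for illustration.

```python
import math

def max_stream_throughput(buffer_bytes, rtt_ms):
    """Upper bound in bytes/sec for one TCP stream: buffer size / round trip time."""
    return buffer_bytes * 1000 / rtt_ms

def streams_needed(target_bytes_per_sec, buffer_bytes, rtt_ms):
    """Number of parallel streams needed to reach a target rate under this bound."""
    per_stream = max_stream_throughput(buffer_bytes, rtt_ms)
    return math.ceil(target_bytes_per_sec / per_stream)

per_stream = max_stream_throughput(3_000_000, 100)      # 30 MB/sec
streams = streams_needed(110_000_000, 3_000_000, 100)   # 4 streams for 110 MB/sec
```

Under these assumptions, the bound lines up with the observation above that four download threads reached about 110 MB/sec where one gave about 30 MB/sec.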

None of this is an issue for BitTorrent clients, but only because wanting to download at 30 MB/sec or more from a single peer would be extremely rare. So it's okay for BitTorrent to only do 1 connection per peer.


If you're interested in collaborating further to ensure the specification either already meets your goals or at least can be extended to do so, I'm happy to schedule a video-call or "meeting" on IRC or similar?

Thanks for the offer. I may end up taking you up on that after I understand the current work being done better. If there's an IRC, Matrix channel, or maybe even a mailing list (!) for wormhole development, I'm happy to idle / subscribe to have an ear on what's happening. (It would be helpful not to miss stuff like dilated file transfers.)

My biggest concern at present is that multiple streams for a single file transfer is likely to be hard to add in to the protocol later. :-) Your file transfer protocol needs to say "I don't care which transit pipe I get this data from, as long as I can clearly identify it", and then it doesn't seem that bad (at least with v1 transit, I'm not clear enough on the dilation specifics to comment on that). Most of the work has to be done in the low level transit protocol, and the more complex this is, the harder it is to modify.

Is there a timeline for any of the work that is happening? The dilation stuff dates to 2018, and to be honest I assumed it was dead or happening very, very slowly. If we are hoping to do file transfers over the dilated protocol in the Rust client by the end of the year, that changes things significantly.

@meejah
Member

meejah commented Apr 26, 2023

We do have an IRC channel, on Libera #magic-wormhole (I'll warn it's pretty low-traffic though).

@meejah
Member

meejah commented Apr 26, 2023

My biggest concern at present is that multiple streams for a single file transfer is likely to be hard to add in to the protocol later.

I wouldn't add this to the transfer protocol, I'd add it to Dilation.

Then the transfer (or whatever other higher-level) protocol still doesn't have to care about transport details -- it just opens a subchannel (and they happen to multiplex over however many TCP connections you want). This is of course an extremely complex topic.

That's part of the point of separating the transport (Dilation) from the protocol (file transfer) here: they each deal with separate concerns. Of course, the Dilation API needs to be rich enough to handle application-protocol concerns. (Currently, for example, you can't obviously express "open a subchannel, but not on the same stream as this other one").

@meejah
Member

meejah commented Apr 26, 2023

On the other hand I think benefits go beyond resumption - if I have a ~20 GB directory of log files and I use a long-lived wormhole to sync them to another server, it would be nice not to have to send the full 20 GB every time the transfer happens.

Okay, this is more interesting -- in other discussions about use-cases etc we've thought of this as "definitely about file-transfer" and not, e.g., "a synchronization protocol". For example, with UIDs, GIDs, timestamps, symlinks, hardlinks, etc. there are lots of issues here. So, I don't think we see "synchronize these two directories" as a direct use-case for the protocol (and I guess as you point out here, it's not currently going to be very great at that).

However, perhaps that could be expressed as another offer-type? Or something? (I would also ask: "why not rsync?" or maybe "can we make rsync work over a wormhole?")

I have played around with Dilation for general "forwarding of connections" (as has the Rust implementation); see e.g. https://meejah.ca/blog/fow-wormhole-forward (definitely "proof-of-concept" territory). I've further thought of this as a possible "integration point" for experiments: anything that can be expressed as a localhost-listening server or a localhost-connecting client can "do stuff over wormhole" easily in this manner (at least as an experiment).

@meejah
Member

meejah commented Apr 26, 2023

This is the sort of thing that's potentially just as privacy impacting as the "do you already have X" thing.

Yes, absolutely. The "hint" system does have to be careful here. Currently, that's expressed as "use Tor" (or not), and if you're "using Tor" then it only does Tor hints.

(I don't think it's impossible to get a "partial file" system working well, but there are some extra things to get right here for sure).

@meejah
Member

meejah commented Apr 26, 2023

I don't see the point of making this a protocol requirement, because there's no way to enforce it. It's best effort anyway, because a (malicious?) program could absolutely make changes to a file without changing its size or modification time in between when the Wormhole client does its initial directory walk and when it actually sends the files.

Yeah, some more eyes on the Dilated File Transfer (and/or Dilation) would be great -- please do feel free to comment on those parts directly too (e.g. on #23 for the Dilated File Transfer stuff).

Re-reading that section now (I think you mean around line 226), it does seem fairly prescriptive about implementation details, when the real ask is that the peer should try its best to ensure the "actually sent" files match what was originally offered. Indeed, a hash is probably the only good way to ensure they do in fact match. I believe the thinking there might have been that it's faster to do stat() calls than to hash everything? But I think this is a good point (and the stat() details should be left out of the spec).
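To make the trade-off concrete, the following sketch compares a cheap stat()-based fingerprint with a full content hash. The helper names (`stat_fingerprint`, `content_hash`, `demo`) are made up for illustration and do not come from any Wormhole client. The demo rewrites a file with same-length content and restores its timestamps, which is the scenario described in the quoted objection.

```python
import hashlib
import os
import tempfile


def stat_fingerprint(path):
    """Cheap check: size and modification time only."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def content_hash(path, chunk_size=1 << 16):
    """Expensive check: SHA-256 over the full file contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Return (stat_unchanged, hash_unchanged) after a stealthy rewrite."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "offered.log")
        with open(path, "wb") as f:
            f.write(b"original")
        before_stat = stat_fingerprint(path)
        before_hash = content_hash(path)

        # Rewrite with same-length content, then restore the old timestamps.
        st = os.stat(path)
        with open(path, "wb") as f:
            f.write(b"tampered")
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

        stat_unchanged = stat_fingerprint(path) == before_stat
        hash_unchanged = content_hash(path) == before_hash
        return stat_unchanged, hash_unchanged
```

The stat() fingerprint costs one system call per file, while the hash has to read every byte, so this is a speed-versus-assurance choice that an implementation can make for itself rather than something a spec has to mandate.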

@meejah
Member

meejah commented Apr 26, 2023

The situation you're thinking about is that the sender and receiver have a long-lived connection, and the receiver is set to automatically download everything that the sender offers except if the receiver already has the files in question in the target directory?

I was thinking along the lines of: "here is Snowden Docs.zip" and if it completes right away, I know you already had it. Somewhat niche case for sure (and making it optional could be a fine mitigation).

@meejah
Member

meejah commented May 9, 2023

To the general point about "offer a thing, but only take some of it" -- would that use-case be answered by having a new offer-type that's like the Directory offer, but gives the receiver an opportunity to answer back with more than a "yes/no"?

That is, they can reply with "yes, but: only A, D and Z..." or similar?
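A minimal sketch of what such an exchange might look like follows. The `directory-selectable` offer type and the `accept` and `only` answer fields are invented here purely for illustration; nothing like them is defined by the existing protocol or by this proposal.

```python
def make_offer(names):
    """Hypothetical directory-like offer listing each file by name."""
    return {"offer": {"directory-selectable": {"files": sorted(names)}}}


def make_answer(offer, wanted):
    """Receiver reply: accept, but only for the offered files it wants."""
    offered = offer["offer"]["directory-selectable"]["files"]
    accepted = [name for name in offered if name in wanted]
    return {"answer": {"accept": bool(accepted), "only": accepted}}


def files_to_send(offer, answer):
    """Sender side: honour the answer, ignoring names that were never offered."""
    if not answer["answer"]["accept"]:
        return []
    offered = set(offer["offer"]["directory-selectable"]["files"])
    return [name for name in answer["answer"]["only"] if name in offered]


offer = make_offer(["A", "B", "C", "D", "Z"])
answer = make_answer(offer, wanted={"A", "D", "Z"})
selected = files_to_send(offer, answer)
```

The sender re-checks the answer against its own offer so that a receiver cannot request files that were never offered.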

And to clarify timeframes: yes, we expect to be doing Dilated File Transfer over Rust before the end of 2023. (There is PoC-level code for this in Python already, so I expect to have full-featured Python support in that timeframe as well).

I do like incremental approaches where possible, hence the "features" flags etc. So, for example, the above suggestion could take that form (similar to compression) and thus be implemented on a longer timeframe.

@abitrolly

@afontenot it would be nice to answer "why this exists" directly upfront. As I understand it, the answer is given later.

  • Sending whole directories directly between sender and receiver without requiring archiving or compression
  • Partial downloads for the case where the receiver only wants some of the files that the sender is offering
  • Trivial resumption of partially complete downloads by comparing existing piece hashes to those made available by the sender

I don't think uncompressed transfer is a feature. The ability to negotiate a compression protocol may be a feature, but for one-shot transfers KISS is better.

"Receiver wanting only some files" looks like an attempt at bandwidth optimization at the cost of increased server, client, and usage complexity. Again, why would the sender use one-shot transfers to transmit unneeded files, and why does the client need to filter them at the protocol level? There are more maintainable ways to handle this use case.

Resumption of downloads looks like the wrong protocol choice too. There are many file syncing solutions out there.
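For reference, the piece-hash resumption under discussion can be sketched in a few lines. This is only an illustration of the general idea: the piece size, the choice of SHA-256, and the function names are assumptions, not details taken from the proposal.

```python
import hashlib

PIECE_SIZE = 4  # tiny for illustration; real pieces would be far larger


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE):
    """Hash each fixed-size piece of the data (the last piece may be short)."""
    return [
        hashlib.sha256(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def missing_pieces(sender_hashes, partial: bytes, piece_size: int = PIECE_SIZE):
    """Indices of pieces the receiver still needs from the sender."""
    have = piece_hashes(partial, piece_size)
    return [
        index
        for index, expected in enumerate(sender_hashes)
        if index >= len(have) or have[index] != expected
    ]


full = b"aaaabbbbccccdd"
advertised = piece_hashes(full)
# The receiver holds an incomplete copy with one corrupted piece.
needed = missing_pieces(advertised, b"aaaabbXbcc")
```

Here the receiver would request pieces 1, 2, and 3: one is corrupted, one is truncated, and one is absent. Whether this bookkeeping belongs in the transfer protocol or in a dedicated syncing tool is the question raised above.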
