Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support symlinks and other types in ls #27

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Erotemic
Copy link

Fixes #26

Adds some documentation to reference where to learn more about the gateway rest calls, and fixes an issue where ls would fail when a directory contained symlinks or non directory non file data.

I'm not sure what raw or shards are, but ls will list them now. I exepct other parts of the library will have issues with dealing with theses as well.

Copy link
Collaborator

@d70-t d70-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Erotemic for bringing this in 👍.

Apart from the specific comments, please note that generally the async implementation of ipfsspec should be preferred, and probably the synchronous implementation should be dropped at some point (it's relatively easy to use the async implementation in a synchronous setting, but not the other way around. Further, the async implementation should in principle perform better).

Comment on lines +21 to +24
Documentation for the REST API can be found here: [Kubo_RPC_API]_.

References:
.. [Kubo_RPC_API] http://docs.ipfs.tech.ipns.localhost:8080/reference/kubo/rpc/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess as long as we are talking read only, we might want to stick to the HTTP Gateway API (not the full RPC API). This should always be available on locally running kubo installations, but should work equally well on public gateways as a fallback.

ipfsspec/core.py Outdated Show resolved Hide resolved
Comment on lines +186 to +193
types = {
0: "raw",
1: "directory",
2: "file",
3: "metadata",
4: "symlink",
5: "shard",
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also touch what the generic fsspec API expects as a return form ls (and info). I don't know a definitve answer here, but maybe @martindurant could chime in. If I look at e.g. the local filesystem implementation, there are only types file, directory and other. There's additionally the islink flag.
So @martindurant, would raw, metadata, symlink and shard be valid return values for the type? And more specifically, what would be the best return value to communicate the presence of a symlink?

Co-authored-by: Tobias Kölling <[email protected]>
@Erotemic
Copy link
Author

please note that generally the async implementation of ipfsspec should be preferred, and probably the synchronous implementation should be dropped at some point (it's relatively easy to use the async implementation in a synchronous setting, but not the other way around.)

Can you give me an example of how its easy to use async in non-async code? From what I understand you have to create and manage an event loop, which is a lot more code overhead than I want.

Unless I'm missing something and it is a lot easier to use the async variant than I think it is, then I would strongly suggest not dropping the synchronous variant. It's important for immediate feedback when people are working interactively in IPython / Jupyter settings. Also I believe doctests are easier to write with sync code (you don't want to have two lines of setup loop / execute loop boilerplate in a doctest if you just want to access one file).

@d70-t
Copy link
Collaborator

d70-t commented Oct 17, 2023

Ok, maybe the formulation wasn't ideal. To make async code sync, you "just" have to run an event loop and wait for the async code to complete. This has relatively little overhead. On the other hand, if you want to make sync code async, you'll have to create a new thread for every task you start which is much less efficient, especially if you're doing I/O and want to fetch 100s of requests concurrently.

But I agree, there's a bit of boilerplate if you want to make async code look like sync code. Fortunately, fsspec's AsyncFileSystem provides a flag asynchronous which you can use to switch from async to sync API, so in this particular case, it's easy to use. In fact, you'll have the synchronous API by default.

One additional benefit you'll get from the synchronous-from-asynchronous API is, that fsspec allows to pass in multiple requests in one synchronous call (e.g. cat may be called with a list of paths). The sync-async translation is built such that this single synchronous call will spawn multiple asynchronous tasks which will all request their data in parallel, which removes a lot of round-trip time.

An example would be fsspec's HTTP implementation, which is implemented as async internally as well:

import fsspec
print(fsspec.open("http://httpbin.org/uuid").open().read())

or, of course, the AsyncIPFSFileSystem, which you should be able to use in just the same way as the synchronous one.

@Erotemic
Copy link
Author

Erotemic commented Oct 17, 2023

Ah, I see. Thank you for the clarification. I can just use:

import fsspec
fs_cls = fsspec.get_filesystem_class('ipfs')
fs = fs_cls(asynchronous=False)
results = fs.ls("bafybeief7tmoarwmd26b2petx7crtvdnz6ucccek5wpwxwdvfydanfukna")
print(results)

And that's still using the async implementation. Given that I'm perfectly comfortable with avoiding / dropping the synchronous version.

And the async implementation with asynchronous=True would look like this:

async def test_ipfs_async():
    import fsspec
    fs_cls = fsspec.get_filesystem_class('ipfs')
    fs = fs_cls(asynchronous=True)
    session = await fs.set_session()  # creates client
    result = await fs._ls("bafybeief7tmoarwmd26b2petx7crtvdnz6ucccek5wpwxwdvfydanfukna")
    print(result)
    await session.close()  # explicit destructor


if __name__ == '__main__':
    import asyncio
    asyncio.run(test_ipfs_async())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ls fails when a directory contains symlinks (or anything that is not a file or directory).
2 participants