Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behaviour of directories with S3Path #224

Open
omus opened this issue Nov 1, 2021 · 3 comments
Open

Behaviour of directories with S3Path #224

omus opened this issue Nov 1, 2021 · 3 comments

Comments

@omus
Copy link
Member

omus commented Nov 1, 2021

In looking into how S3Path handles interactions with directories I realized the current behaviour wasn't documented very well and no comparison exists to alternatives that could exist. This issue will attempt to serve as a record of different options that can be used when treating non-hierarchical object storage akin to a file system.

A little necessary context. S3 objects have a key name which uniquely identifies the object in the bucket. Objects can be grouped by using keys with similar prefixes and a delimiter (typically "/"). Finally, objects that use "/" as the last character in the key are treated as folders. If you upload an object with a key with a trailing "/" which is empty (0-bytes) then the S3 console will treat this as an empty directory (missing official reference)

Option 1
Make S3 be as file-system like as possible:

  • mkpath: Make a zero-byte file to create each directory and sub-directory.
  • isdir: checks that an object exists with the specified prefix
  • Deleting the last object in a directory leaves all directories intact. Deleting an empty directory leaves parent directories intact.

Option 2
Make S3 like a file-system but optimize using some object storage characteristics:

  • mkpath: Make a zero-byte file to create only the lowest level directory.
  • isdir: checks that an object exists with the specified prefix
  • Deleting the last object in a directory leaves all directories intact. Deleting an empty directory automatically removes all unpopulated directories

Options 3a
Embrace prefix storage:

  • mkpath: no-op
  • isdir: checks that an object exists with the specified prefix
  • Deleting the last object in a directory automatically removes all unpopulated directories. Some extra cautious code could fail if it checks for isdir after mkpath

Options 3b
Same as 3a but:

  • isdir: always false, embrace that directories don't really exist on S3. Should work fine with isdir(x) || mkpath(x)

Options 3c
Same as 3a but:

  • isdir: true, as isdir is usually paired with directory creation just state that all directories exist. Probably not a good idea if someone expects that files/directories cannot exist with the same name (they can on S3)

Related: I also think we should strive to have the AbstractPath interface work in a way that generic functions can be created which work on a variety of different paths without having to have any special cases.

@omus
Copy link
Member Author

omus commented Nov 1, 2021

Currently our implementation is "Option 1". I think "Option 2" is a faster and non-breaking option we should probably use.

I should point out that with our current implementation: if a object is added into a bucket with external tooling then using mkpath with that object's prefix will result in no empty files being written. This is important to note as it calls out the code in AWSS3 cannot rely on zero-byte empty directory files existing

@ericphanson
Copy link
Member

ericphanson commented Nov 2, 2021

Relatedly, should readdir error if a directory “doesn’t exist” (ie a prefix with nothing there) or should it return an empty vector? A few cases:

  • a non-empty file exists there already
  • it does not end with a / but nothing exists there already

edit: oops, clicked close by accident

@omus
Copy link
Member Author

omus commented Nov 2, 2021

Relatedly, should readdir error if a directory “doesn’t exist” (ie a prefix with nothing there) or should it return an empty vector?

I think the typical readdir behaviour should be to error when isdir is false. This heuristic doesn't really work with 3b though.

a non-empty file exists there already

I may be misunderstanding this cases but if there is a non-empty file then I'd expect readdir to return a vector with at least one entry.

it does not end with a / but nothing exists there already

Using readdir on a key that does not end in / currently doesn't work as it isn't deemed a directory. For generic programming on paths we probably should say a call to readdir should interpret the provided S3Path as a directory and automatically append the /. This keeps inline with the readdir behaviour.

Specifically, for your case since no object exists with that prefix readdir would raise an exception.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants