Default metadata #477

Open
9 tasks
pierrotsmnrd opened this issue Aug 5, 2024 · 3 comments
Labels
type: enhancement 💅 New feature or request type: RFD ⚖️ Decision making

Comments

@pierrotsmnrd
Contributor

Feature description

This issue lists the default metadata that would be nice to have:

  • original file path
  • filename
  • complete file extension (for instance, a file named xyz.foo.bar would have its complete file extension set to foo.bar)
  • file extension (with the previous example, the last file extension would be bar)
  • creation date
  • last modification date
  • date of upload
  • filesize
  • username of the person who uploaded the file

Not all of this metadata may be available when a file is uploaded; we need to figure out which items are possible.
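
A minimal sketch of which of these fields can be read for a local file with the standard library alone. The function name `collect_default_metadata` and the dictionary keys are illustrative assumptions, not part of Ragna. The creation date is the field most likely to be unavailable: `st_birthtime` only exists on some platforms, so the sketch falls back to `None`. The upload date and the username cannot be read from the file itself, so the first is taken at call time and the second is passed in.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional


def collect_default_metadata(path: str, username: Optional[str] = None) -> dict[str, Any]:
    """Gather the proposed default metadata for a file on local disk (hypothetical helper)."""
    p = Path(path).expanduser().resolve()
    stat = p.stat()
    # st_birthtime is only available on some platforms (e.g. macOS, BSD);
    # elsewhere the creation date is reported as unknown.
    birthtime = getattr(stat, "st_birthtime", None)
    return {
        "path": str(p),
        "filename": p.name,
        # "xyz.foo.bar" -> "foo.bar"
        "complete_extension": "".join(p.suffixes).lstrip("."),
        # "xyz.foo.bar" -> "bar"
        "extension": p.suffix.lstrip("."),
        "created": birthtime,
        "modified": stat.st_mtime,
        # Taken at call time, as a Unix timestamp in UTC.
        "uploaded": datetime.now(timezone.utc).timestamp(),
        "size": stat.st_size,
        "username": username,
    }
```

Storing the dates as Unix timestamps here is only one option; the format question is discussed further down in the thread.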

Value and/or benefit

No response

Anything else?

No response

@pierrotsmnrd pierrotsmnrd added the type: enhancement 💅 New feature or request label Aug 5, 2024
@pmeier
Member

pmeier commented Aug 6, 2024

Before I go over the individual proposals, one thing upfront: although we use ragna.core.LocalDocument by default, the user is free to use any subclass of ragna.core.Document:

document: ImportString[type[Document]] = "ragna.core.LocalDocument" # type: ignore[assignment]

class Document(RequirementsMixin, abc.ABC):

class LocalDocument(Document):

The only metadata attached to a plain Document is the ID and the name of the document:

self.id = id or uuid.uuid4()
self.name = name

Subclasses can add more metadata to this, e.g. LocalDocument adds the path:

if metadata is None:
    metadata = {}
elif "path" in metadata:
    raise RagnaException(
        "The metadata already includes a 'path' key. "
        "Did you mean to instantiate the class directly?"
    )
path = Path(path).expanduser().resolve()
metadata["path"] = str(path)

All this is to say: we need to differentiate between metadata that we can add to all Documents and metadata that only applies to LocalDocuments.
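
To make the distinction concrete, here is a sketch using simplified stand-ins for the two classes. `SketchDocument` and `SketchLocalDocument` are hypothetical and only mirror the snippets quoted above; they are not Ragna's actual implementation, and the `size` key is an assumed name used for illustration.

```python
import uuid
from pathlib import Path
from typing import Any, Optional


class SketchDocument:
    """Stand-in for a generic document: only an ID, a name, and free-form metadata."""

    def __init__(
        self,
        name: str,
        metadata: Optional[dict[str, Any]] = None,
        id: Optional[uuid.UUID] = None,
    ) -> None:
        self.id = id or uuid.uuid4()
        self.name = name
        self.metadata = dict(metadata or {})


class SketchLocalDocument(SketchDocument):
    """Stand-in for a document backed by a local file, which can add file-system metadata."""

    def __init__(self, path: str, metadata: Optional[dict[str, Any]] = None) -> None:
        metadata = dict(metadata or {})
        if "path" in metadata:
            raise ValueError("The metadata already includes a 'path' key.")
        resolved = Path(path).expanduser().resolve()
        metadata["path"] = str(resolved)
        # Only possible because the file is on local disk; a generic
        # document may have no file to inspect.
        if resolved.is_file():
            metadata["size"] = resolved.stat().st_size
        super().__init__(name=resolved.name, metadata=metadata)
```

A generic document ends up with whatever metadata the caller supplies, while the local one can always derive `path` and, if the file exists, `size`.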


  • original file path

Not applicable to Document and already available for LocalDocument under metadata["path"]

  • filename

Available on all documents in Python with document.name and for MetadataFilter canonically with "document_name"

  • complete file extension
  • filesize

Maybe applicable to all documents, but certainly to LocalDocument. If we want to add it as metadata for all documents, I would like to have a compelling use case. I currently can't think of one.

  • creation date
  • last modification date
  • date of upload

What format would the metadata be for these?

  • If timestamp strings, how would we filter on them?
  • If Unix timestamp floats, how would we create a compelling UI for that so users don't have to deal with them directly?
  • Something else?

  • username of the person who uploaded the file

The username is stored in the Ragna DB, but not available for filtering.

class Document(Base):
    __tablename__ = "documents"
    id = Column(types.Uuid, primary_key=True)  # type: ignore[attr-defined]
    user_id = Column(ForeignKey("users.id"))

What would be the use case here?

@pmeier pmeier added the type: RFD ⚖️ Decision making label Aug 6, 2024
@pierrotsmnrd
Contributor Author

  • filesize use case: it could be used to keep, for example, "all the PDFs big enough to have images"
  • creation / last modification / upload dates: I'd recommend the format "%Y-%m-%d %H:%M:%S", so that we can filter on, for example, all files uploaded after a given day
  • username: the use case would be to filter only on documents uploaded by yourself, or by, say, the legal department, etc.
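
As a rough illustration of the file size use case, the sketch below filters plain metadata dictionaries in Python. The keys `extension` and `size` and the 1 MiB threshold are assumptions chosen for the example; how such a filter would be expressed against a real source storage is not settled in this thread.

```python
from typing import Any, Iterable

MIN_SIZE_BYTES = 1024 * 1024  # assumed threshold: 1 MiB


def large_pdfs(
    documents: Iterable[dict[str, Any]], min_size: int = MIN_SIZE_BYTES
) -> list[dict[str, Any]]:
    """Keep only PDF entries whose size is at least `min_size` bytes."""
    return [
        doc
        for doc in documents
        if doc.get("extension", "").lower() == "pdf" and doc.get("size", 0) >= min_size
    ]
```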

@pmeier
Member

pmeier commented Aug 7, 2024

  • filesize use case: it could be used to keep, for example, "all the PDFs big enough to have images"

Let's start with adding that to LocalDocument. There we can be sure that the information is available. I'll send a PR.

  • creation / last modification / upload dates: I'd recommend the format "%Y-%m-%d %H:%M:%S", so that we can filter on, for example, all files uploaded after a given day

I understand the intention, but how would that be implemented? We can't do numeric comparisons like > on strings?
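
One possible way to reconcile the two formats, sketched with the standard library only: keep the human-readable "%Y-%m-%d %H:%M:%S" string at the user-facing edge and convert it to a Unix timestamp for storage and comparison. The helper names and the `uploaded` key are hypothetical, and treating the strings as UTC is an assumption. Whether a given source storage supports numeric comparison operators on metadata would still need to be checked.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_timestamp(value: str) -> float:
    """Parse a formatted date string (assumed to be UTC) into a Unix timestamp."""
    return datetime.strptime(value, DATE_FORMAT).replace(tzinfo=timezone.utc).timestamp()


def from_timestamp(value: float) -> str:
    """Format a Unix timestamp back into the human-readable string."""
    return datetime.fromtimestamp(value, tz=timezone.utc).strftime(DATE_FORMAT)


def uploaded_after(documents: list[dict], cutoff: str) -> list[dict]:
    """Return the documents whose numeric "uploaded" value is later than `cutoff`."""
    threshold = to_timestamp(cutoff)
    return [doc for doc in documents if doc["uploaded"] > threshold]
```

With this split, users only ever type or see the formatted string, while the comparison itself is done on floats.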

  • username: the use case would be to filter only on documents uploaded by yourself, or by, say, the legal department, etc.

If this is required, IMO the user should just have their own corpus or use tags for the department.

To look at it from the other side: what if an admin uploads documents for the whole organization? Is the username useful information in that case?

I'd leave this out for now until a concrete use case arises.


2 participants