[FEATURE] Controls for data schema for images when exporting datasets and records #5458

Open
burtenshaw opened this issue Sep 4, 2024 · 1 comment

@burtenshaw
Contributor

Is your feature request related to a problem? Please describe.

When using Argilla responses in a downstream task such as model training, only some of the information from Argilla is necessary, mainly the responses to questions.

Also, if Argilla datasets contain larger media formats such as images, retrieving just these responses is cumbersome and time-consuming. Users might want to skip the media fields, or get the original local file paths instead.

Describe the solution you'd like

  • A simple solution is to support with_fields=False in DatasetRecords, so that a user can iterate over only the responses and align them with the source dataset based on record id.
  • A more advanced feature would allow the user to define a mapping between Argilla and a Hugging Face dataset, in the same way that DatasetRecord.log works, so that subcomponents of Argilla fields and questions could be assigned to specific dataset columns using dot notation.
  • For ImageField specifically, a record attribute holding other string representations of the image (url, uri, file path) could be stored, so that users can retrieve those instead of the PIL object.
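
To make the first two points concrete, here is a rough sketch of what the user-side workflow could look like. It uses plain dicts rather than Argilla objects, and every name in it (`source_rows`, `responses_only`, `resolve`, `align`, the `responses.label` path) is hypothetical, not an existing Argilla API. It only illustrates joining responses-only records back onto the source dataset by record id and routing a nested value to a column through a dot-notation mapping.

```python
from typing import Any

# Hypothetical source dataset: image paths stay on disk, no PIL objects needed.
source_rows = [
    {"id": "rec-1", "image_path": "images/cat.png"},
    {"id": "rec-2", "image_path": "images/dog.png"},
]

# Hypothetical shape of a fields-free export: only the id and the responses.
responses_only = [
    {"id": "rec-2", "responses": {"label": "dog"}},
    {"id": "rec-1", "responses": {"label": "cat"}},
]


def resolve(record: dict[str, Any], dotted: str) -> Any:
    """Follow a dot-notation path such as 'responses.label' into nested dicts."""
    value: Any = record
    for key in dotted.split("."):
        value = value[key]
    return value


def align(source, exported, mapping):
    """Join exported records onto source rows by id and copy mapped values into columns."""
    by_id = {rec["id"]: rec for rec in exported}
    rows = []
    for row in source:
        rec = by_id.get(row["id"])
        if rec is None:
            continue  # this source row has no responses yet
        merged = dict(row)
        for dotted, column in mapping.items():
            merged[column] = resolve(rec, dotted)
        rows.append(merged)
    return rows


# Mapping from a dotted path in the exported record to a target column name.
training_rows = align(source_rows, responses_only, {"responses.label": "label"})
```

The join preserves the order of the source dataset, so the result can be used directly for training without ever loading the images through Argilla.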

Describe alternatives you've considered

The only current solution is to export everything with to_datasets and then drop or manipulate rows.
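
As a rough illustration of why this workaround is costly, the sketch below simulates it with plain Python. `export_everything` and `drop_columns` are hypothetical stand-ins, not Argilla functions, and the `bytes` payload is a placeholder for a decoded image. With a real Hugging Face dataset the dropping step would typically be `Dataset.remove_columns`, but the point is the same: the image payload is materialized before it can be discarded.

```python
def export_everything():
    """Hypothetical full export: every row carries all fields, including the image."""
    for record_id, label in [("rec-1", "cat"), ("rec-2", "dog")]:
        yield {
            "id": record_id,
            "image": bytes(1_000_000),  # placeholder for a large decoded image
            "label": label,
        }


def drop_columns(rows, unwanted):
    """Remove unwanted columns after the fact; each payload was already built by then."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


# Only the id and the response survive, but each image was loaded along the way.
slim_rows = drop_columns(export_everything(), {"image"})
```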

Additional context

@burtenshaw burtenshaw added this to the v2.2.0 milestone Sep 4, 2024
@burtenshaw burtenshaw changed the title [FEATURE] Controls for data schema when exporting datasets and records [FEATURE] Controls for data schema for images when exporting datasets and records Sep 4, 2024
@nataliaElv nataliaElv modified the milestones: v2.2.0, v2.3.0 Sep 10, 2024
@burtenshaw
Contributor Author

I think that we should implement:

A simple solution is to support with_fields=False in DatasetRecords so that a user can iterate over only the responses and align them with the source dataset based on record id

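However, this goes against the backend data model, where fields are part of the Record object and other attributes, such as suggestions, are not. With with_fields=False the fields would therefore have to be removed from each record, rather than simply not added, which is how suggestions and similar attributes are handled.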
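
To show the asymmetry I mean, here is a simplified sketch. It is not the actual backend or SDK code: the `Record` dataclass, the `STORED` dict, and the `fetch` function are all hypothetical. It only assumes what is described above, namely that fields are intrinsic to a record while suggestions are attached on request.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Record:
    id: str
    fields: dict[str, Any]  # intrinsic: a record cannot be built without them
    responses: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)  # optional extra


# Hypothetical stored data standing in for the backend.
STORED = {
    "rec-1": {
        "fields": {"image": "<large image payload>"},
        "responses": ["cat"],
        "suggestions": ["cat?"],
    }
}


def fetch(record_id: str, with_suggestions: bool = False, with_fields: bool = True) -> Record:
    raw = STORED[record_id]
    # Fields always arrive with the record itself.
    record = Record(id=record_id, fields=raw["fields"], responses=raw["responses"])
    if with_suggestions:
        # Opt-in attribute: added only when requested.
        record = replace(record, suggestions=raw["suggestions"])
    if not with_fields:
        # Opt-out attribute: already present, so it has to be stripped afterwards.
        record = replace(record, fields={})
    return record


responses_only = fetch("rec-1", with_fields=False)
```

In this sketch with_suggestions is additive while with_fields=False is subtractive, which is the inconsistency with the data model.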

@frascuchon @jfcalvo How do you think we should approach this?

@burtenshaw burtenshaw modified the milestones: v2.3.0, v2.4.0 Oct 1, 2024