Add changes for the labelstudio onboarding feature #64

Open
wants to merge 4 commits into
base: main

@elsheikhams99 elsheikhams99 commented Aug 16, 2023

Label Studio organizes work into projects, each tied to the Label Studio authentication API key used to access it. Each project hosts a variety of tasks, often spanning multiple data types. The aixplain.processes.data_onboarding.labelstudio_onboard method handles both audio and text data, whether it comes from a single Label Studio task or from a whole project.

To use this method, you need the relevant Label Studio project or task ID and the corresponding Label Studio authentication API key.

Contributor

@krishnadurai krishnadurai left a comment

Take a look at the Dataset Factory: https://github.com/aixplain/aiXplain/blob/main/aixplain/factories/dataset_factory.py
Create a class modeled on the Dataset Factory for LabelStudioData (choose an appropriate name). I want the relevant section of your code in the Jupyter notebook to look like the following:

This assumes that LABEL_STUDIO_KEY is set as an environment variable.

payload = labelstudiodata.create(
    data_name = data_name,
    project_id = project_id,
    onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?
    data_description = data_description
)

Instead of:

payload = labelstudio_onboard.auto_onboard(
    labelstudio_key = labelstudio_key,
    data_name = data_name,
    project_id = project_id,
    onboard = onboard,
    data_description = data_description
)

Please work towards this change in the notebook. Also, for the time being, clear the cell outputs before committing.
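
A minimal sketch of the shape being asked for, assuming the key is read from the LABEL_STUDIO_KEY environment variable inside the class rather than passed by the caller. The class name LabelStudioData, the helper _get_key, and the returned dictionary are illustrative placeholders, not the actual aiXplain implementation:

```python
import os
from typing import Dict, Optional


class LabelStudioData:
    """Hypothetical factory-style class; names and fields are illustrative only."""

    ENV_VAR = "LABEL_STUDIO_KEY"

    @classmethod
    def _get_key(cls) -> str:
        # Read the API key from the environment so callers never pass it explicitly.
        key = os.environ.get(cls.ENV_VAR)
        if not key:
            raise EnvironmentError(f"{cls.ENV_VAR} is not set")
        return key

    @classmethod
    def create(
        cls,
        data_name: str,
        project_id: int,
        data_description: Optional[str] = None,
    ) -> Dict:
        cls._get_key()  # fail early if the key is missing
        # A real implementation would fetch the project data and onboard it here.
        return {
            "data_name": data_name,
            "project_id": project_id,
            "data_description": data_description,
        }
```

With this shape the notebook cell only supplies what is specific to the data, and a missing key fails early with a clear error.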

@elsheikhams99
Author

> onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?

When I was writing the program, I thought that the user may want to just retrieve the data and save it to a .CSV file (that's what happens if onboard=False), to manually review before onboarding/use it for other purposes. If you think that it won't be beneficial, let me know and I can remove it.

For example, the labelstudio auto onboarding script doesn't specify the aixplain.enums.Language of each column. It's not possible to do that unless it's interactive. So, the user may set onboard to False, and manually create the metadata for the columns in the dataset, so that they could specify the aixplain.enums.Language.

@krishnadurai
Contributor

> onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?
>
> When I was writing the program, I thought that the user may want to just retrieve the data and save it to a .CSV file (that's what happens if onboard=False), to manually review before onboarding/use it for other purposes. If you think that it won't be beneficial, let me know and I can remove it.

You should look to introduce a function in LabelStudioData to download this data as a CSV separate from the create function. It is beneficial to the user, although it shouldn't be a part of the create function. You can compose your create and download functions by using sub-functions to reuse code.

> For example, the labelstudio auto onboarding script doesn't specify the aixplain.enums.Language of each column. It's not possible to do that unless it's interactive. So, the user may set onboard to False, and manually create the metadata for the columns in the dataset, so that they could specify the aixplain.enums.Language.

Can you design your flow so that adding the language metadata is part of the onboarding flow of the create function? I understand that there are a couple of options:

  1. Make this flow interactive with the user supplying the metadata in one of the steps while or before create is called.
  2. Interpret or detect the languages automatically in the create functionality. Language detection should be a simple addition.

cc @thiago-aixplain

@thiago-aixplain
Collaborator

@elsheikhams99

aixplain.processes is supposed to be abstracted away from the user. The idea of the SDK is to handle aiXplain assets, available in aixplain.modules. The CRUD (create, read, update and delete) operations on the assets are done by "asset factories", available in aixplain.factories. I would recommend moving your auto_onboard method from processes to a LabelStudioFactory and renaming it to create, as @krishnadurai suggested. The goal of this method is to create either a LabelStudioProject asset (with the project ID, tasks, data and any other information from a Label Studio project) or a Corpus asset from aiXplain:

LabelStudioFactory.create(
    labelstudio_key: str,
    corpus_name: str,
    task_id: Optional[int] = None,
    project_id: Optional[int] = None,
    columns_to_drop: Optional[List[str]] = None,
    onboard: Optional[bool] = False,
    corpus_description: Optional[str] = None
) -> Union[Corpus, LabelStudioProject]:
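
Since both task_id and project_id are optional in this signature, the method presumably has to decide which source to read from. A minimal sketch of that check, assuming exactly one of the two must be given (the thread does not state this rule, and resolve_source is a hypothetical helper name):

```python
from typing import Optional


def resolve_source(
    task_id: Optional[int] = None,
    project_id: Optional[int] = None,
) -> str:
    """Return which Label Studio source to read, assuming exactly one ID is given."""
    # Both missing or both present is ambiguous, so reject it up front.
    if (task_id is None) == (project_id is None):
        raise ValueError("Provide exactly one of task_id or project_id")
    return "task" if task_id is not None else "project"
```

Validating this at the top of create keeps the error close to the caller instead of surfacing later as a failed request.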

output = response.content
return output

def getAllTasksPerProject(url, header, project_id):
Collaborator

Follow the snake_case naming standard:

get_all_tasks_per_project

Also, use type hints:

get_all_tasks_per_project(url: Text, header: Dict, project_id: Text)
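
A sketch of what the renamed, typed function could look like. To keep it runnable without a network call, the HTTP request is injected as a fetch_json callable; that extra parameter, the endpoint path, and the assumed response shape (a list of task dictionaries with an "id" field) are illustrative assumptions, not the code in this PR:

```python
from typing import Callable, Dict, List, Text


def get_all_tasks_per_project(
    url: Text,
    header: Dict,
    project_id: Text,
    fetch_json: Callable[[Text, Dict], List[Dict]],
) -> List[int]:
    """Return the IDs of all tasks in the specified project."""
    # Endpoint path is an assumption; adjust to the Label Studio API in use.
    endpoint = f"{url.rstrip('/')}/api/projects/{project_id}/tasks"
    tasks = fetch_json(endpoint, header)
    return [task_info["id"] for task_info in tasks]
```

Injecting the fetcher also makes the function easy to cover in tests with a stub instead of a live Label Studio instance.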

Returns:
list: A list containing task IDs of all tasks in the specified project.
"""
output = list()
Collaborator

output = []

output.append(taskInfo['id'])
return output

def extractData(dict):
Collaborator

snake case and typing

else:
return {'Error': 'No data'}

def getAllDataForAProject(url, header, project_id):
Collaborator

snake case and typing


def auto_onboard(
labelstudio_key: str,
data_name: str,
Collaborator

Instead of "data", use corpus to refer to the data collection so that:

data_name -> corpus_name
data_description -> corpus_description

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-FYTtyVaDxyVv7kGCaMEd5E3uiHYHRTt?usp=sharing)

### Label Studio Corpus Onboarding
Collaborator

Validate documentation with Nur please

@elsheikhams99
Author

@krishnadurai @thiago-aixplain I will work on those changes.

@elsheikhams99
Author

Hello @krishnadurai and @thiago-aixplain. Hope all is well. Apologies for the delay; I have been busy with other urgent projects with Dr. Kareem. I have pushed a new commit with the following changes:

  • Rewrote the .py file as a class named LabelStudioFactory, placed it in aixplain.factories, and updated the formatting and the method/parameter names accordingly.
  • Separated the part that processes the retrieved data and saves it to a .csv file into a class method called LabelStudioFactory.save_to_csv().
  • Added language as an optional parameter (a list) in which the user can pass the codes of the languages they want in the metadata, e.g. LabelStudioFactory.create(...., language=['ar']) or language=['en', 'es'].
  • Updated the notebook documentation, the Colab notebook, and user_doc.md to account for the above changes.

@thiago-aixplain
Collaborator

Thank you @elsheikhams99. I think I left one optional comment in the code. Besides that, I think only the tests are left. Could you please write some functional tests in tests/functional? Follow the functional tests for the other assets as examples.

language: Optional[List[str]] = None,
corpus_description: Optional[str] = None
) -> Dict:
"""
Collaborator

You may want to move part of this code to labelstudio_functions.py
