Add changes for the labelstudio onboarding feature #64

elsheikhams99 · 2023-08-16T11:15:37Z

Label Studio organizes work into distinct projects, each distinguished by the Label Studio authentication API key used. Each project hosts a variety of tasks, often spanning multiple data types. The aixplain.processes.data_onboarding.labelstudio_onboard method handles both audio and text data, whether it comes from a Label Studio task or a project.

To use this method, you need the relevant Label Studio project or task ID and the corresponding Label Studio authentication API key.

krishnadurai

Take a look at the Dataset Factory: https://github.com/aixplain/aiXplain/blob/main/aixplain/factories/dataset_factory.py
Create a class similar to the Dataset Factory for LabelStudioData (choose an appropriate name). I want that section of your code in the Jupyter notebook to look like the following:

This assumes that LABEL_STUDIO_KEY is set as an environment variable.

payload = labelstudiodata.create(
    data_name = data_name,
    project_id = project_id,
    onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?
    data_description = data_description
)

Instead of:

payload = labelstudio_onboard.auto_onboard(
    labelstudio_key = labelstudio_key,
    data_name = data_name,
    project_id = project_id,
    onboard = onboard,
    data_description = data_description
)

Please work towards this change in the notebook. Also, for the time being, clear the output from the cells before committing.
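
A minimal sketch of how such a class could pick up the key from the environment instead of taking it as an argument. The class body, the `_get_key` helper and the returned dictionary are hypothetical placeholders for illustration, not the actual aiXplain implementation; only the `LABEL_STUDIO_KEY` variable name and the `create` arguments come from the snippet above.

```python
import os
from typing import Any, Dict, Optional


class LabelStudioData:
    """Hypothetical factory-style class; the real name and body are up to the PR author."""

    KEY_ENV_VAR = "LABEL_STUDIO_KEY"

    @classmethod
    def _get_key(cls) -> str:
        # Read the key from the environment so callers never pass it explicitly.
        key = os.environ.get(cls.KEY_ENV_VAR)
        if not key:
            raise RuntimeError(f"{cls.KEY_ENV_VAR} is not set in the environment")
        return key

    @classmethod
    def create(
        cls,
        data_name: str,
        project_id: int,
        data_description: Optional[str] = None,
    ) -> Dict[str, Any]:
        cls._get_key()  # fails early if the key is missing
        # Placeholder payload: a real implementation would fetch the project
        # from Label Studio with the key and onboard it here.
        return {
            "data_name": data_name,
            "project_id": project_id,
            "data_description": data_description,
        }
```

With this shape, the notebook cell only needs `LabelStudioData.create(data_name=..., project_id=..., data_description=...)`.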

elsheikhams99 · 2023-08-17T11:39:18Z

onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?

When I was writing the program, I thought that the user may want to just retrieve the data and save it to a .CSV file (that's what happens if onboard=False), to manually review before onboarding/use it for other purposes. If you think that it won't be beneficial, let me know and I can remove it.

For example, the labelstudio auto onboarding script doesn't specify the aixplain.enums.Language of each column. It's not possible to do that unless it's interactive. So, the user may set onboard to False, and manually create the metadata for the columns in the dataset, so that they could specify the aixplain.enums.Language.

krishnadurai · 2023-08-17T12:10:01Z

onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?

When I was writing the program, I thought that the user may want to just retrieve the data and save it to a .CSV file (that's what happens if onboard=False), to manually review before onboarding/use it for other purposes. If you think that it won't be beneficial, let me know and I can remove it.

You should look to introduce a function in LabelStudioData to download this data as a CSV separate from the create function. It is beneficial to the user, although it shouldn't be a part of the create function. You can compose your create and download functions by using sub-functions to reuse code.
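
Roughly, the composition could look like the sketch below. The names `_fetch_rows`, `download` and `create`, and the hard-coded rows, are assumptions for illustration; `_fetch_rows` stands in for whatever code actually retrieves the data from Label Studio.

```python
import csv
from typing import Dict, List

Row = Dict[str, str]


def _fetch_rows(project_id: int) -> List[Row]:
    """Hypothetical shared sub-function; stands in for the Label Studio retrieval step."""
    return [
        {"id": "1", "text": "hello"},
        {"id": "2", "text": "world"},
    ]


def download(project_id: int, csv_path: str) -> str:
    """Save the retrieved data to a CSV file for manual review, without onboarding."""
    rows = _fetch_rows(project_id)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return csv_path


def create(data_name: str, project_id: int) -> Dict[str, object]:
    """Onboard the data; reuses the same retrieval step as download."""
    rows = _fetch_rows(project_id)
    # Placeholder: the real onboarding call would go here.
    return {"data_name": data_name, "project_id": project_id, "num_rows": len(rows)}
```

Both entry points share the retrieval code, so create no longer needs an onboard flag.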

For example, the labelstudio auto onboarding script doesn't specify the aixplain.enums.Language of each column. It's not possible to do that unless it's interactive. So, the user may set onboard to False, and manually create the metadata for the columns in the dataset, so that they could specify the aixplain.enums.Language.

Can you design your flow so that adding the language metadata is part of the onboarding flow for the create function? I understand that there are a couple of options:

Make this flow interactive with the user supplying the metadata in one of the steps while or before create is called.
Interpret or detect the languages automatically in the create functionality. Language detection should be a simple addition.
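
Both options can sit behind one step, sketched below under some assumptions: `resolve_languages` is a hypothetical helper, languages are plain string codes rather than aixplain.enums.Language members, and `detect` is any callable you plug in (no particular detection library is implied).

```python
from typing import Callable, Dict, List, Optional


def resolve_languages(
    columns: Dict[str, List[str]],
    supplied: Optional[Dict[str, str]] = None,
    detect: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Return one language code per text column.

    User-supplied metadata wins (option 1); otherwise fall back to an
    automatic detector if one was given (option 2).
    """
    supplied = supplied or {}
    resolved: Dict[str, str] = {}
    for name, values in columns.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif detect is not None and values:
            # Detect on the concatenated column text.
            resolved[name] = detect(" ".join(values))
        else:
            raise ValueError(f"No language supplied or detected for column {name!r}")
    return resolved
```

The create function could call this before onboarding and map the resulting codes to aixplain.enums.Language.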

cc @thiago-aixplain

thiago-aixplain · 2023-08-17T15:34:19Z

@elsheikhams99

aixplain.processes is supposed to be abstracted from the user. The idea of the SDK is to handle aiXplain assets, available in aixplain.modules. The CRUD (create, read, update and delete) operations on the assets are done by "asset factories", available in aixplain.factories. I would recommend moving your auto_onboard method from processes to a LabelStudioFactory and renaming it to create, as @krishnadurai suggested. The goal of this method is to create either a LabelStudioProject asset with the project id, tasks, data and any other information from a Label Studio project, or a Corpus asset from aiXplain:

LabelStudioFactory.create(
    labelstudio_key: str,
    corpus_name: str,
    task_id: Optional[int] = None,
    project_id: Optional[int] = None,
    columns_to_drop: Optional[List[str]] = None,
    onboard: Optional[bool] = False,
    corpus_description: Optional[str] = None
) -> Union[Corpus, LabelStudioProject]:
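
Since task_id and project_id are both optional in this signature, create would need to check them up front. A small sketch, assuming that exactly one of the two must be given (that rule and the `check_source_ids` helper are assumptions, not something stated in this thread):

```python
from typing import Optional


def check_source_ids(task_id: Optional[int] = None, project_id: Optional[int] = None) -> str:
    """Return which source to read from, assuming exactly one id must be provided."""
    if (task_id is None) == (project_id is None):
        raise ValueError("Provide exactly one of task_id or project_id")
    return "task" if task_id is not None else "project"
```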

thiago-aixplain · 2023-08-17T15:13:12Z