Add changes for the labelstudio onboarding feature #64
base: main
Take a look at the Dataset Factory: https://github.com/aixplain/aiXplain/blob/main/aixplain/factories/dataset_factory.py
Create a factory-style class, like the Dataset Factory, for LabelStudioData (choose an appropriate name). I want the relevant section of your code in the Jupyter notebook to look like the following,
assuming that LABEL_STUDIO_KEY is set as an environment variable:
payload = labelstudiodata.create(
data_name = data_name,
project_id = project_id,
onboard = onboard, # why have this option, one would only call this function if they want to onboard, no?
data_description = data_description
)
Instead of:
payload = labelstudio_onboard.auto_onboard(
labelstudio_key = labelstudio_key,
data_name = data_name,
project_id = project_id,
onboard = onboard,
data_description = data_description
)
Please work towards this change in the notebook. Also, for the time being, clear the cell outputs before committing.
When I was writing the program, I thought the user might want to just retrieve the data and save it to a .csv file (that's what happens if onboard=False), either to review it manually before onboarding or to use it for other purposes. If you think that won't be beneficial, let me know and I can remove it. For example, the labelstudio auto onboarding script doesn't specify the |
You should look to introduce a function in LabelStudioData to download this data as a CSV separate from the create function. It is beneficial to the user, although it shouldn't be a part of the create function. You can compose your create and download functions by using sub-functions to reuse code.
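As a rough sketch of the split suggested above, a factory-style class could expose create and a separate CSV download, both reusing one private retrieval helper. The method names download_csv and _fetch_rows, the injected fetch_tasks callable, and the payload fields are illustrative assumptions, not the aiXplain SDK's actual API; only the LABEL_STUDIO_KEY environment variable and the create arguments come from the thread.

```python
import csv
import os
from typing import Callable, Dict, List, Optional, Text

# Hypothetical type: a callable that returns a project's task rows as dicts.
TaskFetcher = Callable[[Text, Text], List[Dict]]


class LabelStudioData:
    """Illustrative factory; every name here is an assumption."""

    def __init__(self, fetch_tasks: TaskFetcher, api_key: Optional[Text] = None):
        # Read the key from the environment unless one is passed explicitly.
        self.api_key = api_key or os.environ.get("LABEL_STUDIO_KEY")
        if not self.api_key:
            raise ValueError("LABEL_STUDIO_KEY is not set")
        self._fetch_tasks = fetch_tasks

    def _fetch_rows(self, project_id: Text) -> List[Dict]:
        # Shared sub-function reused by both public methods.
        return self._fetch_tasks(self.api_key, project_id)

    def download_csv(self, project_id: Text, path: Text) -> Text:
        """Save the project's rows to a CSV file for manual review."""
        rows = self._fetch_rows(project_id)
        fieldnames = sorted({key for row in rows for key in row})
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        return path

    def create(
        self,
        data_name: Text,
        project_id: Text,
        data_description: Optional[Text] = None,
    ) -> Dict:
        """Build an onboarding payload; the real onboarding call is omitted."""
        rows = self._fetch_rows(project_id)
        return {
            "name": data_name,
            "description": data_description,
            "project_id": project_id,
            "num_rows": len(rows),
        }
```

With this shape the onboard flag is no longer needed: a user who only wants to inspect the data calls download_csv, and a user who wants to onboard calls create.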
Can you design your flow in such a way that the language metadata addition for the
|
output = response.content
return output

def getAllTasksPerProject(url, header, project_id):
Follow the snake case standard: get_all_tasks_per_project.
Moreover, use typing conventions: get_all_tasks_per_project(url: Text, header: Dict, project_id: Text)
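A sketch of how the renamed, typed function might read. The endpoint path, the response shape (a JSON list of task objects with an id field), and the optional fetch_json parameter are assumptions for illustration only, since the function body in the PR is not fully shown here.

```python
import json
import urllib.request
from typing import Callable, Dict, List, Optional, Text


def _http_get_json(url: Text, header: Dict) -> List[Dict]:
    # Default fetcher: plain GET with the supplied headers, JSON-decoded.
    request = urllib.request.Request(url, headers=header)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def get_all_tasks_per_project(
    url: Text,
    header: Dict,
    project_id: Text,
    fetch_json: Optional[Callable[[Text, Dict], List[Dict]]] = None,
) -> List:
    """Return the IDs of all tasks in the given project."""
    fetch = fetch_json or _http_get_json
    # Assumed endpoint layout; adjust to the Label Studio API actually in use.
    tasks = fetch(f"{url}/api/projects/{project_id}/tasks", header)
    return [task_info["id"] for task_info in tasks]
```

Passing the fetcher as an optional argument is a design choice made here so the function can be exercised in tests without a live Label Studio server.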
Returns:
list: A list containing task IDs of all tasks in the specified project.
"""
output = list()
output = []
output.append(taskInfo['id'])
return output

def extractData(dict):
snake case and typing
else:
return {'Error': 'No data'}

def getAllDataForAProject(url, header, project_id):
snake case and typing
def auto_onboard(
labelstudio_key: str,
data_name: str,
Instead of "data", use "corpus" to refer to the data collection, so that:
data_name -> corpus_name
data_description -> corpus_description
docs/user/user_doc.md
Outdated
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-FYTtyVaDxyVv7kGCaMEd5E3uiHYHRTt?usp=sharing)

### Label Studio Corpus Onboarding
Validate documentation with Nur please
@krishnadurai @thiago-aixplain I will work on those changes.
Hello @krishnadurai and @thiago-aixplain. Hope all is well. Apologies for the delay; I have been busy with other urgent projects with Dr. Kareem. I committed again with the following changes:
…ioData module, and modified documentation
Thank you @elsheikhams99. I think I did leave an optional comment in the code. Besides that, I think only the tests are left. Could you please write some functional tests on |
language: Optional[List[str]] = None,
corpus_description: Optional[str] = None
) -> Dict:
"""
You may want to move part of this code to labelstudio_functions.py
The Label Studio platform is organized into distinct projects, each distinguished by the Label Studio authentication API key used. Each project hosts a variety of tasks, often spanning multiple data types. The aixplain.processes.data_onboarding.labelstudio_onboard method handles both audio and text data, whether it originates from a single Label Studio task or from a whole project. To use this method, you need the relevant Label Studio project or task ID and the corresponding Label Studio authentication API key.