New async & distributed ML backend #515
Controller can be tested here:

Example request:

```json
{
  "projectId": "ecos",
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "sourceImages": [
    {
      "id": "NScxODE3NzEyMwo=",
      "url": "https://anon.erda.au.dk/share_redirect/DSWDMAO70L/ias/denmark/DK1/2023_07_05/20230705000135-00-07.jpg",
      "eventId": "1234"
    }
  ],
  "pipelineConfig": {
    "stages": [
      {
        "stage": "OBJECT_DETECTION",
        "stageImplementation": "flatbug"
      },
      {
        "stage": "CLASSIFICATION",
        "stageImplementation": "mcc24"
      }
    ]
  },
  "callback": {
    "callbackUrl": "http://127.0.0.1:8080/example/callback",
    "callbackToken": "1234"
  }
}
```

The callback can be inspected locally using Ngrok:
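For a quick local check without Ngrok, a throwaway HTTP server is enough to see what the controller sends. The sketch below is illustrative only: it assumes the controller POSTs JSON to the `callbackUrl` and that the `callbackToken` is passed as a Bearer token in the `Authorization` header, which should be verified against the controller's actual contract.

```python
# Minimal local callback receiver (a sketch for manual testing only).
# Assumptions: the controller POSTs JSON to callbackUrl, and callbackToken
# arrives as "Authorization: Bearer <token>" -- verify against the real contract.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "1234"  # must match "callbackToken" in the request above

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("Path:", self.path)  # e.g. /example/callback
        print("Payload:", json.dumps(json.loads(body or b"{}"), indent=2))
        # Reject requests whose token does not match (assumed auth scheme).
        token_ok = self.headers.get("Authorization") == f"Bearer {EXPECTED_TOKEN}"
        self.send_response(200 if token_ok else 401)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), CallbackHandler).serve_forever()
```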
Attached: image that causes a truncated request.
There are a few refactorings I'm considering in the PipelineController.
@kaviecos Have you considered adding a status endpoint on the controller? If a callback is missed, or a job is taking a long time, it would be nice for the client to be able to request the status of a job: PROCESSING, FAILED, WAITING, 2/6 complete, etc. Also, will you document the types of failures and what those will look like in a callback? I just added these as subtasks as well.
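A hypothetical response shape for such a status endpoint could cover all of the cases above. The field names and enum values here are illustrative only, not part of the current spec:

```json
{
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "status": "PROCESSING",
  "stagesComplete": 2,
  "stagesTotal": 6,
  "failures": [
    {
      "stage": "CLASSIFICATION",
      "imageId": "NScxODE3NzEyMwo=",
      "error": "TIMEOUT"
    }
  ]
}
```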
NOTES from call with @mihow and @kaviecos, 2024-08-20:
- docs for OpenAI
@kaviecos Let's also keep in mind the scenario where stage implementations run behind a firewall and cannot have a publicly accessible endpoint. Our major providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like the ones we use now. For a big job like Biodiversa+'s data, we can request a batch of GPUs and provide the Docker container to run via a SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results, but it won't have a publicly accessible endpoint (unless we can use a tunnel?).
@mihow The current implementation actually allows for consumers hosted on other servers. The requirements are that they can make an outbound connection to the RabbitMQ server (port 5672); all communication with the controller can then go through RabbitMQ. They also need to be able to access the source images, of course. For monitoring, this means we need to monitor both the consumers and the stage implementations, and if we cannot make inbound HTTP requests, we need to factor that into the monitoring solution, perhaps with a push-based approach (last_seen, as you mentioned). It would be nice to know more about the restrictions: is it even possible to connect to RabbitMQ at all?
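To make the outbound-only pattern concrete, here is a rough sketch of such a consumer using the Python pika client. The queue names, credentials, and message fields are invented for illustration; only the connection direction (outbound to RabbitMQ on port 5672, no inbound endpoint) reflects the design described above.

```python
# Sketch of a stage-implementation consumer running behind a firewall.
# It only needs an *outbound* TCP connection to the broker; it never has
# to accept inbound requests. All names below are illustrative assumptions.
import json
import pika  # third-party client: pip install pika

params = pika.ConnectionParameters(
    host="rabbitmq.example.org",  # hypothetical broker host
    port=5672,
    credentials=pika.PlainCredentials("consumer", "secret"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

def handle_task(ch, method, properties, body):
    task = json.loads(body)
    # Run the actual model on task["sourceImages"] here; this stub just echoes.
    result = {"jobId": task.get("jobId"), "status": "DONE"}
    # Send the result back to the controller through the same broker.
    ch.basic_publish(exchange="", routing_key="stage.results",
                     body=json.dumps(result))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="stage.object_detection",
                      on_message_callback=handle_task)
channel.start_consuming()
```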
@kaviecos and @mihow have designed & written the specifications for a new ML backend that orchestrates multiple types of models by different research teams, across multiple stages of processing, and is horizontally scalable. This expands on the current ML backend API defined at https://ml.dev.insectai.org/ by adding asynchronous processing, a controller & queue system, auth, and many other production features.
The initial spec and notes are here, but are being rewritten in the Aarhus GitLab wiki as the backend is developed:
https://docs.google.com/document/d/1caKxxfZhWhRi9Jfv9fy5fVeoM9bvhYPJ/
Docs in progress:
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Getting-Started
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Pipeline-Stages
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Architecture-Overview
Known remaining tasks: