
New async & distributed ML backend #515

Open · 7 of 20 tasks
mihow opened this issue Aug 15, 2024 · 6 comments
mihow commented Aug 15, 2024

@kaviecos and @mihow have designed and written the specification for a new ML backend that orchestrates multiple types of models from different research teams, across multiple stages of processing, and is horizontally scalable. This expands on the current ML backend API defined at https://ml.dev.insectai.org/ by adding asynchronous processing, a controller and queue system, auth, and many other production features.

The initial spec and notes are here, but they are being rewritten in the Aarhus GitLab wiki as the backend is developed:
https://docs.google.com/document/d/1caKxxfZhWhRi9Jfv9fy5fVeoM9bvhYPJ/

Docs in progress:
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Getting-Started
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Pipeline-Stages
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Architecture-Overview

[Diagram: pipeline_controller_flow.drawio]

Known remaining tasks:

mihow commented Aug 15, 2024

The controller can be tested here:
https://preview.ami.ecoscience.dk/swagger-ui/index.html#/pipeline-controller/createRequest

Example request:

```json
{
  "projectId": "ecos",
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "sourceImages": [
    {
      "id": "NScxODE3NzEyMwo=",
      "url": "https://anon.erda.au.dk/share_redirect/DSWDMAO70L/ias/denmark/DK1/2023_07_05/20230705000135-00-07.jpg",
      "eventId": "1234"
    }
  ],
  "pipelineConfig": {
    "stages": [
      {
        "stage": "OBJECT_DETECTION",
        "stageImplementation": "flatbug"
      },
      {
        "stage": "CLASSIFICATION",
        "stageImplementation": "mcc24"
      }
    ]
  },
  "callback": {
    "callbackUrl": "http://127.0.0.1:8080/example/callback",
    "callbackToken": "1234"
  }
}
```
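
For reference, a minimal Python sketch of submitting this request with the `requests` library. The endpoint path is an assumption based on the Swagger operation name (pipeline-controller/createRequest); check the Swagger UI above for the real route.

```python
import requests

# The request body from the example above.
payload = {
    "projectId": "ecos",
    "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
    "sourceImages": [
        {
            "id": "NScxODE3NzEyMwo=",
            "url": "https://anon.erda.au.dk/share_redirect/DSWDMAO70L/ias/denmark/DK1/2023_07_05/20230705000135-00-07.jpg",
            "eventId": "1234",
        }
    ],
    "pipelineConfig": {
        "stages": [
            {"stage": "OBJECT_DETECTION", "stageImplementation": "flatbug"},
            {"stage": "CLASSIFICATION", "stageImplementation": "mcc24"},
        ]
    },
    "callback": {
        "callbackUrl": "http://127.0.0.1:8080/example/callback",
        "callbackToken": "1234",
    },
}

# NOTE: the path below is a guess derived from the Swagger operation name;
# verify it against the Swagger UI before use.
resp = requests.post(
    "https://preview.ami.ecoscience.dk/api/pipeline/requests",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```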

The callback can be inspected locally using ngrok:

  • run ngrok http 2222
  • update the callbackUrl in the sample request to the generated ngrok URL
  • open the ngrok web interface at http://localhost:4040
  • trigger the request to the ML API controller
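
A minimal sketch of a local callback receiver to put behind ngrok, assuming the controller POSTs results and sends the callbackToken in a header (the exact header name is a guess; inspect a real callback at http://localhost:4040 to confirm). Port 2222 matches the ngrok command above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the callbackToken arrives in the Authorization header;
        # check the ngrok inspector to see what the controller actually sends.
        token = self.headers.get("Authorization", "")
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        print("callback received, token:", token)
        print(body.decode("utf-8", errors="replace"))
        self.send_response(200)
        self.end_headers()

# Listen on the port that `ngrok http 2222` forwards to.
HTTPServer(("127.0.0.1", 2222), CallbackHandler).serve_forever()
```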

An image that causes a truncated request:
https://static.dev.insectai.org/ami-trapdata/Panama/E43B615A/20231113004009-snapshot.jpg

kaviecos commented Aug 15, 2024

There are a few refactorings I'm considering in the PipelineController.

  1. Version on Detections in the callback is not set. The reason is that it's not obvious which value to set it to: the bounding box is produced by flatbug, each classification by a different classifier, and CNN features and crops might come from other stages.
  2. Stage in the PipelineConfig is currently unused by the controller. I originally included it because I thought different stages might be handled differently, but I think the current solution is cleaner: every stage uses the same interface. One problem with handling stages differently is that a single stage implementation may perform multiple operations on the image.
  3. ImageCrop - Originally I thought of the pipeline as ObjectDetection -> ImageCrop -> Classification. But I realized that we don't need to crop in order to classify; in fact, classification performs a lot better when the entire source image is used together with the bounding boxes. So maybe cropping is a post-pipeline step? It could also be added as a stage that sets the cropUrl on the detection (see the sketch below).
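
If cropping does become a post-pipeline step, a rough sketch of what it could look like with Pillow. The bounding-box format `[x1, y1, x2, y2]` in pixel coordinates is an assumption; flatbug's actual output format may differ.

```python
from io import BytesIO

import requests
from PIL import Image

def crop_detection(source_image_url: str, bbox: list[int]) -> Image.Image:
    """Download a source image and crop out a single detection.

    bbox is assumed to be [x1, y1, x2, y2] in pixel coordinates;
    adapt to whatever the detection stage actually returns.
    """
    resp = requests.get(source_image_url, timeout=30)
    resp.raise_for_status()
    image = Image.open(BytesIO(resp.content))
    return image.crop(tuple(bbox))

# The resulting crop could then be uploaded to storage and its location
# written to the detection's cropUrl field, as suggested above.
```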

mihow commented Aug 20, 2024

@kaviecos Have you considered adding a status endpoint on the controller? If a callback is missed, or a job is taking a long time, it would be nice for the client to be able to request the status of a request: PROCESSING, FAILED, WAITING, 2/6 complete, etc.
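
As a sketch of the shape such an endpoint could take - the route, field names, and framework choice are all hypothetical here, not part of the current controller:

```python
from fastapi import FastAPI

app = FastAPI()

# In a real controller this would be backed by the job/queue store;
# the in-memory dict is just for illustration.
JOBS = {
    "ea12ac70-288c-11ef-9ca5-00155d926c42": {
        "status": "PROCESSING",  # PROCESSING | WAITING | FAILED | COMPLETE
        "stagesCompleted": 2,
        "stagesTotal": 6,
    }
}

@app.get("/pipeline/requests/{job_id}/status")
def get_status(job_id: str) -> dict:
    return JOBS.get(job_id, {"status": "UNKNOWN"})
```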

Also, will you document the types of failures and what they will look like in a callback?

I just added these as subtasks as well.

mihow commented Aug 20, 2024

NOTES from call with @mihow and @kaviecos, 2024-08-20

  • CNN features are stored in a generic "URI" field (which could reference remote storage, an s3:// location, a vector database, or a file)
  • Add tests for a dummy detector that returns many detections, and one that returns no detections - in the controller repo, but also a good idea in the AMI Platform with mock responses.
  • Consider testing how many 7 MB requests the controller can accept and send back to the callback (load test)
  • Michael: consider attempting a dev setup in one compose file for local dev & testing of the AMI Platform (single-stage-everything & multi-stage setups)
  • If fetching the same image multiple times is a problem, our current recommendation is to put a proxy cache in front of the local network where the stages are running.
  • Consider a cancellation method
  • Change the single source image to an array of images, even though we are using one
  • Add "stageParams": {} to the controller request, to be passed via URL to the stage implementation requests.
  • Update the registry to support a "*" projectID that any project can use
  • Think about an endpoint for checking which stages are registered. Use the projectID to determine whether it's a private or shared stage implementation.
  • Think about a way to see whether a stage implementation is online or when it was last seen. Add some sort of basic healthcheck endpoint (see the Kubernetes /livez & /readyz pattern, sketched below)? Or can we check when a message was last taken from the queue by a stage?
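
A minimal sketch of the last bullet, combining the Kubernetes-style probes with a last_seen timestamp updated on queue consumption. Everything here (framework, routes, the 5-minute threshold) is a placeholder for discussion, not an agreed design.

```python
import time

from fastapi import FastAPI, Response

app = FastAPI()
LAST_SEEN: dict[str, float] = {}  # stage implementation id -> unix timestamp

def record_heartbeat(stage_id: str) -> None:
    """Call this whenever a stage takes a message from the queue."""
    LAST_SEEN[stage_id] = time.time()

@app.get("/livez")
def livez() -> Response:
    # Kubernetes-style liveness probe: the process is up.
    return Response(status_code=200)

@app.get("/readyz")
def readyz(stage_id: str) -> Response:
    # "Ready" here means the stage has consumed a message recently;
    # the 5-minute threshold is an arbitrary placeholder.
    last = LAST_SEEN.get(stage_id, 0.0)
    return Response(status_code=200 if time.time() - last < 300 else 503)
```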

Docs for OpenAI's batch API, for reference:
https://platform.openai.com/docs/guides/batch/getting-started

mihow commented Aug 20, 2024

@kaviecos Let's also keep in mind the scenario where we have stage implementations running behind a firewall that cannot have a publicly accessible endpoint. Our major providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like we are using now. For a big job like Biodiversa+'s data, we can request a bunch of GPUs and provide the Docker container to run via the SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results, but it won't have a publicly accessible endpoint (unless we can use a tunnel?)

@kaviecos

@mihow The current implementation actually allows for consumers hosted on other servers. The requirements are that they can make an outbound connection to the RabbitMQ server (port 5672); all communication with the controller can then go through RabbitMQ. Of course, they also need to be able to access the source images.

When it comes to monitoring, this also means we need to monitor both the consumers and the stage implementations. And if we cannot make inbound HTTP requests, we need to factor that into the monitoring solution - maybe a push-based approach (last_seen, as you mentioned). It would be nice to know more about the restrictions - for example, is it even possible to connect to RabbitMQ?
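
To make the firewall-friendly setup concrete, a sketch of a stage consumer that only ever makes outbound connections to RabbitMQ. The host, queue names, and message format are placeholders; only the port 5672 detail comes from the comment above.

```python
import json

import pika  # RabbitMQ client: pip install pika

# Outbound-only connection from behind the firewall (e.g. inside a SLURM job).
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.org", port=5672)
)
channel = connection.channel()

def on_message(ch, method, properties, body):
    task = json.loads(body)
    # ... run the stage (e.g. object detection) on task["sourceImages"] ...
    result = {"jobId": task.get("jobId"), "detections": []}  # placeholder
    # Publish results back over the same outbound connection,
    # so no inbound endpoint is ever needed.
    ch.basic_publish(exchange="", routing_key="stage.results",
                     body=json.dumps(result))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="stage.object_detection",
                      on_message_callback=on_message)
channel.start_consuming()
```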
