Develop Django command for embedding batch work #4559

Open
albertisfu opened this issue Oct 11, 2024 · 0 comments

albertisfu (Contributor) commented Oct 11, 2024

This task consists of implementing the Django command to perform the initial batch embedding work. It basically has two parts:

  • A command that pulls up Opinion texts from the database and creates a batch of opinion texts to send to the microservice.

This is related to your comment:

I'd guess this is less about how many opinions to do at once and more about how long those opinions are.

I'm thinking the command could simply iterate over all the Opinion pks in the DB, so we only need to send a chunk of pks to the Celery task that will do the work:

  • Retrieve the opinion texts for the pks passed.
  • Create the request body for these opinion texts.
  • Send the request to the microservice.
  • Wait for the microservice response.
  • Store the embeddings in S3.

That way, the task would only need to hold opinion pks (see the sketch below). However, the disadvantage is that some requests could be just a few KB while others could be many MB.
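
Here's a rough sketch of what that pk-chunking approach could look like. The `create_opinion_embeddings` task, the `Opinion` model import path, the microservice endpoint, and the S3 helper are illustrative assumptions, not the actual code:

```python
# A minimal sketch of the pk-chunking approach, assuming a hypothetical
# create_opinion_embeddings Celery task, an assumed Opinion model location,
# and placeholder microservice/S3 helpers.
import requests
from celery import shared_task
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Queue chunks of Opinion pks for embedding generation."

    def add_arguments(self, parser):
        parser.add_argument("--chunk-size", type=int, default=100)

    def handle(self, *args, **options):
        from cl.search.models import Opinion  # assumed model location

        chunk: list[int] = []
        # The command only iterates over pks, so it never holds opinion text.
        pks = Opinion.objects.order_by("pk").values_list("pk", flat=True).iterator()
        for pk in pks:
            chunk.append(pk)
            if len(chunk) >= options["chunk_size"]:
                create_opinion_embeddings.delay(chunk)
                chunk = []
        if chunk:
            create_opinion_embeddings.delay(chunk)


@shared_task
def create_opinion_embeddings(opinion_pks: list[int]) -> None:
    """Fetch texts for the pks, request embeddings, and store them in S3."""
    from cl.search.models import Opinion  # assumed model location

    opinions = Opinion.objects.filter(pk__in=opinion_pks).values("pk", "plain_text")
    body = [{"id": o["pk"], "text": o["plain_text"]} for o in opinions]
    # Placeholder endpoint; the real microservice URL and payload shape may differ.
    response = requests.post("http://embedding-service/embed", json=body, timeout=600)
    response.raise_for_status()
    for item in response.json():
        store_embedding_in_s3(item)  # hypothetical S3 helper


def store_embedding_in_s3(item: dict) -> None:
    """Hypothetical helper that would persist one embedding payload to S3."""
    ...
```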

  • The alternative is to retrieve the opinion texts within the command and set a batch size threshold that we determine performs best for generating embeddings in the microservice, say 1 MB or 10 MB, whatever we determine. We extract opinion texts and, when we're close to this limit, pass the batch of texts to the task (see the second sketch after this list).
    • The disadvantage is that tasks would have to hold opinion texts instead of retrieving them within the task.
    • But this is probably OK since we don't expect to have many tasks in the queue, considering we'd set up a number of Celery workers equivalent to the number of microservice instances. If we use queue throttling, the queue should stay small all the time.
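
And a rough sketch of this size-threshold variant. The `MAX_BATCH_BYTES` limit and the `create_embeddings_for_texts` task are illustrative assumptions; the real threshold would have to be tuned against the microservice:

```python
# A minimal sketch of the size-threshold approach, with an assumed batch
# limit and a hypothetical create_embeddings_for_texts task.
from celery import shared_task
from django.core.management.base import BaseCommand

MAX_BATCH_BYTES = 1_000_000  # ~1 MB; the actual threshold should be tuned empirically


@shared_task
def create_embeddings_for_texts(batch: list[dict]) -> None:
    """Hypothetical task: send a pre-built batch of texts to the microservice
    and store the returned embeddings in S3."""
    ...


class Command(BaseCommand):
    help = "Batch opinion texts up to a size threshold and queue them for embedding."

    def handle(self, *args, **options):
        from cl.search.models import Opinion  # assumed model location

        batch: list[dict] = []
        batch_bytes = 0
        opinions = Opinion.objects.order_by("pk").values("pk", "plain_text").iterator()
        for opinion in opinions:
            text = opinion["plain_text"] or ""
            text_bytes = len(text.encode("utf-8"))
            # Flush the current batch before it would exceed the threshold.
            if batch and batch_bytes + text_bytes > MAX_BATCH_BYTES:
                create_embeddings_for_texts.delay(batch)
                batch, batch_bytes = [], 0
            batch.append({"id": opinion["pk"], "text": text})
            batch_bytes += text_bytes
        if batch:
            create_embeddings_for_texts.delay(batch)
```

The throttling mentioned above could presumably be approximated with a Celery `rate_limit` on the task or simply by matching worker concurrency to the number of microservice instances, but that's a deployment detail to confirm.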

The output of this issue would be:

  • A PR that includes the Django command to request opinion texts in batches.

Some questions:

@legaltextai, in terms of performance and efficient use of resources on our GPU/CPU machines used for embedding work, would it affect performance to have requests with varying amounts of text to embed? For example, sometimes we send 1MB for embedding, and other times 100KB. Is it more efficient to always request a fixed amount of text for embedding?

If so, is it possible to determine the ideal size of text to request in a single embedding request?

