
Discussion on the microservice architecture for generating Semantic Search embeddings #4530

Open
albertisfu opened this issue Oct 3, 2024 · 24 comments


@albertisfu
Contributor

@mlissner @legaltextai As we agreed here, we can discuss the architecture of the microservice that will generate the embeddings required for semantic search.

From my understanding, we'd require two services for processing embeddings:
Synchronous: to generate query embeddings so they can be used at search time.
Asynchronous: to generate opinion text embeddings.

Correct me if I'm wrong, but based on my reading about embedding generation, GPUs are fast and efficient for processing large batches of embeddings because they can leverage SIMD processing. However, processing small batches or a single embedding at a time may not be an efficient use of GPU resources.

If that’s correct, we need to determine the ideal batch size to efficiently utilize GPU resources. Some benchmarking may be required for this.

Here are some numbers for the model we're going to use:
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
[Screenshot: pretrained models speed table from sbert.net]

According to this, the throughput on a V100 GPU is 4,000 queries per second, while on a CPU, it's 170 queries per second.

I assume the GPU throughput refers to batch processing, while the CPU throughput refers to single-thread processing. If that's the case, then unless we expect a high enough volume of search queries to accumulate thousands of them within whatever batching window we choose, using a GPU for real-time search could be inefficient.

Therefore, an alternative could be to run the synchronous embedding service on CPUs instead of GPUs. This may be fast enough for generating query embeddings by utilizing multi-core CPU instances.

The general architecture for this service would look like this:

[Diagram: sync_embedding]

One advantage is that this service can be scaled horizontally.
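
For illustration, a minimal sketch of the query-embedding path on CPU, assuming the sentence-transformers library; the model name is only a placeholder, not a final choice:

from sentence_transformers import SentenceTransformer

# Load the model once at service startup so each request only pays for encoding.
model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # placeholder model

def embed_query(query: str) -> list[float]:
    # A single short query is encoded directly; no batching needed here.
    return model.encode(query).tolist()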

Asynchronous service: to generate opinion text embeddings.

For processing opinion text embeddings, we can take advantage of GPU batch processing. We'll still need to determine the ideal batch size, but considering the large volume, holding the opinion texts in memory and sending them over the network to the async embedding generation service might be inefficient.

An alternative approach could be to have a Django command that extracts texts needing embedding generation from the database and stores them in a batch JSON file on S3. Initially, we can retrieve all opinion texts for the first generation; then, based on how frequently we want to keep the embeddings in sync with the database, we can use the date_created field to extract only new opinion texts. We can also use pg-history tables to identify opinions whose text has changed and regenerate their embeddings.

Then, the batch file IDs can be sent to the service, which will store them in a Redis queue for processing control. When the Celery queue is small enough, it can pull out a batch file ID, download the texts from S3, split them into chunks, hold them in memory, and send them in batches to the GPU for embedding generation. Afterward, the embeddings can be stored in a separate S3 bucket, where they can be retrieved by another command for indexing into Elasticsearch.

The architecture for this service would look like this:

[Diagram: async_embedding]

It'll use a Celery task or chain of tasks (a rough sketch follows the list) to:

  • Load the texts from S3.
  • Split texts into chunks.
  • Request embeddings for chunks in batches.
  • Store the embeddings into S3.
  • Remove the processed batch file ID from the queue.
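
A rough sketch of that task, where the S3/Redis helpers and the loaded model are hypothetical placeholders rather than existing code:

from celery import shared_task

@shared_task
def embed_opinion_batch(batch_file_id: str) -> None:
    # 1. Load the batch of opinion texts from S3 (hypothetical helper).
    opinions = load_batch_from_s3(batch_file_id)
    # 2. Split each opinion text into chunks (hypothetical helper).
    chunked = {o["opinion_id"]: split_into_chunks(o["text"]) for o in opinions}
    # 3. Request embeddings for the chunks in batches on the GPU.
    embeddings = {
        opinion_id: model.encode(chunks, batch_size=50).tolist()
        for opinion_id, chunks in chunked.items()
    }
    # 4. Store the embeddings into S3 (hypothetical helper).
    store_embeddings_in_s3(batch_file_id, embeddings)
    # 5. Remove the processed batch file ID from the Redis queue.
    redis_client.lrem("embedding_batches", 1, batch_file_id)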

I saw that GPU EC2 instances can have multiple GPUs. The idea is to have an equal number of Celery workers, each processing batches in parallel, according to the number of available GPUs.

  • Web framework: We could use either FastAPI or Django. Both support Celery. FastAPI is lighter than Django, which could be an advantage.
    • However, using Django may be quicker in terms of Dockerization and the Celery/Redis environment setup, as we already have more experience with it.
  • Dockerization: We can have the sync and async embedding microservices in a single repository and Docker service for development. However, if we decide to use different types of instances (CPU/GPU), we would need to deploy them with different node affinities and be able to scale them independently. I'm not sure if this would require two separate Docker services in Kubernetes, or if this is achievable within a single service.

Some additional questions from my side:

  • Do we have an idea of how often we want to keep opinion text embeddings updated from the database? Is it important to have them in sync as quickly as possible, or can we afford a delay of hours, days, or weeks? This will help us plan a strategy for the initial embedding generation and updates, and determine if we need GPU instances running all the time.
  • Are we considering embedding versioning on S3? Should we preserve old versions for some time before deletion, or only keep the latest version of each embedding?
  • Do we have an idea of what the ideal batch size would be to get the most out of GPU resources?
  • Do you know if running this model on CPUs would have any drawbacks compared to GPUs, besides speed? Benchmarking will be needed to confirm that it’s fast enough for real-time query embedding.

Let me know your thoughts, and if you have any additional questions or suggestions.

@mlissner
Member

mlissner commented Oct 3, 2024

Interesting research, thanks Alberto. I didn't realize that we can use CPUs for the queries. That's great news.

Do we have an idea of how often we want to keep opinion text embeddings updated from the database?

As quickly as possible. I was hoping to do it in real time, like we do with keyword search.

Are we considering embedding versioning on S3? Should we preserve old versions for some time before deletion, or only keep the latest version of each embedding?

I think we can throw away old embeddings, BUT if we switch embedding models we should probably keep the embeddings for prior models. I think this just means we put the model name into the S3 path.

Do we have an idea of what the ideal batch size would be to get the most out of GPU resources?

I certainly do not, but I'd love to see the APIs in sentence transformers to see how they work. How do batches work in this world? Is it just that we send lots of things very quickly to the GPU, or do you need to send them in some sort of batch?

My understanding of how this works is pretty different than yours, Alberto, but I'm just going on intuition, so I'm probably wrong. But my assumption was that GPUs are fast because they can process individual requests in parallel, not because they process multiple requests in parallel. Interesting.

Do you know if running this model on CPUs would have any drawbacks compared to GPUs, besides speed?

I don't, no.

Web framework...

Yeah, I usually have a strong preference for Django, but FastAPI has gotten very popular very fast, and it is a leaner, meaner tool for this kind of thing. I'm open to it, but I'm half-convinced that we'd be better off doing what we know.


Generally, I think your architecture looks good, but I think my hope would be to do it all synchronously:

  • Could we have a microservice that uses a CPU sometimes and a GPU other times?

    Right now, we'd start it on a machine with a GPU, and we'd use that to do all our batch work efficiently. Once that's done, we'd move the pod to a machine that uses CPUs instead. If we're lucky a couple CPU pods can run the models just fine on a day-to-day basis.

  • Could we do away with storing the text on S3? I'm not sure I understand what we gain by doing that.

  • Can we design the celery tasks so that sending the embeddings to Elastic is optional?

    My thought is that we can have one task. Right now, it just saves the embeddings to S3, and we have a separate django command to pull the embeddings and put them in Elastic. Later, once the batch work is done, the celery tasks save to S3 and push to Elastic.

@legaltextai
Contributor

legaltextai commented Oct 3, 2024

Thanks Alberto. What I did in the past was split the decisions into 350-word chunks and send the chunks for embedding in batches of 50. I can share that code if you need it. That worked just fine on my Tesla V100 GPU (16 GB). From my experience, embedding with a CPU is much slower, but we can run an experiment on our instance. Please let me know if you need anything from me.

@mlissner
Member

mlissner commented Oct 3, 2024

Can you shed any light on how GPUs perform batching, @legaltextai?

@legaltextai
Contributor

legaltextai commented Oct 3, 2024

As in code-wise, speed-wise, or processor-wise? I think the batching, as in chunking texts + combining + sending to the API, is done by the CPU, but I'm not super knowledgeable about the default division of work between CPU and GPU.

@mlissner
Member

mlissner commented Oct 3, 2024

I'm trying to understand how the GPU performs. Alberto said that it needs batches to use SIMD, so I'm trying to understand how that works.

@albertisfu
Contributor Author

As quickly as possible. I was hoping to do it in real time, like we do with keyword search.

Got it. In that case, I think we’ll need to generate a large initial embedding and index it into Elasticsearch, which will take some time (this can be done via a command as described above). After that, we can handle new opinions and updates by triggering them through signals, just as we do for regular indexing.

I think we can throw away old embeddings, BUT if we switch embedding models we should probably keep the embeddings for prior models. I think this just means we put the model name into the S3 path.

Sounds good. We can just overwrite embeddings on updates.

I certainly do not, but I'd love to see the APIs in sentence transformers to see how they work. How do batches work in this world? Is it just that we send lots of things very quickly to the GPU, or do you need to send them in some sort of batch?

Well, in terms of the Sentence Transformer library, the encode method has a batch_size parameter that defaults to 32:

embeddings = model.encode(chunked_texts, batch_size=32)

You send a list of chunks, and the encode method will take care of sending them to the GPU in batches of 32 chunks. As I understand it, if the chunked_texts list is longer than the batch_size, batches for all the chunks will be processed sequentially.

My understanding of how this works is pretty different than yours, Alberto, but I'm just going on intuition, so I'm probably wrong. But my assumption was that GPUs are fast because they can process individual requests in parallel, not because they process multiple requests in parallel. Interesting.

Yeah, I think that's right. A single operation can be parallelized, allowing it to finish faster. However, it depends on the size of the task you send. This blog post is interesting and includes some benchmarks.

It mentions using a book of 1,000 pages split into chunks of 900 characters, resulting in around 5,000 chunks.

Then, it assesses the performance and memory usage across various models:

[Chart: embedding time and memory usage vs. batch size across models]

They concluded that, depending on the model size, there is an ideal batch size where performance peaks and memory usage is balanced.

Processing batches of size 1 takes the most time to complete the whole task because GPU resources are underutilized. For this task and these models, the optimal batch size is around 5; increasing the batch size further not only fails to boost performance but sometimes makes it worse, while memory usage grows.

So selecting a batch size is crucial, and it depends on the model and hardware used. Sergei mentioned that chunks of 350 words and batches of 50 worked well on his hardware, so that can be a starting point, and we can run additional benchmarks to adapt the batch size to the type of compute instances we're going to use.
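
A minimal benchmarking sketch along those lines, assuming the model is already loaded as model and chunks is a list of ~350-word text chunks:

import time

for batch_size in (1, 8, 16, 32, 50, 64, 128):
    start = time.perf_counter()
    model.encode(chunks, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    # Throughput in chunks per second for this batch size.
    print(f"batch_size={batch_size}: {elapsed:.2f}s ({len(chunks) / elapsed:.1f} chunks/s)")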

Generally, I think your architecture looks good, but I think my hope would be to do it all synchronously:

Do you mean that we may avoid using Celery if we can?
Does that also include the initial batching work?

I think using Celery for processing the initial embeddings makes sense only if we plan to use an EC2 instance with multiple GPUs, each running the model. In that way, workers can divide the work across the different available GPUs reliably.

Could we have a microservice that uses a CPU sometimes and a GPU other times?

Right now, we'd start it on a machine with a GPU, and we'd use that to do all our batch work efficiently. Once that's done, we'd move the pod to a machine that uses CPUs instead. If we're lucky a couple CPU pods can run the models just fine on a day-to-day basis.

Yeah, I think that's possible. We'd just need to get the right settings to deploy the model on either a GPU or a CPU. Also, the pattern for processing work might change: batches for GPU processing and concurrent requests for the CPU.

I think the microservice can have two endpoints: one for batch processing on the GPU and one for single requests on the CPU. This setup can work well for embedding search queries and opinion texts, but some benchmarking would be required to assess whether the CPU would be fast enough for processing large texts in a reasonable time.
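
For the device settings, a sketch of what selecting the GPU or CPU at startup could look like, assuming PyTorch and sentence-transformers (the model name is a placeholder):

import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when the pod has one allocated; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-mpnet-base-v2", device=device)  # placeholder model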

Could we do away with storing the text on S3? I'm not sure I understand what we gain by doing that.

Yeah, I was thinking of storing texts on S3 because I initially assumed the ideal batch size for efficient GPU use would be large, which could mean a single batch of texts amounting to gigabytes of data and therefore a huge HTTP request. However, now that it seems the ideal batch size can be below 50, I don't see a problem with sending texts via HTTP directly to the microservice.

Can we design the celery tasks so that sending the embeddings to Elastic is optional?

My thought is that we can have one task. Right now, it just saves the embeddings to S3, and we have a separate django command to pull the embeddings and put them in Elastic. Later, once the batch work is done, the celery tasks save to S3 and push to Elastic.

Yeah, sure. This makes sense to me. For batch processing, we use a Django command to index all the initial embeddings into ES, and then for new embeddings and updates, the same service can take care of indexing them into elasticsearch as they are generated.

@legaltextai
Contributor

I agree with Alberto. Let's run the test on CPU first for queries. Will that instance always be on, with the model loaded into memory?

@albertisfu
Contributor Author

Will that instance always be on, with the model loaded into memory?

Yes, the idea is that if the CPU instance with the model loaded into memory is fast enough to embed queries at search time, that instance will always be on. If it is also fast enough to embed opinion texts, we can use it to generate embeddings on a daily basis and no longer need the GPU instance after the initial batch work is completed.

Would you help us run the benchmarks on a CPU?

@legaltextai
Contributor

legaltextai commented Oct 3, 2024

Of course. Do you want me to run the test on my server (with CPU/GPU), or will I need access to an AWS instance?

@albertisfu
Contributor Author

Unless Mike has a different opinion, I think it's okay to run the benchmark on your server, considering you can load the model into memory and execute the computations on the CPU. I believe that will give us an idea of how it performs on the CPU. If you share the resources you used on your server, we can then select something similar on EC2.

The idea behind this test is to measure the throughput per second for query embedding generation of the model on the CPU. We can consider an average query size. Currently, we don't have a defined average size for queries since we are not logging them yet. However, a couple of hundred characters might be a good starting point. I assume the query size will fit in a single chunk with a batch_size of 1.

It would also be great if you could measure the model's throughput on the CPU using a large opinion text, testing with the chunk size you used on the GPU and experimenting with different batch sizes. However, I'm not sure the batch_size parameter matters much for CPU testing; given how the CPU handles embedding computations, varying it might not be significant.
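
Something like this rough timing sketch could work for the query test; the model name and sample query are placeholders:

import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # placeholder model
query = "standard of review for summary judgment in employment discrimination cases"

start = time.perf_counter()
for _ in range(100):
    model.encode(query, batch_size=1)
elapsed = time.perf_counter() - start
print(f"avg per query: {elapsed / 100:.4f}s")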

@legaltextai
Contributor

legaltextai commented Oct 6, 2024

Here are my results for CPU vs GPU:

Task         CPU (total / avg)     GPU (total / avg)
Queries      0.0772s / 0.0077s     0.0223s / 0.0022s
Paragraphs   0.6064s / 0.0606s     0.0458s / 0.0046s

Here are the specs for my CPU and GPU:

Processor: x86_64
Physical cores: 28
Total cores: 56
CPU Usage: 8.7%
Max Frequency: 3600.00Mhz
Current Frequency: 1362.89Mhz

RAM Information:
Total: 125.76 GB
Available: 27.67 GB
Used: 48.13 GB
Percentage: 78.0%

GPU Information:
GPU Available: Yes
Number of GPUs: 1
GPU 0: Tesla V100-PCIE-16GB
GPU 0 Memory: 15.77 GB

Here is the notebook, if you would like to replicate it on your instance.

@mlissner
Member

mlissner commented Oct 7, 2024

Generally, I think your architecture looks good, but I think my hope would be to do it all synchronously:

Do you mean that we may avoid using Celery if we can?

Oh, no, I just mean that we should take out the step of putting the text on S3 before processing it. I think Celery is probably a good tool for parallelizing things.

Does that also include the initial batching work?

I imagine Celery would be a good tool for this like usual, but the goal is to pull objects from the DB, send them to the microservice, and keep it saturated. The simplest way to do that is the goal, I think.

I think the microservice can have two endpoints: one for batch processing on the GPU and one for single requests on the CPU.

My hope was that we can just use the CPU after we've done the initial embeddings, so I'm hoping that one endpoint will work. It could take a list of chunked_texts, and return a list of embedding objects. If it has a GPU available, it uses that. If not, then it uses the CPU.

Do developers need to choose the CPU or GPU when making their call to the microservice?

Yeah, sure. This makes sense to me. For batch processing, we use a Django command to index all the initial embeddings into ES, and then for new embeddings and updates, the same service can take care of indexing them into elasticsearch as they are generated.

👍🏻

Yes, the idea is that if the CPU instance with the model loaded into memory is fast enough to embed queries at search time, that instance will always be on. If it is also fast enough to embed opinion texts, we can use it to generate embeddings on a daily basis and no longer need the GPU instance after the initial batch work is completed.

Exactly. 🎯

Here are my results for CPU vs GPU:

These are great, thanks! I think my takeaway is that we can definitely do queries on a CPU in real time, no problem. (Average of 0.007s is great!)

The per paragraph speed seems to be:

  • 0.06s on CPU, and
  • 0.005s on GPU.

So if a doc has 100 paragraphs, that's six seconds on the CPU and half a second on the GPU. I think that's fine for ongoing updates, and that we'll want the GPU for the initial indexing (no surprise).

What do you guys think?

@legaltextai
Contributor

I am totally OK with GPU for embedding texts in batches and CPU for queries,
as long as we have a CPU with specs similar to mine. If not, we'll need to run the tests again.
I presume the ES search phase should take the same time regardless of how the embeddings were generated.

@mlissner
Member

mlissner commented Oct 7, 2024

Our CPUs should be fine, yep! Great.

@mlissner
Member

mlissner commented Oct 7, 2024

Alberto, I think this means we've got our architecture in good shape? What else is on your mind?

@albertisfu
Contributor Author

Great, the CPU looks quite promising. I'll refine the architecture diagrams according to your latest comments and come back so we can agree on them and start discussing a plan for implementation.

@albertisfu
Contributor Author

Based on your comments and suggestions, we can have an embedding microservice that works synchronously, performing only two tasks: splitting texts into chunks and generating the embeddings.

The goal is for the sentence-transformer model to run on either a GPU or a CPU. The GPU version can be used for the initial text embedding batch work, while the CPU version can handle daily work embedding queries and texts (if it can keep up with our workload).

The microservice can accept either one text to embed (for queries) or multiple texts for embedding opinions, in order to reduce the number of HTTP requests during batch work.
The response will be either a single embedding (for a query) or multiple opinion text chunks and their corresponding embeddings.

[Diagram: microservice-detail]

The general idea is that this microservice can be scaled horizontally according to available resources. However, I have some comments/questions to ensure the microservice scales properly and operates without bottlenecks:

  • GPU instances on EC2 can have from 1 to 8 GPUs. If we select an instance with only one GPU, it can run a single microservice pod. If we select an EC2 instance with 2 or more GPUs, what would be the best approach to use them? Let's say we have an instance with 2 GPUs. If I'm not mistaken regarding resource allocation in K8s, the simplest configuration would be to allocate two microservice pods to the instance, so each pod uses one GPU. But I think this decision will depend on the EC2 instance characteristics.
  • The alternative is to use the 2 GPUs in a single microservice instance. However, this will require a mechanism to parallelize GPU operations. I'm not sure if the sentence-transformer or PyTorch library can do that. Reading about it in this issue: Sentence Transformers: encode() hangs forever in uwsgi worker UKPLab/sentence-transformers#1318
    It seems that if there is more than one worker, each of them will load the model into memory and try to perform requests in parallel, which can lock the workers.
    If using PyTorch, they recommend setting OMP_NUM_THREADS=1, which will fix the locking issue. But this configuration probably won't take full advantage of multiple GPUs in the pod. This would require some testing.

Given this, probably the simplest solution to start with is to set 1 uvicorn worker per microservice instance with one GPU allocated and let the microservice scale horizontally while using a load balancer in front of instances.

For the CPU embedding processing, similar questions arise. It seems that we'll have the same issue of each worker trying to load the model into memory. If so, we can apply the same configuration described above of 1 worker per microservice and scale it horizontally. This will require allocating the right amount of RAM to load the model and do the processing work. We'd also need to decide the number of vCPUs to assign to the pod.

The alternative seems to be doing something to make the model shared in memory so it's available for all workers, and multiple CPUs can do embedding work using the same model in memory.

Initial batch work

For the initial batch work, the embedding generation/indexing architecture will look like this:

[Diagram: batch-embedding-work]

We'll have a Django command that pulls Opinion texts from the database within a Celery task. It can retrieve multiple Opinion texts at once, allowing us to request embeddings for many Opinion texts in a single request, thus saving on HTTP requests. We'll need to determine an optimal number of texts per request.

Considering our microservice instances will have a single uvicorn worker, we'll need to ensure we don't send them more tasks than they can handle. We can solve this by setting up an equivalent number of Celery workers and using throttling based on the queue size.

The microservice will return a JSON response containing the opinion_id, text chunks, and the chunk embeddings for each Opinion. This JSON will be stored in S3 by the same Celery task or a different one.

We can then have a separate Django command that pulls the Opinion chunks and embeddings from S3 and indexes them into ES. Having this in a separate command and task will help us take advantage of ES bulk updates, allowing us to index many opinion embeddings in a single request. This can involve a different number of opinions than requested for embeddings, according to ES load, so it can be throttled at a different rate.
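
A rough sketch of the ES side of that command, assuming the elasticsearch-py bulk helpers; the index name, S3 helper, and document shape are placeholders, not the final mapping:

from elasticsearch.helpers import streaming_bulk

def index_embeddings_from_s3(es_client, batch_ids):
    def actions():
        for batch_id in batch_ids:
            # Hypothetical helper that loads one batch file of embeddings from S3.
            for opinion in load_embeddings_from_s3(batch_id):
                yield {
                    "_op_type": "update",
                    "_index": "opinion_index",  # placeholder index name
                    "_id": opinion["opinion_id"],
                    "doc": {"embeddings": opinion["chunks"]},
                }

    # streaming_bulk groups the update actions into ES bulk requests.
    for ok, item in streaming_bulk(es_client, actions(), chunk_size=500):
        if not ok:
            print(f"Failed to index: {item}")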

Day to day work

After the initial work, for day-to-day operations the idea is to integrate Opinion text embedding generation before the ES indexing work on the ES Signal processor. So when there's a new opinion where the text is not empty, or the text field changes, the embedding generation task will be included in the chain. The microservice will return the chunks and embeddings, which will be stored in S3 and indexed into ES in the following chained task.

[Diagram: daily-embedding-work]

For text query embeddings, on every case law search request, the text query will be sent to the microservice for its embedding. The returned embedding will then be used to generate the ES semantic search request.

One thing to note is that we will need to prioritize text query embeddings over opinion text embeddings, as text queries will be used at search time.
To accomplish this, we could have a namespace in Kubernetes where we run Opinion text embedding pods equivalent to the number of Celery workers.
We could then have a different namespace with pods dedicated to text query embeddings, so they're prioritized over Opinion text embedding work, which can take longer.

Let me know what you think.

@mlissner
Member

mlissner commented Oct 9, 2024

probably the simplest solution to start with is to set 1 uvicorn worker per microservice instance with one GPU allocated and let the microservice scale horizontally while using a load balancer in front of instances.

Yes, sounds right to me.

The alternative seems to be doing something to make the model shared in memory so it's available for all workers, and multiple CPUs can do embedding work using the same model in memory.

Memory tends to be the thing we run out of and that we pay more for, so it's probably worth seeing how hard this is. But even if we find a way to have, say, 4 or 8 CPUs configured for a pod while only loading the model once, I don't think there's a way to auto-scale that except by adding another 4 or 8 CPUs at a time, so that doesn't really work unless we're tuning the CPU allocation by hand.

@legaltextai, do you know how much memory this model uses? I don't entirely understand your stats on that above?

We'll need to determine an optimal number of texts per request.

I'd guess this is less about how many opinions to do at once and more about how long those opinions are.

One thing to note is that we will need to prioritize text query embeddings over opinion text embeddings, as text queries will be used at search time.

I think once we're using the CPUs for this, k8s will scale things nicely for us. We just have to maintain enough overhead in our k8s configuration for the deployment such that when opinions are scraped, user queries still have responsive pods — I think!


Overall, I think we've got a plan here though, thanks. Alberto, do you want to write out the steps that we'd want to take for this?

@legaltextai
Contributor

As I understand it, the rough calculation for memory requirements goes something like this: 109M parameters (our model's size) x 8 bytes per parameter (64-bit) x 1.5 x some overhead (20%?) -> ~2.7 GB?

@albertisfu
Contributor Author

albertisfu commented Oct 9, 2024

Memory tends to be the thing we run out of and that we pay more for, so it's probably worth seeing how hard this is. But even if we find a way to have, say, 4 or 8 CPUs configured for a pod while only loading the model once, I don't think there's a way to auto-scale that except by adding another 4 or 8 CPUs at a time, so that doesn't really work unless we're tuning the CPU allocation by hand.

I see. Yeah, scaling a pod with multiple CPUs might not be as efficient as we want. If we figure out how to share the model in memory across workers, maybe we can start with pods with a small number of CPUs? Let's say 3: one for regular work and two for model embedding. At least that would be better than having a pod that can only process one request at a time with the whole model loaded into memory.

Overall, I think we've got a plan here though, thanks. Alberto, do you want to write out the steps that we'd want to take for this?

Of course, I'll describe the steps/parts that we need to build so you can decide how they should be prioritized and assigned. We can also create independent issues for them.

1 Embedding Microservice

This can be divided into 3 tasks.

1.1 Create Docker Skeleton for the Microservice

  • Decide between Django or FastAPI
  • Create the Docker structure based on practices used in other projects; we can use Doctor as a base
  • Determine whether to add this microservice container to our docker-compose
    • Considering: the model size (~2.7 GB, as pointed out by @legaltextai) may be too large for dev environments, so we have some options:
      a. Is it possible to load a lightweight version of the model on development just for testing?
      b. Use a microservice mock for dev/testing purposes

1.2 Add Microservice Embedding Endpoints and embedding model setup

  • Implement API endpoint to:
    a. Split texts into chunks
    b. Request embeddings for the chunks

  • Consider:

    • Add and set up the library and model for embedding processing
    • Ensure model can run on either GPU or CPU based on available resources in the POD
    • Use a single worker to process HTTP requests
    • Validate the request body and handle it according to the request type. I'd suggest the following (a rough endpoint sketch follows at the end of this section):

    Query embeddings request body:

   {
     "type": "single_text",
     "content": "This is a query" 
   }

Opinion lists request:

   {
     "type": "opinions_list",
     "content":[
       {
         "opinion_id": 1,
         "text": "Lorem"
       },
       ...
     ]
   }
  • For the response I'd suggest:

Query embeddings response:

   {
     "type": "single_text",
     "embedding": [12123, 23232, 43545]
   }

Opinion lists response:

   {
     "type": "opinions_list",
     "embeddings": [
       {
         "opinion_id": 1,
         "chunks": [
           {
             "chunk": "Lorem",
             "embedding": [12445, ...]
           }
         ]
       },
       ...
     ]
   }

  • Error Handling
    Review and handle the error types the embedding endpoint can return, using appropriate HTTP status codes so we can differentiate transient errors (e.g., ConnectionError) from bad requests and decide on the client side whether to retry the request. For example:

    • Invalid request body format: 400 Bad Request
    • Unable to process embedding due to content issues: 422 Unprocessable Content
    • Set up reasonable timeout for embedding processing.
  • Authentication
    Determine if the microservice requires authentication. If for internal use only, authentication can be omitted (similar to Doctor)
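
To make 1.2 more concrete, here's a rough sketch of the endpoints, assuming FastAPI is chosen and splitting the two request types into two routes for clarity; the model name, chunking helper, and route paths are placeholders:

from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("all-mpnet-base-v2")  # placeholder model name

class QueryRequest(BaseModel):
    content: str

class Opinion(BaseModel):
    opinion_id: int
    text: str

class OpinionsRequest(BaseModel):
    content: list[Opinion]

@app.post("/embed/query")
def embed_query(request: QueryRequest):
    # Single short text: encode directly and return one embedding.
    return {"type": "single_text", "embedding": model.encode(request.content).tolist()}

@app.post("/embed/opinions")
def embed_opinions(request: OpinionsRequest):
    results = []
    for opinion in request.content:
        chunks = split_into_chunks(opinion.text)  # hypothetical chunking helper
        embeddings = model.encode(chunks, batch_size=50)
        results.append({
            "opinion_id": opinion.opinion_id,
            "chunks": [
                {"chunk": chunk, "embedding": emb.tolist()}
                for chunk, emb in zip(chunks, embeddings)
            ],
        })
    return {"type": "opinions_list", "embeddings": results}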

1.3 Design and Deploy Microservice K8s Cluster/Instances

  • This task consists of creating the required K8s manifest files and deploying them on AWS.
  • Considering:
    • Whether a new cluster is required or if it can be within the same courtlistener cluster
    • EC2 instance types for initial batch processing (GPU) and day-to-day work (CPU)
    • Resource allocation for each microservice pod:
      • Batch work: 1 GPU, RAM for the workload (I understand that on a GPU instance the model is loaded into GPU memory instead of RAM), and a number of vCPUs for API operations
      • Daily work: vCPUs for regular work and model embedding work, and RAM to load the model and handle regular work.
    • Auto-scaling of microservice instances.
    • Setting up a load-balancer accessible by cl-python instances
    • Deploy the cluster or microservice instances

2 Django command for embedding batch work

This task consists of implementing the Django command to perform the initial batch work. It basically consists of two parts:
- A command that pulls Opinion texts from the database and creates a batch of opinion texts to send to the microservice.

This is related to your comment:

I'd guess this is less about how many opinions to do at once and more about how long those opinions are.

I'm thinking that the command could simply iterate over all the Opinion pks in the DB, so we just send a chunk of pks to the Celery task that will do the work (a rough sketch follows below):

  • Request opinion texts from the pks passed.
  • Create the request body for these opinion texts
  • Send the request to the microservice
  • Wait for the microservice response
  • Store embeddings into S3

In that way, the task would need to only hold opinion pks. However, the disadvantage is that some requests can be just a few KBs while others can be many MBs.

  • The alternative is to request the opinion texts within the command and set a batch threshold that we determine performs best for generating embeddings in the microservice, say 1 MB or 10 MB, whatever we determine. So we extract opinion texts, and when we're close to this limit, we pass the batch of texts to the task.
    • The disadvantage is that tasks will need to hold opinion texts instead of retrieving them within the task.
    • But that's probably OK since we don't expect to have many tasks in the queue, considering we'd set up a number of Celery workers equal to the number of microservice instances. If we use queue throttling, the queue should stay small all the time.
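
A rough sketch of the first variant (sending only pks to the task); the model import path, field name, and task name are assumptions:

from cl.search.models import Opinion  # assumed model path

def schedule_embedding_batches(chunk_size: int = 100) -> None:
    # Iterate over all Opinion pks with a non-empty text and dispatch them in
    # chunks to a hypothetical Celery task that does the embedding work.
    buffer = []
    pks = Opinion.objects.exclude(plain_text="").values_list("pk", flat=True)
    for pk in pks.iterator():
        buffer.append(pk)
        if len(buffer) >= chunk_size:
            embed_opinions_task.delay(buffer)  # hypothetical task
            buffer = []
    if buffer:
        embed_opinions_task.delay(buffer)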

3 Django command to load batch work embedding into ES

  • As a complement to the previous command, we'll need a command that pulls embeddings from S3 and indexes them into ES using bulk update operations with a reasonable batch size that ES can handle properly.
  • As part of this task, we can also tweak the Opinion index mapping to support the embeddings storage.
  • Evaluate the ES workload to determine whether the cluster can handle this load alongside regular indexing and the other bulk tasks running against the ES cluster.

4 ES Signal processor tweaks:

This can be done after the previous commands, since it can reuse the Celery task used by the command.

  • This will require tweaking the ES signal processor, specifically the save/update methods.
  • So if an Opinion is saved with a non-empty text field, or it's updated and the text field changed, we should add the Celery task that requests embedding generation for the opinion text to the chain as the first task to execute (see the sketch below). The response of this task should be passed to the regular save/update ES task, which will index the other fields plus the embeddings.
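
A rough sketch of that chain, with hypothetical task and field names:

from celery import chain

def handle_opinion_save(opinion):
    if opinion.plain_text:  # assumed field name
        # Generate embeddings first, then index everything (fields + embeddings).
        chain(
            embed_opinion_text.si(opinion.pk),     # hypothetical task
            index_opinion_document.s(opinion.pk),  # hypothetical task
        ).apply_async()
    else:
        index_opinion_document.delay(opinion.pk)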

5 Text query embedding method:

The last piece is to create a method that can be called within the Case Law semantic search as a previous step to send the query to ES.
It'll be as simple as calling the embedding microservice synchronously and using the response to build the semantic ES query.
This method should be called from both the frontend and the Search API if we're also considering making semantic search available via the API.
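
As a sketch, assuming the draft request format above and a hypothetical internal service URL:

import requests

def get_query_embedding(query: str) -> list[float]:
    # Synchronous call to the embedding microservice at search time.
    response = requests.post(
        "http://embedding-service/embed/query",  # hypothetical internal URL
        json={"type": "single_text", "content": query},
        timeout=1,
    )
    response.raise_for_status()
    return response.json()["embedding"]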

Let me know what you think.

@mlissner
Member

mlissner commented Oct 9, 2024

This sounds great to me. @legaltextai, do you think you can split this off into smaller issues and start tackling each of these with Alberto's help?

@legaltextai
Contributor

I can share the script that will take opinion_id + opinion_text from our Postgres, split decisions into 350-word chunks, embed those, and send them to S3 storage with the opinion_id_number_of_chunk_model_name. I don't have access to our S3, so I'll leave those destination fields blank.
Thinking out loud here: do we really need an API for this task?

@mlissner
Member

mlissner commented Oct 9, 2024

Well, the API for doing query vectorizing will be almost identical to the one doing text vectorization, and we'll need that, so it seems like making it is the right thing to do.

@legaltextai
Contributor

This sounds great to me. @legaltextai, do you think you can split this off into smaller issues and start tackling each of these with Alberto's help?

In terms of splitting into tasks, these are my ideas:

  • create a FastAPI embedding endpoint and get it up and running
  • write a script to: pull opinions from Postgres, split them into chunks, send each chunk for embedding, receive the embeddings back, and send them to S3 as files with opinion_id, text_chunk, embedding
  • create a new index on ES that will include vector fields
  • write a script to: take the content of the S3 files and put them under the same opinion_id in Elasticsearch
