Load EcoPlatform data from S3 to Qdrant #6

Open
aduverger opened this issue Oct 1, 2024 · 0 comments

High-level design:
[Image: high-level design diagram]

Overall logic:
Each time a PDF file is created or modified in the epd-raw-data-prod-eu-west-3 bucket, S3 sends an event to an SNS topic.
An SQS queue subscribed to this topic triggers a Lambda. The Lambda calls an embedding model to create vector representations of the PDF:

  • One BERT-like embedding of the product name
  • One ColBERT embedding of the product name
  • One BERT-like embedding of the product description
  • One ColBERT embedding of the product description

For v0, we will focus only on BERT-like embeddings of the product name and use a pre-trained model from either the Mistral or Voyage AI API.
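
To make the S3 → SNS → SQS → Lambda flow concrete, here is a rough sketch of how the Lambda could unwrap the nested event. It assumes raw message delivery is not enabled on the SNS subscription, so each SQS body is an SNS envelope whose `Message` field holds the S3 event as a JSON string. The names `extract_pdf_objects` and `handler` are placeholders, and the embedding call is left as a stub since the provider is not decided yet.

```python
import json
from urllib.parse import unquote_plus


def extract_pdf_objects(sqs_event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every PDF referenced in an SQS batch."""
    objects = []
    for sqs_record in sqs_event.get("Records", []):
        # SQS body -> SNS envelope (assumes raw message delivery is disabled).
        sns_envelope = json.loads(sqs_record["body"])
        # SNS "Message" -> S3 event notification, serialized as a JSON string.
        s3_event = json.loads(sns_envelope["Message"])
        # S3 test events carry no "Records" key, hence the default.
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (spaces become "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            if key.lower().endswith(".pdf"):
                objects.append((bucket, key))
    return objects


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: list the PDFs to embed (embedding itself is a stub)."""
    pdfs = extract_pdf_objects(event)
    for bucket, key in pdfs:
        # Placeholder: download the PDF, extract the product name,
        # call the embedding API, then write the vector to the DB.
        print(f"Would embed s3://{bucket}/{key}")
    return {"processed": len(pdfs)}
```

Keeping the parsing in its own function makes it easy to unit-test with a hand-built event, without any AWS access.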

The UUID can be used as the primary key (point ID) in the DB.
For v0, we can probably load everything into the same table (a single Qdrant collection).
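
As a sketch of what one record could look like, the snippet below builds a point keyed by the UUID. Qdrant accepts UUIDs as point IDs, and upserting an existing ID overwrites the point, so a modified PDF would replace its previous vector rather than create a duplicate. `EpdPoint`, `build_point`, and the `product_name` payload field are hypothetical names for illustration, and the actual upsert through the Qdrant client is not shown.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class EpdPoint:
    """Hypothetical in-memory shape of one record before it is upserted."""
    id: str
    vector: list[float]
    payload: dict = field(default_factory=dict)


def build_point(epd_uuid: str, name_embedding: list[float], product_name: str) -> EpdPoint:
    """Validate the UUID and bundle it with the product-name embedding."""
    # uuid.UUID raises ValueError on malformed input and normalizes the casing.
    point_id = str(uuid.UUID(epd_uuid))
    return EpdPoint(
        id=point_id,
        vector=list(name_embedding),
        payload={"product_name": product_name},
    )
```

Validating the UUID up front means a malformed key fails in the Lambda (and ends up in the dead-letter queue after retries) instead of reaching the DB.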

Resources to create with CDK:

  • S3 Event notifications (modify existing S3 bucket)
  • SNS Topic
  • SQS + Dead-Letter Queue
  • Python Lambda
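
A possible starting point for the stack, written with CDK v2 for Python. This is only a sketch: the construct IDs, the `lambda` asset directory, the `handler.handler` entry point, the timeouts, and the retry count are all assumptions to adjust. Note that S3 reports an overwrite of an existing object as an "object created" event, so a single `OBJECT_CREATED` notification covers both created and modified files.

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_lambda_event_sources as event_sources
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_s3_notifications as s3n
from aws_cdk import aws_sns as sns
from aws_cdk import aws_sns_subscriptions as subs
from aws_cdk import aws_sqs as sqs
from constructs import Construct


class EpdIngestionStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Existing bucket: imported by name, not created by this stack.
        bucket = s3.Bucket.from_bucket_name(
            self, "EpdRawBucket", "epd-raw-data-prod-eu-west-3"
        )

        # S3 event notifications -> SNS topic, restricted to PDF files.
        topic = sns.Topic(self, "EpdPdfTopic")
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.SnsDestination(topic),
            s3.NotificationKeyFilter(suffix=".pdf"),
        )

        # SQS queue with a dead-letter queue for messages that keep failing.
        dlq = sqs.Queue(self, "EpdPdfDlq", retention_period=Duration.days(14))
        queue = sqs.Queue(
            self,
            "EpdPdfQueue",
            # Should comfortably exceed the Lambda timeout below.
            visibility_timeout=Duration.minutes(6),
            dead_letter_queue=sqs.DeadLetterQueue(max_receive_count=3, queue=dlq),
        )
        topic.add_subscription(subs.SqsSubscription(queue))

        # Python Lambda triggered by the queue (assumed code layout).
        embed_fn = _lambda.Function(
            self,
            "EpdEmbeddingFn",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="handler.handler",
            code=_lambda.Code.from_asset("lambda"),
            timeout=Duration.minutes(1),
        )
        embed_fn.add_event_source(event_sources.SqsEventSource(queue, batch_size=10))
        bucket.grant_read(embed_fn)
```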