How to change distance model when using FastEmbed? #734
Comments
Hi @paluigi Methods like |
Hi @joein, thanks for the reply. If I want to use FastEmbed, what would be the correct way to use the Distance.DOT metric in a collection? From what I can see, in my example above the collection is initialized with the Distance.COSINE metric even though I tried to set another one.
The set_model method does not have a distance parameter.

def set_model(
    self,
    embedding_model_name: str,
    max_length: Optional[int] = None,
    cache_dir: Optional[str] = None,
    threads: Optional[int] = None,
    providers: Optional[Sequence["OnnxProvider"]] = None,
    **kwargs: Any,
):
    # Method body

You can specify a different metric when creating a collection, but as far as I know you cannot specify a default metric for all operations on the client. When creating the collection, the vector name and the vector size have to match the model specs. One way to do this is to create a helper method using pieces from the FastEmbedMixin.

import uuid
from qdrant_client import QdrantClient
from qdrant_client import models
from fastembed import TextEmbedding


def add_points(
    collection_name, documents, ids=None, model_name=None, distance=models.Distance.DOT
):
    # We could also pass in the client as a param
    client = QdrantClient(path="./db/")
    if model_name is not None:
        client.set_model(model_name)

    # Get the vector field name and the vector size for the chosen model.
    # Using the exact name is important because client.query() looks at the
    # {vector_field_name} vector.
    vector_field_name = client.get_vector_field_name()
    vector_params = client.get_fastembed_vector_params()

    # Create the collection if it does not exist.
    if not client.collection_exists(collection_name):
        client.create_collection(
            collection_name=collection_name,
            vectors_config={
                vector_field_name: models.VectorParams(
                    size=vector_params[vector_field_name].size,
                    distance=distance,
                )
            },
        )

    # Load the embedding model from FastEmbed.
    if model_name is not None:
        embedding_model = TextEmbedding(model_name)
    else:
        embedding_model = TextEmbedding()

    # Create a generator for UUIDs, if ids are not passed.
    if ids is None:
        ids = iter(lambda: uuid.uuid4().hex, None)
    else:
        # Accept any iterable of ids (list, tuple, generator), not only a list.
        ids = iter(ids)

    # Upload points
    client.upload_points(
        collection_name=collection_name,
        points=[
            models.PointStruct(
                id=next(ids),
                vector={
                    # Convert the NumPy array to a plain list of floats.
                    vector_field_name: embedding.tolist(),
                },
            )
            # Embed the documents using the embedding_model
            for embedding in embedding_model.embed(documents)
        ],
    )


documents = [
    "This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
    "fastembed is supported by and maintained by Qdrant.",
]
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
add_points(
    collection_name="test_collection",
    documents=documents,
    model_name=model_name,
)

Because we followed the naming conventions, we can use the query method out of the box.

# Assumes `client` is a QdrantClient opened on the same storage and
# `query_text` holds the search string; neither is defined above.
client.set_model(model_name)
search_result = client.query(
    collection_name="test_collection",
    query_text=query_text,
)

This is a very minimal implementation and there may be better ways to do this, but it can be modified to suit your purpose.
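To confirm that a collection really ended up with the requested metric, the vector configuration can be read back after creation. The following is a minimal sketch, assuming the object returned by `client.get_collection(...)` exposes the vectors at `.config.params.vectors` as either a single `VectorParams` or a dict of named ones; `collection_distances` is a hypothetical helper name, not part of qdrant-client.

```python
def collection_distances(client, collection_name):
    """Return {vector_name: distance} for a collection.

    `client` is expected to behave like a QdrantClient (assumption: the
    vector config lives at .config.params.vectors). An unnamed single
    vector is reported under the empty-string key.
    """
    vectors = client.get_collection(collection_name).config.params.vectors
    if vectors is None:
        return {}
    if isinstance(vectors, dict):
        # Named vectors: one VectorParams per vector field name.
        return {name: params.distance for name, params in vectors.items()}
    # A single unnamed vector.
    return {"": vectors.distance}


# Usage (not executed here; requires qdrant-client and an existing collection):
# client = QdrantClient(path="./db/")
# print(collection_distances(client, "test_collection"))
```

If the helper above worked as intended, the entry for the FastEmbed vector field should show the DOT metric rather than COSINE.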
Issue
Not able to change distance model when creating a collection with FastEmbed.
Minimal steps to reproduce
Result
{'fast-paraphrase-multilingual-mpnet-base-v2': VectorParams(size=768, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}
Expected result
Distance should be Distance.DOT
Environment
OS: Ubuntu 22.04.1
qdrant-client==1.10.1
fastembed==0.3.4
More details
When creating a collection with Sentence Transformers, I am able to set a different distance metric (DOT, EUCLID). With FastEmbed, the distance always seems to be cosine. Looking at the code, it also seems that all models are initialized with cosine distance only.
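For context on how the two metrics relate: cosine similarity is the dot product of two vectors after each is scaled to unit length, so the scores coincide for unit-normalized vectors and differ otherwise. The sketch below illustrates this in plain Python with made-up vectors; whether a given FastEmbed model emits normalized vectors is not stated in this thread and is left as an open assumption.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


# Illustrative values only, not real embeddings.
a, b = [3.0, 4.0], [4.0, 3.0]

print("dot:", dot(a, b))                                   # depends on vector length
print("cosine:", cosine(a, b))                             # ignores vector length
print("dot (normalized):", dot(normalize(a), normalize(b)))  # agrees with cosine
```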