
Unify embedding model / tokenizer for builtin source storages? #71

Closed
pmeier opened this issue Oct 12, 2023 · 2 comments · Fixed by #107

Comments

@pmeier
Member

pmeier commented Oct 12, 2023

We currently use different embedding models and tokenizers for our builtin source storages.

So far I've just used whatever the documentation of the respective tool suggested. For Chroma that wasn't really an issue. However, for LanceDB, added in #66, this pulled in a ton of heavy dependencies:

PackageRequirement("lancedb>=0.2"),
# FIXME: re-add this after https://github.com/apache/arrow/issues/38167 is
# resolved.
# PackageRequirement("pyarrow"),
PackageRequirement("sentence-transformers"),

Since we build ragna.builtin_config at import time

ragna/ragna/__init__.py, lines 35 to 38 at e851297:

# Register every available builtin SourceStorage and Assistant subclass.
for module, cls in [(source_storage, SourceStorage), (assistant, Assistant)]:
    for obj in module.__dict__.values():
        if isinstance(obj, type) and issubclass(obj, cls) and obj.is_available():
            builtin_config.register_component(obj)

and PackageRequirement.is_available() performs the actual import, we now have crazy overhead:

  • With LanceDB
    $ time python -c "import ragna"
    python -c "import ragna"  6,23s user 2,41s system 124% cpu 6,930 total
    
  • Without LanceDB
    $ time python -c "import ragna"
    python -c "import ragna"  2,28s user 1,63s system 156% cpu 2,497 total
    
  • Without LanceDB and Chroma
    $ time python -c "import ragna"
    python -c "import ragna"  1,26s user 0,20s system 99% cpu 1,458 total
    
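For context, here is a hypothetical sketch of what such an import-based availability check looks like; ragna's actual PackageRequirement is more involved, but the cost profile is the same:

import importlib

class PackageRequirement:
    def __init__(self, requirement: str):
        # Reduce "lancedb>=0.2" to "lancedb". The real mapping from
        # distribution name to module name is more involved, e.g.
        # sentence-transformers imports as sentence_transformers.
        self._name = requirement.split(">=")[0].split("==")[0].replace("-", "_")

    def is_available(self) -> bool:
        try:
            # The import itself is the expensive part: probing for
            # sentence_transformers pulls in pytorch and friends.
            importlib.import_module(self._name)
        except ImportError:
            return False
        return True
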

That in itself wouldn't be an issue if this specific embedding model / tokenizer were required for LanceDB. But it isn't.

My proposal is twofold:

  1. We should use the same embedding model / tokenizer for all builtin source storages (see the sketch after this list). Since the source storages basically just store vectors, I currently can't imagine a case where one of them would require a specific configuration. Even if one did, providing the same default for all other source storages keeps our dependencies, and in turn our import time, to a minimum.
  2. Instead of using a "random" embedding model / tokenizer, we should use lightweight ones. The ones used by Chroma look like a good starting point, but maybe we can do better? @dillonroach do you have insights here?
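
To make proposal 1 concrete, here is a minimal sketch of a shared embedding helper; the names and the model choice (all-MiniLM-L6-v2, Chroma's default) are illustrative assumptions, not ragna's actual API:

from functools import lru_cache

@lru_cache(maxsize=1)
def _default_model():
    # Imported lazily so that merely importing ragna stays cheap; the
    # heavy sentence-transformers/pytorch stack only loads on first use.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> list[list[float]]:
    # encode() returns numpy arrays; convert to plain lists of floats so
    # every source storage backend can consume them directly.
    return [vector.tolist() for vector in _default_model().encode(texts)]

The lru_cache plus the function-local import mean the model is only loaded the first time a source storage actually embeds something, regardless of which storages are installed.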
@dillonroach

Per our side conversation: https://huggingface.co/spaces/mteb/leaderboard has a set of benchmarks they run against a number of these models. As they say, 'your mileage may vary', but it's a decent starting point. The BGE models, and https://huggingface.co/BAAI/bge-small-en-v1.5 in particular, jump out as striking a good balance between performance and size.
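
For reference, trying that model out via sentence-transformers takes a few lines (standard sentence-transformers usage; note this is exactly the heavy dependency the issue wants to avoid):

from sentence_transformers import SentenceTransformer

# bge-small-en-v1.5 is a roughly 130 MB model with 384-dimensional output.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(["What is a source storage?"])
print(embeddings.shape)  # (1, 384)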

If the goal is to match what's used elsewhere, encoding = tiktoken.get_encoding("cl100k_base") is the 'default' for GPT-4/GPT-3.5-turbo and text-embedding-ada-002. It's also worth noting that the tokenizer bundled with the Llama models is a BPE model based on sentencepiece. There's some good work specific to tokenizers in the latest transformers release, https://github.com/huggingface/transformers/releases/tag/v4.34.0, and one can go digging there for the latest changes.
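For completeness, counting tokens with that encoding via the public tiktoken API:

import tiktoken

# The encoding used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002.
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("How many tokens is this sentence?")
print(len(tokens))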

@pmeier
Member Author

pmeier commented Oct 25, 2023

With #72 we no longer have the massive overhead at import time, but we are still pulling in multiple GBs of dependencies. A Docker image based on python:3.11 is ~6 GB.

The folks over at Chroma had the same issue and solved it:

# In order to remove dependencies on sentence-transformers, which in turn depends on
# pytorch and sentence-piece we have created a default ONNX embedding function that
# implements the same functionality as "all-MiniLM-L6-v2" from sentence-transformers.
# visit https://github.com/chroma-core/onnx-embedding for the source code to generate
# and verify the ONNX model.
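
For illustration, using that ONNX-backed default through chromadb looks roughly like this (the exact import path and call signature vary across chromadb versions, so treat this as a sketch):

from chromadb.utils import embedding_functions

# Wraps the ONNX port of all-MiniLM-L6-v2; no pytorch or
# sentence-transformers required.
embed = embedding_functions.DefaultEmbeddingFunction()
vectors = embed(["Ragna orchestrates RAG workflows."])
print(len(vectors[0]))  # 384-dimensional embedding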
