
Unify embedding model / tokenizer for builtin source storages? #71

Closed
pmeier opened this issue Oct 12, 2023 · 2 comments · Fixed by #107

Comments

@pmeier
Member

pmeier commented Oct 12, 2023

We currently use different embedding models and tokenizers for our builtin source storages.

So far I've just used whatever the documentation of the respective tool suggested. For Chroma that wasn't really an issue. However, for LanceDB, added in #66, this pulled in a ton of heavy dependencies:

PackageRequirement("lancedb>=0.2"),
# FIXME: re-add this after https://github.com/apache/arrow/issues/38167 is
# resolved.
# PackageRequirement("pyarrow"),
PackageRequirement("sentence-transformers"),

Since we build ragna.builtin_config at import time

ragna/ragna/__init__.py, lines 35 to 38 at e851297:

# Register every available builtin SourceStorage and Assistant subclass.
for module, cls in [(source_storage, SourceStorage), (assistant, Assistant)]:
    for obj in module.__dict__.values():
        if isinstance(obj, type) and issubclass(obj, cls) and obj.is_available():
            builtin_config.register_component(obj)

and PackageRequirement.is_available() performs the actual import, we now have crazy overhead:

  • With LanceDB
    $ time python -c "import ragna"
    python -c "import ragna"  6,23s user 2,41s system 124% cpu 6,930 total
    
  • Without LanceDB
    $ time python -c "import ragna"
    python -c "import ragna"  2,28s user 1,63s system 156% cpu 2,497 total
    
  • Without LanceDB and Chroma
    $ time python -c "import ragna"
    python -c "import ragna"  1,26s user 0,20s system 99% cpu 1,458 total
    
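For context, here is a hypothetical sketch of what such an import-based availability check looks like; ragna's actual PackageRequirement is more involved, but the cost profile is the same:

import importlib

class PackageRequirement:
    def __init__(self, requirement: str):
        # Reduce "lancedb>=0.2" to "lancedb". The real mapping from
        # distribution name to module name is more involved, e.g.
        # sentence-transformers imports as sentence_transformers.
        self._name = requirement.split(">=")[0].split("==")[0].replace("-", "_")

    def is_available(self) -> bool:
        try:
            # The import itself is the expensive part: probing for
            # sentence_transformers pulls in pytorch and friends.
            importlib.import_module(self._name)
        except ImportError:
            return False
        return True
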

That in itself wouldn't be an issue if this specific embedding model / tokenizer were required for LanceDB. But it isn't.

My proposal is twofold:

  1. We should use the same embedding model / tokenizer for all builtin source storages (see the sketch after this list). Since the source storages basically just store vectors, I currently can't imagine a case where one of them would require a specific configuration. Even if one did, providing the same default for all other source storages keeps our dependencies, and in turn our import time, to a minimum.
  2. Instead of using a "random" embedding model / tokenizer, we should use lightweight ones. The ones used by Chroma look like a good starting point, but maybe we can do better? @dillonroach do you have insights here?
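
To make proposal 1 concrete, here is a minimal sketch of a shared embedding helper; the names and the model choice (all-MiniLM-L6-v2, Chroma's default) are illustrative assumptions, not ragna's actual API:

from functools import lru_cache

@lru_cache(maxsize=1)
def _default_model():
    # Imported lazily so that merely importing ragna stays cheap; the
    # heavy sentence-transformers/pytorch stack only loads on first use.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> list[list[float]]:
    # encode() returns numpy arrays; convert to plain lists of floats so
    # every source storage backend can consume them directly.
    return [vector.tolist() for vector in _default_model().encode(texts)]

The lru_cache plus the function-local import mean the model is only loaded the first time a source storage actually embeds something, regardless of which storages are installed.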
@dillonroach

Per our side conversation: https://huggingface.co/spaces/mteb/leaderboard has a set of benchmarks they run against a number of these models. As they say, 'your mileage may vary', but it's a decent starting point. The BGE models, and https://huggingface.co/BAAI/bge-small-en-v1.5 in particular, jump out as striking a good balance between performance and size.
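
For reference, trying that model out via sentence-transformers takes a few lines (standard sentence-transformers usage; note this is exactly the heavy dependency the issue wants to avoid):

from sentence_transformers import SentenceTransformer

# bge-small-en-v1.5 is a roughly 130 MB model with 384-dimensional output.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(["What is a source storage?"])
print(embeddings.shape)  # (1, 384)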

If the goal is to match what's used elsewhere, encoding = tiktoken.get_encoding("cl100k_base") is the 'default' for GPT-4/GPT-3.5-turbo and text-embedding-ada-002. It's also worth noting that the tokenizer bundled with the Llama models is a BPE model based on sentencepiece. There's some good work specific to tokenizers in the latest transformers release, https://github.com/huggingface/transformers/releases/tag/v4.34.0, and one can go digging there for the latest changes.
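For completeness, counting tokens with that encoding via the public tiktoken API:

import tiktoken

# The encoding used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002.
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("How many tokens is this sentence?")
print(len(tokens))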

@pmeier
Member Author

pmeier commented Oct 25, 2023

With #72 we no longer have the massive overhead at import time, but we are still pulling in multiple GBs of dependencies. A Docker image based on python:3.11 is ~6 GB.

The folks over at Chroma had the same issue and solved it:

# In order to remove dependencies on sentence-transformers, which in turn depends on
# pytorch and sentence-piece we have created a default ONNX embedding function that
# implements the same functionality as "all-MiniLM-L6-v2" from sentence-transformers.
# visit https://github.com/chroma-core/onnx-embedding for the source code to generate
# and verify the ONNX model.
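
For illustration, using that ONNX-backed default through chromadb looks roughly like this (the exact import path and call signature vary across chromadb versions, so treat this as a sketch):

from chromadb.utils import embedding_functions

# Wraps the ONNX port of all-MiniLM-L6-v2; no pytorch or
# sentence-transformers required.
embed = embedding_functions.DefaultEmbeddingFunction()
vectors = embed(["Ragna orchestrates RAG workflows."])
print(len(vectors[0]))  # 384-dimensional embedding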
