Unify embedding model / tokenizer for builtin source storages? #71
Per our side conversation: https://huggingface.co/spaces/mteb/leaderboard runs a set of benchmarks against a number of these models. As they say, "your mileage may vary", but it's a decent starting point. The bge family, and https://huggingface.co/BAAI/bge-small-en-v1.5 in particular, jumps out as striking a good balance between performance and size. If the goal is to match what's used elsewhere,
With #72 we no longer have the massive overhead at import, but we are still pulling in multiple GBs as dependencies. The folks over at Chroma had the same issue and solved it:
We currently use different embedding models and tokenizers for our builtin source storages:
ragna/ragna/source_storage/_chroma.py
Lines 39 to 42 in e851297
ragna/ragna/source_storage/_lancedb.py
Line 36 in e851297
ragna/ragna/source_storage/_lancedb.py
Lines 51 to 52 in e851297
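One way to unify this would be a single embedding interface that both source storages receive, instead of each one hard-coding its own model and tokenizer. Below is a minimal, dependency-free sketch; the `EmbeddingFunction` protocol and `HashEmbedding` class are hypothetical names, and the hash-based embedder is only a toy stand-in for a real model such as BAAI/bge-small-en-v1.5:

```python
from __future__ import annotations

import hashlib
from typing import Protocol


class EmbeddingFunction(Protocol):
    """Interface a unified embedding model would satisfy."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedding:
    """Toy stand-in for a real embedding model.

    Maps each text to a fixed-size vector derived from its SHA-256 digest,
    so the sketch stays dependency-free while keeping the interface shape
    a real model wrapper would have.
    """

    def __init__(self, dimensions: int = 8) -> None:
        self.dimensions = dimensions

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode()).digest()
            # Normalize bytes into floats in [0, 1] for a fixed-length vector.
            vectors.append(
                [digest[i % len(digest)] / 255 for i in range(self.dimensions)]
            )
        return vectors


# Both Chroma and LanceDB storages would be handed the *same* instance,
# so documents are embedded consistently regardless of the backend.
shared = HashEmbedding(dimensions=8)
print(len(shared.embed(["hello world"])[0]))
```

With a shape like this, the choice of model becomes a single configuration point rather than something duplicated across `_chroma.py` and `_lancedb.py`.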
So far I've just used whatever the documentation of the respective tool suggested. For Chroma that wasn't really an issue. However, for LanceDB, added in #66, this pulled in tons of heavy dependencies:
ragna/ragna/source_storage/_lancedb.py
Lines 21 to 25 in e851297
Since we build ragna.builtin_config at import time
ragna/ragna/__init__.py
Lines 35 to 38 in e851297
and PackageRequirement.is_available() performs the import, we now have a crazy overhead.
That in itself wouldn't be an issue if the specific embedding model / tokenizer were actually required for LanceDB. But it isn't.
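One way to avoid paying the import cost just to check availability is to consult the import machinery's finders without executing the module. A rough sketch, assuming we can swap the check inside `PackageRequirement.is_available()` (the real ragna API may differ):

```python
import importlib.util


def is_available(package: str) -> bool:
    """Check whether a top-level package is installed without importing it.

    importlib.util.find_spec only asks the finders for a module spec, so
    heavy packages (torch, transformers, ...) are not actually loaded at
    ragna import time. Hypothetical replacement for the check performed by
    PackageRequirement.is_available().
    """
    return importlib.util.find_spec(package) is not None


print(is_available("json"))              # stdlib module, installed
print(is_available("not_a_real_pkg_x"))  # missing package
```

Note that `find_spec` still imports parent packages when given a dotted submodule name, so this is cheapest when checking top-level package names.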
My proposal is twofold: