v0.3.8 Improved random access for non-numeric columns and duckdb extension
You can now query Lance datasets outside of Python using DuckDB! Thanks to @dacort for making the Lance extension play nice with DuckDB. dbt-duckdb-lance, anyone? You can find the extension under integration/duckdb_lance.
We're also excited to release a substantial performance optimization for random access on non-numeric columns.
Previously, if you wanted to fetch a string or blob column along with nearest neighbor search results, the unoptimized take in the binary decoder could add 5-20x latency overhead, depending on the sparsity of the indices. In this release we've optimized take so that fetching these columns is basically a free operation.
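For readers unfamiliar with why take on string or blob columns is harder than on numeric ones, here is a minimal sketch in plain Python. It assumes an Arrow-style layout (one contiguous data buffer plus an offsets array) and is only an illustration of the idea; `encode_binary` and `take_binary` are hypothetical helpers, not Lance's decoder or API.

```python
from typing import List, Sequence, Tuple


def encode_binary(values: Sequence[bytes]) -> Tuple[bytes, List[int]]:
    """Pack variable-length values into one data buffer plus an offsets array.

    Hypothetical helper: row i lives at data[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    data = bytearray()
    for value in values:
        data.extend(value)
        offsets.append(len(data))
    return bytes(data), offsets


def take_binary(data: bytes, offsets: Sequence[int], indices: Sequence[int]) -> List[bytes]:
    """Fetch only the requested rows by slicing between consecutive offsets.

    Unlike a fixed-width numeric column, a row's position cannot be computed
    from its index alone; the offsets must be consulted for every row.
    """
    return [data[offsets[i]:offsets[i + 1]] for i in indices]


data, offsets = encode_binary([b"cat", b"", b"giraffe", b"ox"])
rows = take_binary(data, offsets, [2, 0, 3])
```

With sparse indices, such as the handful of row ids returned by a nearest neighbor search, the cost is dominated by how much offset and data content has to be read to serve a few rows, which is why the sparsity of the indices matters.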
While most of the Rust work for filter pushdown is complete, we've had to delay the general release of this feature until we overcome some rough edges in making pyarrow compute Expressions play nice with datafusion and sqlparser-rs. It'll be worth the wait though, we promise!
Cosine similarity has shipped, but recall is currently lower due to some issues during index creation. We recommend that you stick with the default L2 distance metric until we address this in the next few releases.
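As a quick reminder of how the two metrics differ, here is a small plain-Python sketch of the textbook definitions. These functions are illustrative only and are not how Lance computes distances internally (for example, whether the index uses squared L2 is not something this sketch assumes).

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: larger means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction as the query, twice the length
nearby_point = [0.9, 0.5]    # closer in space, but at a different angle

# The two metrics can disagree on which candidate is the nearest neighbor.
l2_prefers_nearby = l2_distance(query, nearby_point) < l2_distance(query, same_direction)
cosine_prefers_same_direction = (
    cosine_similarity(query, same_direction) > cosine_similarity(query, nearby_point)
)
```

Because L2 is sensitive to vector length and cosine similarity is not, the two can rank neighbors differently, so results may shift if you switch metrics on unnormalized vectors.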
We'd love to hear from you!
What's Changed
- Update extension for v0.7.0 compatibility by @dacort in #599
- Remove -j from DuckDB build script by @changhiskhan in #601
- a minor preparatory refactor by @changhiskhan in #598
- fix gha duckdb trigger paths by @changhiskhan in #602
- Use MetricType to specify the metric / distance compute function by @eddyxu in #600
- [Python] Specify metric type in Dataset.create_index by @eddyxu in #603
- [Rust] Implement a datafusion physical expr Column that can read nested columns by @eddyxu in #610
- benchmark query performance on 768D vectors by @changhiskhan in #607
- Parse sql filter clause to create datafusion physical expression by @eddyxu in #609
- Schema exclude fields by @eddyxu in #613
- Exec filter during Scan by @eddyxu in #612
- workaround to prevent the segfault until we figure out the real problem by @changhiskhan in #616
- Improve random access on binary encoding by @eddyxu in #615
- [Python] Support filter pushdown from Python Dataset API by @eddyxu in #618
- refactor benchmark to use cosine similarity by @changhiskhan in #611
- Encoding shared slices of arrays. by @eddyxu in #620
- Fix plain encoding by @eddyxu in #622
- Fix crash with column projection with ann search by @eddyxu in #624
- Relax data type matching float numbers in filter pushdown by @eddyxu in #625
- python integration tests for vector index by @changhiskhan in #623
- Remove filter pushdown from python api for now by @changhiskhan in #628
- PlainDecoder take on boolean values by @eddyxu in #627
- remove debug prints by @changhiskhan in #633
- Scan node to detect channel close and gracefully break the scan. by @eddyxu in #635
New Contributors
Full Changelog: v0.3.7...v0.3.8