
v0.9 release: stats-based predicate pushdown, scalar index, performance improvements, and bug fixes

Released by @eddyxu on 17 Dec 23:13

Summary

  • Stats-based predicate pushdown
  • Scalar index
  • TensorFlow and PyTorch data loaders
  • Pre-/post-filtering combined with vector search (see the sketch after this list)
  • Performance improvements across the stack
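
A minimal sketch of how the scalar index and pre-filtered vector search might be used together from Python, assuming the pylance Dataset API (create_scalar_index, create_index, and to_table with nearest/prefilter). The dataset path, column names, query vector, and index parameters below are illustrative, not taken from the release notes:

```python
import numpy as np
import lance

# Illustrative dataset with a "vector" embedding column and a "category" scalar column.
ds = lance.dataset("./my_dataset.lance")

# New in this release: scalar (BTREE) index on a metadata column.
ds.create_scalar_index("category", index_type="BTREE")

# IVF_PQ vector index; parameters here are placeholders, tune for your data.
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    metric="cosine",
    num_partitions=256,
    num_sub_vectors=16,
)

# Pre-filtered ANN search: the filter is applied before the vector search,
# so only rows matching the predicate are considered as candidates.
query = np.random.rand(768).astype("float32")
result = ds.to_table(
    nearest={"column": "vector", "q": query, "k": 10},
    filter="category = 'news'",
    prefilter=True,
)
print(result.num_rows)
```

Plain filtered scans without a vector query should also be able to benefit from the stats-based predicate pushdown without any change to query code.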

Breaking changes:

  • Changed the IVF_PQ algorithm for cosine distance; existing indexes built with cosine distance must be rebuilt (see the sketch after this list).
  • Bumped the minimum pyarrow version to 12.0+.
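
Because of the first change, IVF_PQ indexes that were built with the cosine metric need to be retrained after upgrading. A minimal sketch, assuming pylance's create_index supports a replace flag; the column name and index parameters are illustrative. pyarrow itself can be upgraded separately, e.g. `pip install -U "pyarrow>=12"`.

```python
import lance

# Illustrative dataset with an existing cosine IVF_PQ index on "vector".
ds = lance.dataset("./my_dataset.lance")

# Retrain the index under the new cosine IVF_PQ algorithm; replace=True
# swaps out the old index on this column. Parameters are placeholders.
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    metric="cosine",
    num_partitions=256,
    num_sub_vectors=16,
    replace=True,
)
```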

What's Changed

  • feat: add a cache for dynamodb schema validation by @chebbyChefNEQ in #1308
  • chore: speed up kmeans training for cosine by @eddyxu in #1334
  • feat: add take() method to RecordBatchExt by @eddyxu in #1337
  • feat(python): expose row id in python API by @eddyxu in #1339
  • feat: data generation of dbpedia dataset by @eddyxu in #1340
  • feat: build ivf partition using disk based shuffler by @eddyxu in #1312
  • feat: friendlier error messages in nearest api by @rok in #1336
  • chore: index / recall benchmark over dbpedia by @eddyxu in #1348
  • feat: support storing page-level stats by @wjones127 in #1316
  • fix: pq cosine fast lookup table by @eddyxu in #1354
  • chore: compute distance using pytorch and GPU/MPS by @eddyxu in #1351
  • feat: train kmeans using pytorch by @eddyxu in #1358
  • build: use larger runner for doc build by @eddyxu in #1364
  • feat(python): gpu based ivf partition training by @eddyxu in #1361
  • fix: stop reading latest manifest by @wjones127 in #1365
  • chore: improve kmeans training performance on CUDA by @eddyxu in #1368
  • chore: improve kmeans performance on MPS by @eddyxu in #1370
  • chore: use torch.index_add to compute new centroids, to improve training performance on MPS by @eddyxu in #1371
  • feat: schema::field_by_id() by @eddyxu in #1375
  • feat(python): design an image extension type by @rok in #1272
  • perf: improve KNN and ANN performance by @wjones127 in #1367
  • feat: collect int/float/boolean/date page-level statistics on write by @rok in #1346
  • chore: part 1/N of refactoring vector index into separate crate by @eddyxu in #1388
  • fix: handle larger arrays in take by @wjones127 in #1383
  • chore: cleanup tests to avoid errors when optional components are not present by @westonpace in #1374
  • chore: object write trait by @eddyxu in #1389
  • refactor: fast path to find fragments for flat scan by @eddyxu in #1394
  • refactor: move reader trait to lance-core by @eddyxu in #1393
  • refactor: move pq to lance-index by @eddyxu in #1400
  • refactor: make pq a batch transformer by @eddyxu in #1401
  • feat: run pq portion of ivf_pq in parallel by @westonpace in #1386
  • feat: generic shuffler over RecordBatchStream by @eddyxu in #1402
  • feat: remap indices on compaction by @westonpace in #1403
  • chore: remove unused crate by @eddyxu in #1405
  • docs: make overwrite row green by @wjones127 in #1409
  • feat: add removed_indices to CreateIndex transaction operation by @eddyxu in #1408
  • feat(rust): incremental index update by @eddyxu in #1406
  • feat(python): expose index optimization via python by @eddyxu in #1412
  • test: add test case to ensure optimize returns to flat KNN by @westonpace in #1416
  • fix: fix bug in index remapping when plan contained multiple rewrite groups by @westonpace in #1415
  • chore: upgrade to datafusion 32 by @wjones127 in #1391
  • ci: cross compile arm wheels by @wjones127 in #1407
  • test: add new ann scenarios to the python benchmarks by @westonpace in #1411
  • chore: instrument various steps in the ann search by @westonpace in #1404
  • refactor: refactor flat search to lance-index by @eddyxu in #1419
  • refactor: move encodings to lance-core by @eddyxu in #1425
  • feat: expose latest version id api by @chebbyChefNEQ in #1426
  • refactor: use function pointers instead of trait objects by @wjones127 in #1424
  • feat: automatically convert image to tensors in TF data pipeline by @rok in #1420
  • refactor: migrate schema and data types to lance-core by @eddyxu in #1429
  • perf: better parallelism in delete vector prefiltering by @westonpace in #1428
  • test: fix flaky tests involving tokio::fs::File by @westonpace in #1430
  • perf: use selection vector strategy to improve exact knn performance with deletions by @wjones127 in #1418
  • chore: use arrow 47 function by @eddyxu in #1439
  • refactor: move format definitions to lance-core by @eddyxu in #1440
  • refactor: migrate object reader and object writer by @eddyxu in #1442
  • fix: fix an issue where the GPU index trainer was taking too much data into memory by @westonpace in #1447
  • feat: store a separate tensor blob for IVF centroids by @eddyxu in #1446
  • refactor: move all Python operations to the same runtime by @wjones127 in #1445
  • chore: bump prost version to latest by @eddyxu in #1449
  • chore: update half to 2.3.1 by @jacobBaumbach in #1450
  • feat: allow prefiltering to be used with an index by @westonpace in #1435
  • feat: benchmark and improve L2 partition compute by @eddyxu in #1453
  • chore: increase ivf assignment parallelism during indexing by @eddyxu in #1451
  • feat: support keyboard interrupt in Python by @wjones127 in #1438
  • feat: add parameter to split by file size by @wjones127 in #1444
  • ci: fix ARM build due to Ring dependency by @wjones127 in #1462
  • refactor: move read and write manifest file to lance-core by @eddyxu in #1467
  • feat: create_index takes torch.device object by @eddyxu in #1465
  • feat: added dataset stats api by @albertlockett in #1452
  • refactor: move commit traits to lance-core by @eddyxu in #1469
  • chore: use ruff format to replace isort and black by @eddyxu in #1472
  • refactor: move ObjectStore, FileReader and FileWriter to lance-core by @eddyxu in #1473
  • perf: support multi-threading shuffler by @eddyxu in #1474
  • feat: poor man's SIMD lib by @eddyxu in #1478
  • perf: use simd lib to implement dot by @eddyxu in #1480
  • feat: expose progress on write_fragments and write_dataset by @wjones127 in #1464
  • feat: split out datagen utilities, expand them, expose to python by @westonpace in #1315
  • chore: remove outdated warnings about prefiltering with a vector index by @westonpace in #1484
  • fix: fix L2 computation on GPU by @eddyxu in #1485
  • perf: improve kmeans and make pq training multi-threaded by @eddyxu in #1479
  • chore: mention GPU support in README by @eddyxu in #1489
  • fix: fix PQ training metric type not being appropriately propagated by @eddyxu in #1493
  • docs: clarify behaviour of refine_factor by @albertlockett in #1496
  • ci: cancel in progress runs on new push by @albertlockett in #1497
  • chore: remove unused value settings by @eddyxu in #1494
  • feat: provide a f32x16 abstraction to make unrolling 256-bit code easier by @eddyxu in #1495
  • fix: remove channel closed messages by @wjones127 in #1502
  • perf: dimension-based kernel for L2 and Cosine by @eddyxu in #1503
  • feat: add location for all errors by @Weijun-H in #1475
  • feat: add sorting to the scanner by @westonpace in #1498
  • feat: add tf.data APIs for reading batches by @wjones127 in #1488
  • feat: experimental avx512 features by @eddyxu in #1506
  • feat: add read ahead for take scan by @wjones127 in #1501
  • feat: use caller location in error conversion functions by @chebbyChefNEQ in #1510
  • chore(rust): reduce debug message log level by @changhiskhan in #1512
  • feat: collect page-level statistics on write by @rok in #1335
  • feat(rust): simd ops of reduce min, min, find and gather by @eddyxu in #1514
  • feat: add btree scalar index by @westonpace in #1476
  • feat: support true in deletion logic by @Weijun-H in #1515
  • fix: make sure we have physical rows by @wjones127 in #1511
  • chore: benchmark of large IVF partitions by @eddyxu in #1524
  • feat: make dot generic to support bf16/f16/f32 with one dot_distance interface. by @eddyxu in #1522
  • chore: add same target-features to python pyo3 build by @eddyxu in #1527
  • feat: expose index cache configure via open dataset API by @eddyxu in #1523
  • fix: fix assertion of cosine values by @eddyxu in #1530
  • feat: generic cosine code by @eddyxu in #1537
  • perf: improve f16 performance for norm L2 on aarch64 by @eddyxu in #1539
  • feat: make L2 generic to work with all float numbers by @eddyxu in #1532
  • fix: pq index does not handle dot product metric correctly during search by @rok in #1536
  • chore: move scalar_index benchmark to break circular dependency by @westonpace in #1540
  • feat: safer API for physical_rows by @wjones127 in #1529
  • feat: implement datafusion tableprovider trait for Dataset by @universalmind303 in #1526
  • feat: expose Dataset.validate() in Python by @wjones127 in #1538
  • fix: add versioning and bypass broken row counts by @wjones127 in #1534
  • feat: generic kmeans that supports bf16 and f16 by @eddyxu in #1544
  • chore: disable avx512 for now by @eddyxu in #1546
  • chore: fix type inference errors in benchmarks by @westonpace in #1556
  • chore: provide a trait to dynamically dispatch different pq based on different vector data type by @eddyxu in #1555
  • chore: update the CI build to check/build all crates in the workspace and not just the lance crate by @westonpace in #1557
  • feat: make it possible to create and load scalar indices for a dataset by @westonpace in #1516
  • feat: generic Product Quantization by @eddyxu in #1560
  • test: add property-based testing for statistics by @wjones127 in #1554
  • feat: ffi to accelerate norm_l2 for f16 if the instruction set is available by @eddyxu in #1562
  • feat: extend FSL with sample by @eddyxu in #1572
  • feat: allow for more advanced storage options in objectstore by @universalmind303 in #1547
  • feat: implement as_slice for bfloat16 array by @eddyxu in #1574
  • perf: add bf16 benchmarks by @eddyxu in #1575
  • feat: f16 for L2 by @eddyxu in #1577
  • chore: update cc dependency to 1.0.83 by @westonpace in #1578
  • feat: make IVF model support f16 and bf16 by @eddyxu in #1573
  • feat: allow the scanner to take advantage of scalar indices by @westonpace in #1543
  • chore: dotprod should be on mac target, not haswell; better randomness for bf16 by @westonpace in #1579
  • feat: make Dataset::nearest() accept arbitrary query type by @eddyxu in #1582
  • chore(rust): remove extraneous dbg message by @changhiskhan in #1598
  • feat: torch cache-able dataset, with sampling support by @eddyxu in #1591
  • fix: tell writer correct schema when writing index file by @wjones127 in #1518
  • feat: add support for remapping scalar indices during compaction by @westonpace in #1571
  • refactor: switch to using DataFusions physical expr by @wjones127 in #1581
  • chore: various fixes for Python benchmarks by @wjones127 in #1513
  • feat: adaptive cuda allocation for l2/cosine distance computation by @eddyxu in #1601
  • fix: fix a memory leak where a dataset would not be fully deleted by @westonpace in #1606
  • fix: google objectstore uses proper gs configuration by @universalmind303 in #1608
  • perf: kmeans fit uses cached torch dataset by @eddyxu in #1603
  • fix: add migration for bad fragment bitmaps by @westonpace in #1611
  • feat: allow scalar indices to be updated with new data by @westonpace in #1576
  • feat: add python bindings for creating scalar indices by @westonpace in #1592
  • fix: handle no max value for string by @wjones127 in #1600
  • feat: expose index cache size by @rok in #1587
  • feat: track index cache hit rate by @rok in #1586
  • feat: serialize arbitrary float type of PQ to protobuf by @eddyxu in #1624
  • ci: use M1 runner for now for release by @wjones127 in #1623
  • feat: coerce float array for nearest query by @eddyxu in #1618
  • chore: expose avx512fp16 feature via main lance crate by @eddyxu in #1626
  • feat: make partition calculation parallel by @chebbyChefNEQ in #1625
  • feat(rust): simplify object store option API by @wjones127 in #1627
  • fix: fix chunk size issue by @wjones127 in #1630
  • perf: more efficient treemap implementation for row ids by @wjones127 in #1632
  • feat(python): add index_cache_hit_rate to index_stats() by @rok in #1631
  • chore: make lance-linalg benchmark ready to test bf16 data by @eddyxu in #1634
  • perf: fast L2 distance table build by @eddyxu in #1639
  • fix: correctly avg centroids in update logic in GPU IVF training by @chebbyChefNEQ in #1646
  • perf: add a fast path for converting bytes into array when the bytes have the correct alignment by @chebbyChefNEQ in #1652
  • fix: prevent OOM when IVF centroids are provided by @wjones127 in #1653
  • test: fix for test by @wjones127 in #1644
  • perf: minor change to cleanup allowing for size to be collected in parallel by @westonpace in #1649
  • perf: add type coercion for in-list expressions by @westonpace in #1655
  • chore: minor changes to tracing instrumentation by @westonpace in #1619
  • fix: fix error message for invalid nprobes by @albertlockett in #1666
  • feat: add support for update queries by @wjones127 in #1585
  • fix: support no-op filters again by @wjones127 in #1669
  • fix: row_id range fix for index training on gpu by @jerryyifei in #1663
  • feat: better warnings when the PQ assignment over cosine distance is wrong by @eddyxu in #1672
  • fix: add retries for failed response stream by @wjones127 in #1671
  • chore: add utility to compute ground truth for benchmarks by @eddyxu in #1668
  • fix: don't use scalar indices unless we are prefiltering by @westonpace in #1678
  • fix: lance pytorch dataset parameter to load with row_id by @eddyxu in #1676
  • feat: a tensor dataset that shares the same behavior as the Lance torch Dataset by @eddyxu in #1679
  • chore: add new python benchmarks for testing scalar indices by @westonpace in #1658
  • feat: add option to pass in precomputed row_id -> ivf partition mapping and compute partition on GPU by @chebbyChefNEQ in #1680
  • fix: make sure to prefilter the flat portion of a combined knn by @westonpace in #1583
  • perf: use datafusion to shuffle index partition data by @wjones127 in #1645
  • feat: add batch buffering and async loading to torch.LanceDataset by @chebbyChefNEQ in #1687
  • feat: optimized pushdown scanner by @wjones127 in #1328
  • fix: add shutdown to async loader by @chebbyChefNEQ in #1690
  • fix: use epsilon to handle all-zero cosine values by @eddyxu in #1696
  • fix: prevent stats meta from breaking old readers by @wjones127 in #1699
  • fix: add _rowid when use_stats=False by @wjones127 in #1700
  • perf: revert back to hashmap by @chebbyChefNEQ in #1692
  • fix: remove default memory cap for index training by @wjones127 in #1702
  • feat: do not use residual vector for cosine similarity by @eddyxu in #1708
  • feat: add support for new and deleted data to scalar indices by @westonpace in #1689
  • fix: update list_indices to report if an index is vector or scalar by @westonpace in #1710
  • perf: allow take to process multiple fragments in parallel by @westonpace in #1713
  • feat: turn on argument tracking in tracing by @wjones127 in #1706
  • perf: make sure we use multiple threads when scanning by @wjones127 in #1705
  • chore: kmeans fit takes pyarrow FixedSizeListArray by @eddyxu in #1714
  • revert: use epsilon to handle all-zero cosine values (#1696) by @eddyxu in #1715
  • chore: add ruff copyright check by @eddyxu in #1716
  • chore: compute pairwise cosine using pytorch by @eddyxu in #1717
  • chore: normalize vector kernel by @eddyxu in #1720
  • fix: fix l2 normalize by @eddyxu in #1722
  • perf: use an asynchronous open function even for local files by @westonpace in #1721
  • perf: small performance fixes for scan by @wjones127 in #1719
  • fix: cosine kmeans by @eddyxu in #1723
  • fix: cosine kmeans on GPU by @eddyxu in #1726
  • fix: pq code for cosine distance by @eddyxu in #1727
  • chore: adjust cosine value from l2 distance by @eddyxu in #1730
  • fix: various fixes to GPU kmeans by @chebbyChefNEQ in #1731
  • feat: handroll ivf partition shuffle by @chebbyChefNEQ in #1729

Full Changelog: v0.8.0...v0.9.0