LanceDB - Remove Orphaned Chunks #1620

Open · wants to merge 145 commits into base: devel

Conversation

@Pipboyguy Pipboyguy (Collaborator) commented Jul 21, 2024

Description

This PR lays the groundwork for managing chunked documents and their embeddings efficiently in LanceDB, focusing on merge writes with referential integrity. This PR does not implement document chunking/splitting, which will be addressed in #1615.

  • Automatically remove orphaned chunks when the parent document is updated or deleted (a rough sketch of the idea is shown below).
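A minimal sketch of that idea, assuming the standard dlt child-table columns (_dlt_id, _dlt_parent_id) and illustrative table names and paths; this is not the exact code added by this PR:

# Conceptual sketch only: after the parent table has been merged, child rows
# whose parent `_dlt_id` no longer exists are treated as orphans and deleted.
# The database path and table names here are illustrative assumptions.
import lancedb
import pyarrow.compute as pc

db_client = lancedb.connect("/path/to/.lancedb")
parent_tbl = db_client.open_table("documents")
child_tbl = db_client.open_table("documents__chunks")

parent_ids = set(pc.unique(parent_tbl.to_arrow()["_dlt_id"]).to_pylist())
child_parent_ids = set(pc.unique(child_tbl.to_arrow()["_dlt_parent_id"]).to_pylist())

orphaned = child_parent_ids - parent_ids
if orphaned:
    id_list = ", ".join(f"'{i}'" for i in orphaned)
    child_tbl.delete(f"_dlt_parent_id IN ({id_list})")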

Related Issues

Signed-off-by: Marcel Coetzee <[email protected]>
@Pipboyguy Pipboyguy linked an issue Jul 21, 2024 that may be closed by this pull request

netlify bot commented Jul 21, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: a5a1657
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/66f3129615f22e00080eaf5c

@Pipboyguy Pipboyguy self-assigned this Jul 21, 2024
@Pipboyguy Pipboyguy added bug Something isn't working destination Issue related to new destinations labels Jul 21, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
@Pipboyguy Pipboyguy requested review from sh-rp and rudolfix July 21, 2024 21:21
# Remove orphaned parent IDs.
if parent_table_name:
    try:
        parent_tbl = db_client.open_table(parent_table_name)
Collaborator

this will not work, there is no guarantee that the parent table is fully loaded (with all possible load jobs) at the point in time of the execution of the child table job. If we do it this way, we will need table chain followup jobs, similar to the way merge jobs are done. You can have a look at the default sql client for that.

@Pipboyguy Pipboyguy (Author) Jul 31, 2024

Scheduled orphan removals separately in LanceDBRemoveOrphansJob

"Couldn't open lancedb database. Batch WILL BE RETRIED"
) from e

parent_ids = set(pc.unique(parent_tbl.to_arrow()["_dlt_id"]).to_pylist())
Collaborator

I wonder if this approach will work well for large datasets...

@Pipboyguy Pipboyguy (Author) Aug 5, 2024

Subqueries don't work, so we can't use the DataFusion SQL engine here. It will have to be batched, and some processing will have to be done client-side.
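A sketch of what that client-side batching could look like (an assumption, not this PR's implementation); child_tbl, parent_ids and child_parent_ids are assumed to be computed as in the surrounding job:

# Illustrative batching sketch: delete orphaned child rows in fixed-size
# chunks so no single IN clause has to carry the whole id set at once.
BATCH_SIZE = 10_000

orphaned_ids = sorted(child_parent_ids - parent_ids)
for i in range(0, len(orphaned_ids), BATCH_SIZE):
    chunk = orphaned_ids[i : i + BATCH_SIZE]
    id_list = ", ".join(f"'{x}'" for x in chunk)
    child_tbl.delete(f"_dlt_parent_id IN ({id_list})")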

@Pipboyguy Pipboyguy (Author) Aug 6, 2024

@sh-rp Made this easier on the client by utilizing Lance projection push-down, only pulling the columns we need.
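Roughly, the projection push-down looks like this (a sketch, not the exact code in this PR; it assumes to_lance() exposes the underlying Lance dataset and to_table(columns=...) performs the projection):

# Only the `_dlt_id` column is materialised instead of the whole parent table.
import pyarrow.compute as pc

parent_tbl = db_client.open_table(parent_table_name)
parent_dlt_ids = parent_tbl.to_lance().to_table(columns=["_dlt_id"])["_dlt_id"]
parent_ids = set(pc.unique(parent_dlt_ids).to_pylist())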

@sh-rp sh-rp (Collaborator) commented Jul 23, 2024

Thanks for starting work on this. I am wondering if it would be possible to get this to work without relying on the parent tables so that adding a parent table is optional in this scenario. Theoretically it should be possible to use the dlt_load_id to discover outdated embeddings, but I have not thought it through 100%. Do you think the following is somehow possible assuming we are only using 1 table:

  • Every "original" document has a unique ID which will be inherited as a column by all embeddings (it will just be sent as a column from the resource, but we need a hint for the schema to identify this column)
  • We have this compound key created from the original document id and the row hash for doing the merge update the way you already do it (I think)
  • After the full table is loaded, we run a follow-up job, which will identify each "original" document id that has new data loaded during this run (we can find those with the dlt_load_id) and then delete all records that match these original document IDs but do NOT match the current load id (a rough sketch of this follow-up delete is shown below).
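A rough sketch of that follow-up delete, under the assumptions above (the doc_id column name, current_load_id, and db_client are illustrative; only _dlt_load_id is a standard dlt column):

import pyarrow.compute as pc

tbl = db_client.open_table("documents")

# 1. Which "original" document ids received new data in this load?
loaded = tbl.to_lance().to_table(
    columns=["doc_id"], filter=f"_dlt_load_id = '{current_load_id}'"
)
affected = pc.unique(loaded["doc_id"]).to_pylist()

# 2. Delete every row for those documents that does not carry the current load id.
if affected:
    id_list = ", ".join(f"'{d}'" for d in affected)
    tbl.delete(f"doc_id IN ({id_list}) AND _dlt_load_id != '{current_load_id}'")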

@Pipboyguy Pipboyguy (Author) commented Jul 27, 2024

@sh-rp I've added this hint to be used with root table orphan removal. It works and is absolutely a valid strategy!

If the user wants to remove root table orphans, they need to explicitly define the hint as mentioned. Surprisingly, the primary and merge keys do work for nested tables, though.

Regardless, in order to avoid confusion, I made the document id field hint raise an exception on merge disposition if a primary key is also defined, as this leads to confusing behaviour.
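For illustration, usage could look roughly like this; the hint name is an assumption (shown here as a document_id argument on lancedb_adapter) and may not match the final API:

import dlt
from dlt.destinations.adapters import lancedb_adapter

@dlt.resource(write_disposition="merge", table_name="documents")
def documents():
    yield [
        {"doc_id": "a", "text": "first chunk of document a"},
        {"doc_id": "a", "text": "second chunk of document a"},
    ]

# `document_id` is the hypothetical hint name; note that combining it with a
# primary key on a merge disposition is expected to raise, per the comment above.
lancedb_adapter(documents, embed="text", document_id="doc_id")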

…-efficient-update-strategy-for-chunked-documents
Signed-off-by: Marcel Coetzee <[email protected]>
…efficient-update-strategy-for-chunked-documents

# Conflicts:
#	dlt/destinations/impl/lancedb/lancedb_adapter.py
#	tests/load/lancedb/test_merge.py
Signed-off-by: Marcel Coetzee <[email protected]>
@rudolfix rudolfix (Collaborator) left a comment

yeah we are getting there!

@@ -192,11 +206,14 @@ def upload_batch(
elif write_disposition == "replace":
    tbl.add(records, mode="overwrite")
Collaborator

You cannot overwrite tables here. What if you have many jobs for a single table? They will overwrite each other. Please rewrite your tests for replace to generate many rows and use

os.environ["DATA_WRITER__BUFFER_MAX_ITEMS"] = "2"
os.environ["DATA_WRITER__FILE_MAX_ITEMS"] = "2"

which will make sure that you get many jobs. (make sure in your test that you have many jobs per table)

dlt already takes care of truncating the right tables in initialize_storage. Other destinations simply do append or merge here.
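A rough sketch of the kind of test being asked for (illustrative names, not the PR's actual test):

import os
import dlt

# Tiny buffer/file limits force many load jobs per table.
os.environ["DATA_WRITER__BUFFER_MAX_ITEMS"] = "2"
os.environ["DATA_WRITER__FILE_MAX_ITEMS"] = "2"

@dlt.resource(write_disposition="replace", table_name="docs")
def docs():
    yield from ({"id": i, "text": f"row {i}"} for i in range(50))

pipeline = dlt.pipeline(pipeline_name="lancedb_replace_test", destination="lancedb")
pipeline.run(docs())
pipeline.run(docs())  # the second run must replace, not append to, the 50 rows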

Collaborator Author

Fixed

with FileStorage.open_zipsafe_ro(file_path, mode="rb") as f:
    payload_arrow_table: pa.Table = pq.read_table(f)

if target_is_root_table:
Collaborator

OK this is really good but IMO there's a mistake when deleting nested tables:

  • to delete orphan from root table we use merge key of the root table (good)
  • to delete from the child table we should use all dlt_ids that you can find in the root table, not in the child table. Why? Because (1) if you remove all nested elements of a row in the root table you will not delete all orphans, and (2) if you remove an entire row from the root table you also won't find the proper root key

please write tests for that.

Also, could you take the taxi dataset and try to merge all rows (where the IN clause will be insanely long)? I wonder what happens: will that be an O(n^2) operation, or will they build a hash table for the IN clause?

Collaborator Author

#%%
import time

import lancedb
import matplotlib.pyplot as plt
import numpy as np
import pyarrow as pa
#%%
dim = 1536
num_rows = 1_000_000
batch_size = 10_000
lancedb_path = "/tmp/.lancedb"
table_name = "vectors"
#%%
def next_batch(batch_size, offset):
    values = pa.array(np.random.rand(dim * batch_size).astype('float32'))
    return pa.table({
        'id': pa.array([offset + j for j in range(batch_size)]),
        'vector': pa.FixedSizeListArray.from_arrays(values, dim),
        'metric': pa.array(np.random.rand(batch_size)),
    }).to_batches()[0]


def batch_iter(num_rows):
    i = 0
    while i < num_rows:
        current_batch_size = min(batch_size, num_rows - i)
        yield next_batch(current_batch_size, i)
        i += current_batch_size


def create_filter_condition(field_name: str, values: np.ndarray) -> str:
    return f"{field_name} IN ({', '.join(map(str, values))})"
#%%
db = lancedb.connect(lancedb_path)
schema = next_batch(1, 0).schema
table = db.create_table(table_name, data=batch_iter(num_rows), schema=schema, mode="overwrite")

in_clause_sizes = [1000, 5000, 10000, 50000, 100000]
execution_times_no_index = []
execution_times_with_index = []

# Measure execution times without an index.
for size in in_clause_sizes:
    unique_ids = np.random.choice(num_rows, size, replace=False)
    filter_condition = create_filter_condition('id', unique_ids)

    start_time = time.time()
    _ = table.search().where(filter_condition).limit(10).to_pandas()
    end_time = time.time()

    execution_time_no_index = end_time - start_time
    execution_times_no_index.append(execution_time_no_index)

    print(f"Without Index - IN clause size: {size}, Execution time: {execution_time_no_index:.2f} seconds")

# Measure execution times with index.
table.create_scalar_index("id", index_type="BTREE")
for size in in_clause_sizes:
    unique_ids = np.random.choice(num_rows, size, replace=False)
    filter_condition = create_filter_condition('id', unique_ids)

    start_time = time.time()
    _ = table.search().where(filter_condition).limit(10).to_pandas()
    end_time = time.time()

    execution_time_with_index = end_time - start_time
    execution_times_with_index.append(execution_time_with_index)

    print(f"With Index - IN clause size: {size}, Execution time: {execution_time_with_index:.2f} seconds")
#%%
plt.figure(figsize=(10, 6))
plt.plot(in_clause_sizes, execution_times_no_index, marker='o', label='Without Index', color='blue')
plt.plot(in_clause_sizes, execution_times_with_index, marker='*', label='With Index', color='red')
plt.title('Execution Time vs IN Clause Size')
plt.xlabel('Number of IDs in IN Clause')
plt.ylabel('Execution Time (seconds)')
plt.grid(True)
plt.legend()
plt.show()
#%%

[Plot: Execution Time vs IN Clause Size, with and without a scalar index]

Collaborator Author

Setting a scalar index does seem to offer constant time complexity. Clearly this is worth investigating in future PRs.

dlt/destinations/impl/lancedb/lancedb_client.py (outdated review comment, resolved)
…egy-for-chunked-documents

# Conflicts:
#	dlt/destinations/impl/lancedb/lancedb_client.py
#	docs/website/docs/dlt-ecosystem/destinations/lancedb.md
#	poetry.lock
#	tests/load/lancedb/test_pipeline.py
#	tests/load/lancedb/utils.py
… conditions with replace disposition

Signed-off-by: Marcel Coetzee <[email protected]>
…egy-for-chunked-documents

# Conflicts:
#	poetry.lock
Signed-off-by: Marcel Coetzee <[email protected]>
…egy-for-chunked-documents

# Conflicts:
#	dlt/destinations/impl/lancedb/factory.py
#	dlt/destinations/impl/lancedb/lancedb_client.py
#	dlt/destinations/impl/lancedb/schema.py
#	poetry.lock
#	tests/load/lancedb/test_pipeline.py
Signed-off-by: Marcel Coetzee <[email protected]>
Labels: bug (Something isn't working), destination (Issue related to new destinations)
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Chunk and Embedding Management in LanceDB
3 participants