LanceDB - Remove Orphaned Chunks #1620

Open
wants to merge 145 commits into
base: devel
145 commits
68e26a0
Add tests for LanceDB chunking and merging functionality
Pipboyguy Jul 16, 2024
7c2d031
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 16, 2024
4c555e9
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 17, 2024
6c734d7
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 18, 2024
900c4fa
Add TSplitter type alias for LanceDB document splitting function
Pipboyguy Jul 18, 2024
16230a7
Refine typing for chunks
Pipboyguy Jul 18, 2024
d3aeda2
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 19, 2024
3f7a82f
Add type definitions for chunk splitter function and related types
Pipboyguy Jul 19, 2024
1dda1d5
Remove unused ChunkInputT, ChunkOutputT, and TSplitter type definitions
Pipboyguy Jul 19, 2024
48e14ab
Implement efficient update strategy for chunked documents in LanceDB
Pipboyguy Jul 21, 2024
32fe174
Implement efficient update strategy for chunked documents in LanceDB
Pipboyguy Jul 21, 2024
d974962
Refactor LanceDB client and tests for improved readability and type s…
Pipboyguy Jul 21, 2024
e6cdf5d
Linting
Pipboyguy Jul 21, 2024
c7c2bc6
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 23, 2024
bf3c3d8
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 24, 2024
9c11964
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 25, 2024
a60737a
Add document_id parameter to lancedb_adapter and update merge logic
Pipboyguy Jul 27, 2024
cfe1a6d
Merge remote-tracking branch 'origin/devel' into 1587-lancedb-support…
Pipboyguy Jul 29, 2024
518a507
Remove resolved comments
Pipboyguy Jul 29, 2024
c10bd73
Implement efficient orphan removal for chunked documents in LanceDB
Pipboyguy Jul 29, 2024
24ada84
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 30, 2024
5b3acb1
Implement efficient update strategy for chunked documents in LanceDB
Pipboyguy Jul 30, 2024
cf6d86a
Add test for removing orphaned records in LanceDB
Pipboyguy Jul 30, 2024
d338586
Update LanceDB orphaned records removal test for chunked documents
Pipboyguy Jul 30, 2024
2376c6a
Set test pipeline as dev mode
Pipboyguy Jul 30, 2024
7f6f1cd
Fix write disposition check in LanceDBRemoveOrphansJob execute method
Pipboyguy Jul 30, 2024
b840f8b
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Jul 31, 2024
c276211
Add FollowupJob trait to LoadLanceDBJob
Pipboyguy Jul 31, 2024
dbfd5af
Fix file type
Pipboyguy Jul 31, 2024
257fbde
Fix file typing
Pipboyguy Jul 31, 2024
0502ddf
Add test for removing orphaned records in LanceDB root table
Pipboyguy Jul 31, 2024
2363b51
Enhance LanceDB test to cover nested child removal and update scenarios
Pipboyguy Jul 31, 2024
a296c77
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Aug 1, 2024
6b363d1
Use doc id hint for top level tables
Pipboyguy Aug 1, 2024
aac7647
Only join on join columns for orphan removal job
Pipboyguy Aug 1, 2024
e33b7cf
Add ollama to supported embedding providers and test orphaned record …
Pipboyguy Aug 1, 2024
afa7573
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Aug 2, 2024
f2913e9
Add merge_key to document resource for efficient updates in LanceDB
Pipboyguy Aug 2, 2024
ffe6584
Formatting
Pipboyguy Aug 2, 2024
0368018
Set default file size to 128MB
Pipboyguy Aug 2, 2024
29fa7fd
Merge branch 'refs/heads/devel' into 1587-lancedb-support-efficient-u…
Pipboyguy Aug 3, 2024
02704d5
Only use parquet loader file formats
Pipboyguy Aug 3, 2024
eae056a
Import pyarrow.parquet
Pipboyguy Aug 4, 2024
dc20a55
Remove recommended file size from LanceDB destination capabilities
Pipboyguy Aug 4, 2024
6ed540b
Update LanceDB client to use more efficient batch processing methods …
Pipboyguy Aug 4, 2024
0a9682f
Refactor unique identifier handling for LanceDB tables
Pipboyguy Aug 5, 2024
a99224a
Optimize UUID column generation for LanceDB tables
Pipboyguy Aug 5, 2024
895331b
Refactor LanceDBClient to use string type hints for Table
Pipboyguy Aug 5, 2024
a881e7a
Minor refactor
Pipboyguy Aug 5, 2024
7f245e2
Implement efficient schema update with Nullability support
Pipboyguy Aug 5, 2024
4fc73dd
Optimize orphaned chunks removal for large datasets
Pipboyguy Aug 5, 2024
9378f50
Projection pushdown
Pipboyguy Aug 6, 2024
9b14583
Format
Pipboyguy Aug 6, 2024
e21f61b
Prevent primary key and document ID hint conflict in merge disposition
Pipboyguy Aug 6, 2024
9725d0e
Add recommended file size for LanceDB destination
Pipboyguy Aug 7, 2024
5238c11
Improve comment clarity for projection push-down in LanceDB
Pipboyguy Aug 7, 2024
8e74815
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 7, 2024
c8f7468
Update to new load interface
Pipboyguy Aug 7, 2024
af56191
Remove unnecessary LanceDBLoadJob attributes
Pipboyguy Aug 8, 2024
7e33011
Change instance attributes to `run` method as variables
Pipboyguy Aug 8, 2024
e24e961
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 9, 2024
ee7dd02
Schedule follow up reference job
Pipboyguy Aug 9, 2024
df498ab
Add follow up lancedb remove orphan job skeleton
Pipboyguy Aug 10, 2024
c08f1ba
Write empty follow up file
Pipboyguy Aug 10, 2024
f9f94e3
Write parquet
Pipboyguy Aug 10, 2024
b374b0b
Add support for reference file format in LanceDB destination
Pipboyguy Aug 10, 2024
2ed3301
Handle parent table name resolution if it doesn't exist in Lance db r…
Pipboyguy Aug 10, 2024
cb0ba1f
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 12, 2024
99ac100
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 14, 2024
0694859
Refactor specialised orphan follow up job back to reference job
Pipboyguy Aug 15, 2024
ad3b750
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 16, 2024
4701c6e
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 17, 2024
537a2be
Refactor orphan removal for chunked documents
Pipboyguy Aug 17, 2024
3d25306
Fix dlt system table check for name instead of object
Pipboyguy Aug 18, 2024
2ee8da1
Implement staging methods
Pipboyguy Aug 18, 2024
2947d55
Override staging client methods
Pipboyguy Aug 19, 2024
ea5914c
Docs
Pipboyguy Aug 19, 2024
2e7daed
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 19, 2024
5018adf
Merge branch 'devel' into 1705-lancedb-orphan-removal-via-staging-del…
Pipboyguy Aug 19, 2024
1fcce51
Override staging client methods
Pipboyguy Aug 20, 2024
8849f11
Delete with inserts
Pipboyguy Aug 20, 2024
c7098fd
Keep with batch reader
Pipboyguy Aug 20, 2024
92ba767
Merge branch 'devel' into 1705-lancedb-orphan-removal-via-staging-del…
Pipboyguy Aug 21, 2024
abd9b01
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Aug 22, 2024
d8ddcae
Remove Lancedb client's staging implementation
Pipboyguy Aug 22, 2024
17137a6
Insert in memory arrow table. This will be optimized
Pipboyguy Aug 22, 2024
1b0b7bb
Merge branch 'devel' into 1705-lancedb-orphan-removal-via-staging-del…
Pipboyguy Aug 26, 2024
53d896a
Rename classes to the new job implementation classes
Pipboyguy Aug 26, 2024
26ba0f5
Use namedtuple for table chain to improve readability
Pipboyguy Aug 26, 2024
06e04d9
Remove orphans by loading all ancestor IDs simultaneously
Pipboyguy Aug 26, 2024
470315e
Fix doc_id adapter
Pipboyguy Aug 26, 2024
43eb5b4
Fix doc_id adapter
Pipboyguy Aug 26, 2024
40a5e73
Revert to previous
Pipboyguy Aug 26, 2024
04c8489
Merge branch 'devel' into 1705-lancedb-orphan-removal-via-staging-del…
Pipboyguy Aug 27, 2024
8cd6003
Revert "Remove orphans by loading all ancestor IDs simultaneously"
Pipboyguy Aug 27, 2024
dad103e
Remove doc_id hint
Pipboyguy Aug 27, 2024
15a0cf6
Infer merge key if not supplied from provided primary key
Pipboyguy Aug 27, 2024
e9462e3
Remove unused utility functions
Pipboyguy Aug 27, 2024
8af98d7
Remove LanceDB doc ID hints and use schema normalizer
Pipboyguy Aug 27, 2024
4195bb4
LanceDB writes strange code
Pipboyguy Aug 27, 2024
2573d3a
Minor Formatting
Pipboyguy Aug 27, 2024
19e9366
Merge branch 'devel' into remove-lancedb-doc-id-hints
Pipboyguy Aug 28, 2024
86c198c
Support compound primary and merge keys
Pipboyguy Aug 28, 2024
aa03930
Remove old comment
Pipboyguy Aug 28, 2024
fb72c03
Merge branch 'devel' into remove-lancedb-doc-id-hints
Pipboyguy Aug 29, 2024
d1e4173
- Change default vector column name to "vector" to conform with lance…
Pipboyguy Aug 29, 2024
613f5bc
Format and fix linting
Pipboyguy Aug 29, 2024
703c4a8
Add custom embedding function registration test
Pipboyguy Aug 29, 2024
c07c8fc
Spawn process in test to make sure registry can be deserialized from …
Pipboyguy Aug 29, 2024
8afa7e1
Simplify null string handling
Pipboyguy Aug 29, 2024
2395432
Change NULL string replacement with random string, doc clarification
Pipboyguy Aug 30, 2024
2507d22
Merge branch 'devel' into remove-lancedb-doc-id-hints
Pipboyguy Aug 31, 2024
9a347e6
Update default vector column name in docs
Pipboyguy Aug 31, 2024
4eda894
Merge branch 'devel' into remove-lancedb-doc-id-hints
Pipboyguy Sep 2, 2024
c0bedb7
Set `remove_orphans` flag to False on tests that don't require it
Pipboyguy Sep 2, 2024
99a4f44
Merge branch '1765-lancedb-destination-cant-query-generated-tables' i…
Pipboyguy Sep 2, 2024
5f0d620
Implement starter arrow string placeholder function
Pipboyguy Sep 2, 2024
b7f3076
Add test for empty arrow string element vectorised replacement utilit…
Pipboyguy Sep 2, 2024
e3a4ed0
Handle NULL values in addition to empty strings in arrow substitution…
Pipboyguy Sep 2, 2024
4ec894f
More efficient empty value replacement with canonical arrow usage
Pipboyguy Sep 2, 2024
9866874
Format
Pipboyguy Sep 2, 2024
7099d5f
Bump pyarrow version
Pipboyguy Sep 2, 2024
1c770d1
Use pa.nulls instead of [None]*len
Pipboyguy Sep 2, 2024
0b11ac7
Update tests
Pipboyguy Sep 2, 2024
e81736e
Invert remove orphans flag
Pipboyguy Sep 2, 2024
36abec7
Implement root table orphan deletion, only integer doc_ids
Pipboyguy Sep 2, 2024
5ceeda9
Cater for string ids as well in doc_id removal process
Pipboyguy Sep 2, 2024
a8f9c3b
Fix test with wrong primary key
Pipboyguy Sep 2, 2024
b3baf93
Just send list of ids as is. don't pc.compute on client end
Pipboyguy Sep 2, 2024
589071c
Extract schema matching into utils
Pipboyguy Sep 2, 2024
a86a13a
Add utils
Pipboyguy Sep 2, 2024
0eba25e
Pass all tests
Pipboyguy Sep 2, 2024
2b7f4c6
Minor format and cleanup
Pipboyguy Sep 2, 2024
105b388
Merge branch 'remove-lancedb-doc-id-hints' into 1587-lancedb-support-…
Pipboyguy Sep 2, 2024
ea36b00
Docs
Pipboyguy Sep 3, 2024
2010722
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Sep 5, 2024
81eaea9
Amend replace test to test with large number of records to catch race…
Pipboyguy Sep 5, 2024
f6d243a
Fix replace race conditions by delegating truncation to dlt
Pipboyguy Sep 5, 2024
3521975
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Sep 6, 2024
e280001
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Sep 8, 2024
f32d4cd
Update lock file
Pipboyguy Sep 8, 2024
a804e03
Merge branch 'devel' into 1587-lancedb-support-efficient-update-strat…
Pipboyguy Sep 24, 2024
7bd2e9c
Refactor type mapping and schema handling in LanceDB client
Pipboyguy Sep 24, 2024
d8a6b75
Change 'complex' column type to 'json' in LanceDB client
Pipboyguy Sep 24, 2024
a5a1657
update lock file
Pipboyguy Sep 24, 2024
3 changes: 1 addition & 2 deletions dlt/destinations/impl/lancedb/configuration.py
@@ -59,6 +59,7 @@ class LanceDBClientOptions(BaseConfiguration):
     "sentence-transformers",
     "huggingface",
     "colbert",
+    "ollama",
 ]


@@ -92,8 +93,6 @@ class LanceDBClientConfiguration(DestinationClientDwhConfiguration):
     Make sure it corresponds with the associated embedding model's dimensionality."""
     vector_field_name: str = "vector"
     """Name of the special field to store the vector embeddings."""
-    id_field_name: str = "id__"
-    """Name of the special field to manage deduplication."""
     sentinel_table_name: str = "dltSentinelTable"
     """Name of the sentinel table that encapsulates datasets. Since LanceDB has no
     concept of schemas, this table serves as a proxy to group related dlt tables together."""
8 changes: 6 additions & 2 deletions dlt/destinations/impl/lancedb/factory.py
@@ -26,8 +26,8 @@ class lancedb(Destination[LanceDBClientConfiguration, "LanceDBClient"]):

     def _raw_capabilities(self) -> DestinationCapabilitiesContext:
         caps = DestinationCapabilitiesContext()
-        caps.preferred_loader_file_format = "jsonl"
-        caps.supported_loader_file_formats = ["jsonl"]
+        caps.preferred_loader_file_format = "parquet"
+        caps.supported_loader_file_formats = ["parquet", "reference"]
         caps.type_mapper = LanceDBTypeMapper

         caps.max_identifier_length = 200
@@ -42,6 +42,10 @@ def _raw_capabilities(self) -> DestinationCapabilitiesContext:
         caps.timestamp_precision = 6
         caps.supported_replace_strategies = ["truncate-and-insert"]

+        caps.recommended_file_size = 128_000_000
+
+        caps.supported_merge_strategies = ["upsert"]
+
         return caps

     @property
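
For readers unfamiliar with the term, the `upsert` merge strategy declared in `supported_merge_strategies` can be pictured roughly as follows. This is a minimal, library-free sketch: the `upsert` helper and the `doc_id` key are hypothetical names chosen for illustration, not dlt's or LanceDB's actual implementation.

```python
from typing import Any, Dict, Iterable, List, Tuple


def upsert(
    existing: List[Dict[str, Any]],
    incoming: Iterable[Dict[str, Any]],
    key_columns: Tuple[str, ...],
) -> List[Dict[str, Any]]:
    """Merge `incoming` rows into `existing` by key (hypothetical helper)."""
    # Index existing rows by their key tuple; dicts preserve insertion order.
    merged = {tuple(row[c] for c in key_columns): row for row in existing}
    for row in incoming:
        # A matching key replaces the old row in place; a new key is appended.
        merged[tuple(row[c] for c in key_columns)] = row
    return list(merged.values())


table = [{"doc_id": 1, "text": "old"}, {"doc_id": 2, "text": "keep"}]
batch = [{"doc_id": 1, "text": "new"}, {"doc_id": 3, "text": "added"}]
result = upsert(table, batch, key_columns=("doc_id",))
```

Passing the key as a tuple of column names mirrors the compound merge keys mentioned in the commit list; a single-column key is just the one-element case.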
36 changes: 32 additions & 4 deletions dlt/destinations/impl/lancedb/lancedb_adapter.py
@@ -1,16 +1,20 @@
-from typing import Any
+from typing import Any, Dict

 from dlt.common.schema.typing import TColumnNames, TTableSchemaColumns
 from dlt.destinations.utils import get_resource_for_adapter
 from dlt.extract import DltResource
+from dlt.extract.items import TTableHintTemplate


 VECTORIZE_HINT = "x-lancedb-embed"
+NO_REMOVE_ORPHANS_HINT = "x-lancedb-remove-orphans"


 def lancedb_adapter(
     data: Any,
     embed: TColumnNames = None,
+    merge_key: TColumnNames = None,
+    no_remove_orphans: bool = False,
 ) -> DltResource:
     """Prepares data for the LanceDB destination by specifying which columns should be embedded.

@@ -20,6 +24,10 @@ def lancedb_adapter(
             object.
         embed (TColumnNames, optional): Specify columns to generate embeddings for.
             It can be a single column name as a string, or a list of column names.
+        merge_key (TColumnNames, optional): Specify columns to merge on.
+            It can be a single column name as a string, or a list of column names.
+        no_remove_orphans (bool): Specify whether to skip removing orphaned records in child
+            tables with no parent records after merges. Defaults to False, meaning orphans
+            are removed to maintain referential integrity.

     Returns:
         DltResource: A resource with applied LanceDB-specific hints.
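
The `no_remove_orphans` docstring refers to child-table records left without a parent after a merge. Conceptually that cleanup can be sketched as below. This excerpt does not show how the PR itself detects orphans, so the `remove_orphans` function and the `id` / `parent_id` column names are assumptions made purely for illustration.

```python
from typing import Any, Dict, List


def remove_orphans(
    parents: List[Dict[str, Any]],
    children: List[Dict[str, Any]],
    parent_key: str = "id",        # hypothetical parent identifier column
    child_ref: str = "parent_id",  # hypothetical child-to-parent reference column
) -> List[Dict[str, Any]]:
    """Keep only the child rows whose parent still exists."""
    # Collect the identifiers of all surviving parent rows.
    live_ids = {row[parent_key] for row in parents}
    # A child is an orphan when its reference points at no surviving parent.
    return [row for row in children if row[child_ref] in live_ids]


documents = [{"id": "a"}]
chunks = [
    {"parent_id": "a", "chunk": "still referenced"},
    {"parent_id": "b", "chunk": "parent was replaced"},
]
kept = remove_orphans(documents, chunks)
```

With the default `no_remove_orphans=False`, a cleanup of this kind would run after each merge; setting the flag to `True` would leave rows such as the second chunk in place.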
@@ -34,6 +42,7 @@ def lancedb_adapter(
     """
     resource = get_resource_for_adapter(data)

+    additional_table_hints: Dict[str, TTableHintTemplate[Any]] = {}
     column_hints: TTableSchemaColumns = {}

     if embed:
@@ -50,9 +59,28 @@ def lancedb_adapter(
                 VECTORIZE_HINT: True,  # type: ignore[misc]
             }

-    if not column_hints:
-        raise ValueError("A value for 'embed' must be specified.")
+    if merge_key:
+        if isinstance(merge_key, str):
+            merge_key = [merge_key]
+        if not isinstance(merge_key, list):
+            raise ValueError(
+                "'merge_key' must be a list of column names or a single column name as a string."
+            )
+
+        for column_name in merge_key:
+            column_hints[column_name] = {
+                "name": column_name,
+                "merge_key": True,
+            }
+
+    additional_table_hints[NO_REMOVE_ORPHANS_HINT] = no_remove_orphans
+
+    if column_hints or additional_table_hints:
+        resource.apply_hints(columns=column_hints, additional_table_hints=additional_table_hints)
     else:
-        resource.apply_hints(columns=column_hints)
+        raise ValueError(
+            "You must provide at least either the 'embed' or 'merge_key' or 'no_remove_orphans'"
+            " argument if using the adapter."
+        )

     return resource
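
To make the resulting hints concrete, here is a standalone sketch of the hint-building logic in this hunk, written without importing dlt so it runs on its own. `build_hints`, `_as_list` and the `ColumnNames` alias are hypothetical stand-ins, and the handling of `embed` is assumed to mirror that of `merge_key`, since those lines are collapsed in the diff.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

VECTORIZE_HINT = "x-lancedb-embed"
NO_REMOVE_ORPHANS_HINT = "x-lancedb-remove-orphans"

# Simplified stand-in for dlt's TColumnNames type.
ColumnNames = Optional[Union[str, List[str]]]


def _as_list(names: ColumnNames, arg: str) -> List[str]:
    """Normalise a single column name or a list of names to a list."""
    if not names:
        return []
    if isinstance(names, str):
        return [names]
    if not isinstance(names, list):
        raise ValueError(f"'{arg}' must be a list of column names or a single column name.")
    return names


def build_hints(
    embed: ColumnNames = None,
    merge_key: ColumnNames = None,
    no_remove_orphans: bool = False,
) -> Tuple[Dict[str, Dict[str, Any]], Dict[str, Any]]:
    """Return (column_hints, table_hints) following the logic shown in the diff."""
    column_hints: Dict[str, Dict[str, Any]] = {}
    for name in _as_list(embed, "embed"):
        column_hints[name] = {"name": name, VECTORIZE_HINT: True}
    for name in _as_list(merge_key, "merge_key"):
        column_hints[name] = {"name": name, "merge_key": True}
    # As in the diff, the table-level hint is set unconditionally.
    table_hints = {NO_REMOVE_ORPHANS_HINT: no_remove_orphans}
    return column_hints, table_hints


columns, table = build_hints(embed="text", merge_key="doc_id")
```

One observation from this sketch: because the orphan hint is always written, the table-hints dictionary is never empty, so in the hunk as shown the `else` branch that raises `ValueError` appears to be unreachable.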