minor change iceberg source interface #178
Open · vamshigv wants to merge 1 commit into main from iceberg_snapshot_id
```diff
@@ -86,7 +86,7 @@ void getTableTest(@TempDir Path workingDir) throws IOException {
     IcebergSourceClient client = clientProvider.getSourceClientInstance(sourceTableConfig);

     Snapshot snapshot = catalogSales.currentSnapshot();
-    OneTable oneTable = client.getTable(snapshot);
+    OneTable oneTable = client.getTable(snapshot.snapshotId());
     Assertions.assertNotNull(oneTable);
     Assertions.assertEquals(TableFormat.ICEBERG, oneTable.getTableFormat());
     Assertions.assertTrue(oneTable.getName().endsWith("catalog_sales"));
```
```diff
@@ -121,7 +121,7 @@ public void getSchemaCatalogTest(@TempDir Path workingDir) throws IOException {
     IcebergSourceClient client = clientProvider.getSourceClientInstance(sourceTableConfig);
     IcebergSourceClient spyClient = spy(client);

-    SchemaCatalog schemaCatalog = spyClient.getSchemaCatalog(null, iceCurrentSnapshot);
+    SchemaCatalog schemaCatalog = spyClient.getSchemaCatalog(null, iceCurrentSnapshot.snapshotId());
     Assertions.assertNotNull(schemaCatalog);
     Map<SchemaVersion, OneSchema> schemas = schemaCatalog.getSchemas();
     Assertions.assertEquals(1, schemas.size());
```
```diff
@@ -156,8 +156,9 @@ public void testGetCurrentSnapshot(@TempDir Path workingDir) throws IOException
     Assertions.assertEquals(
         String.valueOf(iceCurrentSnapshot.snapshotId()), oneSnapshot.getVersion());
     Assertions.assertNotNull(oneSnapshot.getTable());
-    verify(spyClient, times(1)).getTable(iceCurrentSnapshot);
-    verify(spyClient, times(1)).getSchemaCatalog(oneSnapshot.getTable(), iceCurrentSnapshot);
+    verify(spyClient, times(1)).getTable(iceCurrentSnapshot.snapshotId());
+    verify(spyClient, times(1))
+        .getSchemaCatalog(oneSnapshot.getTable(), iceCurrentSnapshot.snapshotId());
     verify(spyPartitionConverter, times(5)).toOneTable(any(), any(), any());
     verify(spyDataFileExtractor, times(5)).fromIceberg(any(), any(), any());
```
```diff
@@ -315,9 +316,11 @@ private void validatePendingCommits(Table table, Snapshot lastSync, Snapshot...
             .lastSyncInstant(Instant.ofEpochMilli(lastSync.timestampMillis()))
             .build();
     IcebergSourceClient sourceClient = getIcebergSourceClient(table);
-    CommitsBacklog<Snapshot> commitsBacklog = sourceClient.getCommitsBacklog(instant);
+    CommitsBacklog<Long> commitsBacklog = sourceClient.getCommitsBacklog(instant);
     Assertions.assertEquals(0, commitsBacklog.getInFlightInstants().size());
-    Assertions.assertArrayEquals(snapshots, commitsBacklog.getCommitsToProcess().toArray());
+    Long[] snapshotIds =
+        Arrays.stream(snapshots).map(snapshot -> snapshot.snapshotId()).toArray(Long[]::new);
+    Assertions.assertArrayEquals(snapshotIds, commitsBacklog.getCommitsToProcess().toArray());
   }
```

Review comment on the `Long[] snapshotIds =` line: can we just use a list here?
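The reviewer asks whether a list could replace the array here. A minimal, self-contained sketch of both comparison styles, using a hypothetical `FakeSnapshot` stand-in for Iceberg's `Snapshot` (only `snapshotId()` matters for this check); `List.equals` is order-sensitive and element-wise, so it verifies the same property as `assertArrayEquals`:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for an Iceberg Snapshot; only snapshotId() is needed.
class FakeSnapshot {
    private final long id;
    FakeSnapshot(long id) { this.id = id; }
    long snapshotId() { return id; }
}

public class ListAssertionSketch {
    public static void main(String[] args) {
        FakeSnapshot[] snapshots = {new FakeSnapshot(1L), new FakeSnapshot(2L)};
        // Stand-in for what commitsBacklog.getCommitsToProcess() would return.
        List<Long> commitsToProcess = Arrays.asList(1L, 2L);

        // Array-based comparison, as in the diff above:
        Long[] snapshotIds =
            Arrays.stream(snapshots).map(s -> s.snapshotId()).toArray(Long[]::new);
        System.out.println(Arrays.equals(snapshotIds, commitsToProcess.toArray()));

        // List-based comparison, as the reviewer suggests:
        List<Long> expectedIds =
            Arrays.stream(snapshots).map(FakeSnapshot::snapshotId).collect(Collectors.toList());
        System.out.println(expectedIds.equals(commitsToProcess));
    }
}
```

With JUnit, the list form would read `Assertions.assertEquals(expectedIds, commitsToProcess)` and produce a clearer failure message than the array form.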
```diff
@@ -329,7 +332,7 @@ private static long getDataFileCount(Table catalogSales) throws IOException {
   private void validateTableChangeDiffSize(
       Table table, Snapshot snapshot, int addedFiles, int removedFiles) {
     IcebergSourceClient sourceClient = getIcebergSourceClient(table);
-    TableChange tableChange = sourceClient.getTableChangeForCommit(snapshot);
+    TableChange tableChange = sourceClient.getTableChangeForCommit(snapshot.snapshotId());
     Assertions.assertEquals(addedFiles, tableChange.getFilesDiff().getFilesAdded().size());
     Assertions.assertEquals(removedFiles, tableChange.getFilesDiff().getFilesRemoved().size());
   }
```
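Taken together, the hunks change the commit identifier the Iceberg source client exposes from a `Snapshot` object to its long snapshot id. A simplified, hypothetical sketch of that contract shape — `SourceClient` here is a stand-in with made-up method signatures, not the project's real interface:

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of a generic source-client contract; the real
// OneTable interfaces have more methods and different signatures.
interface SourceClient<COMMIT> {
    String getTable(COMMIT commit);              // stand-in return type for OneTable
    List<COMMIT> getCommitsBacklog(long lastSyncMillis);
}

// Before this PR the Iceberg client was effectively SourceClient<Snapshot>;
// after it, commits are identified by the snapshot's long id.
class IcebergSourceClientSketch implements SourceClient<Long> {
    public String getTable(Long snapshotId) {
        return "table-at-snapshot-" + snapshotId;
    }
    public List<Long> getCommitsBacklog(long lastSyncMillis) {
        return Arrays.asList(100L, 101L);        // fabricated ids for illustration
    }
}

public class InterfaceSketch {
    public static void main(String[] args) {
        SourceClient<Long> client = new IcebergSourceClientSketch();
        System.out.println(client.getTable(100L));
        System.out.println(client.getCommitsBacklog(0L));
    }
}
```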
Review comment:
This change is a bit confusing. Why is a generic required in the contract if all commits have to be identified by a long id? If the justification is alignment with Delta and Hudi, then all future sources will need to ensure a long commit id is available. IMO, a snapshot is a cleaner and less ambiguous representation for a source client. Moreover, just because serialization is efficient with a long, the entire contract should not be changed. I think such changes should have more compelling reasons.
Reply:
This is allowing us to move toward your suggestion here: #135 — maybe we have misunderstood.
Reply:
@ashvina The generic is still required because, in the current state, the commit type is HoodieInstant for Hudi and Long for Delta. The entity in the generic semantically represents a way to get the source table's state at that point. When possible it is better to choose a leaner entity (we can add this to the docs), and the snapshot id fits this for Iceberg.
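To illustrate the point about heterogeneous commit types, here is a hypothetical sketch in which each format's client binds a different type parameter under one generic contract; `FakeHoodieInstant`, `CommitSource`, and all values are fabricated stand-ins for illustration only:

```java
// Hypothetical stand-in for Hudi's HoodieInstant, identified by a string timestamp.
final class FakeHoodieInstant {
    final String timestamp;
    FakeHoodieInstant(String timestamp) { this.timestamp = timestamp; }
    public String toString() { return timestamp; }
}

// Simplified stand-in for the generic contract discussed above.
interface CommitSource<COMMIT> {
    COMMIT latestCommit();
}

class HudiSourceSketch implements CommitSource<FakeHoodieInstant> {
    public FakeHoodieInstant latestCommit() { return new FakeHoodieInstant("20231101120000"); }
}

class DeltaSourceSketch implements CommitSource<Long> {
    public Long latestCommit() { return 42L; }  // a Delta table version number
}

class IcebergSourceSketch implements CommitSource<Long> {
    public Long latestCommit() { return 8333017788700497002L; }  // a snapshot id
}

public class CommitTypeSketch {
    public static void main(String[] args) {
        System.out.println(new HudiSourceSketch().latestCommit());
        System.out.println(new DeltaSourceSketch().latestCommit());
        System.out.println(new IcebergSourceSketch().latestCommit());
    }
}
```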
Reply:
I think we should be careful not to rush this change. These abstractions are fundamental to OneTable and will affect its future extensibility. My main concern is that we are altering the model because of how one of the components behaves, namely the serialization of the last and pending commits. Let me know if I misunderstood this.
In my opinion, we need a new representation for commits, OneCommit. Just as a table's state is mapped to OneTable, a source client implementation should map the metadata of a table format's commit to OneCommit.
Regarding the tracking of the commit backlog: if OneCommit is available, the serializer or any other component can decide independently which details are best and necessary to serialize. This would cleanly decouple the client's implementation from the behavior of core OneTable components.
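The OneCommit idea above might look roughly like the following hypothetical sketch; none of these class or field names exist in the codebase, they only illustrate the proposed decoupling between the source client's mapping and the serializer's choices:

```java
// Hypothetical OneCommit: a format-neutral commit representation the source
// client would populate from its native commit metadata.
final class OneCommit {
    final String identifier;     // format-specific id rendered as a string
    final long timestampMillis;  // commit time, common to all formats
    OneCommit(String identifier, long timestampMillis) {
        this.identifier = identifier;
        this.timestampMillis = timestampMillis;
    }
}

// The serializer decides independently what to persist; it no longer needs
// to know whether the source identifies commits by Snapshot, Long, or instant.
class CommitSerializerSketch {
    String serialize(OneCommit commit) {
        return commit.identifier + "@" + commit.timestampMillis;
    }
}

public class OneCommitSketch {
    public static void main(String[] args) {
        // Fabricated snapshot id and timestamp, for illustration only.
        OneCommit commit = new OneCommit("8333017788700497002", 1700000000000L);
        System.out.println(new CommitSerializerSketch().serialize(commit));
    }
}
```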
Reply:
Let's work through the round-trip conversion testing to see where that leads us. I think that will give us a better vision for where we should be going with respect to the commit representation and serialization.