-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
minor change iceberg source interface #178
base: main
Are you sure you want to change the base?
Conversation
@ashvina @the-other-tim-brown LMK what you think ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall lgtm
Assertions.assertEquals(0, commitsBacklog.getInFlightInstants().size()); | ||
Assertions.assertArrayEquals(snapshots, commitsBacklog.getCommitsToProcess().toArray()); | ||
Long[] snapshotIds = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just use a list here?
@@ -52,7 +52,7 @@ | |||
|
|||
@Log4j2 | |||
@Builder | |||
public class IcebergSourceClient implements SourceClient<Snapshot> { | |||
public class IcebergSourceClient implements SourceClient<Long> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is a bit confusing. Why is a generic required in the contract if all commits have to be identified by a long id. If the justification is alignment with delta and hudi, then all future sources will need to ensure a long id for commit is available. IMO, a snapshot is a cleaner and less ambiguous representation for a source client. Moreover, just because serialization is efficient with a long, the entire contract should not be changed.
I think such changes should have more compelling reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is allowing us to move towards your suggestion here: #135 - Maybe we have misunderstood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashvina The Generics is still required because for Hudi it is HoodieInstant, For Delta it is Long in the current state. The entity in the generics semantically represents a way to get source table state at that point. If possible it is better to choose a leaner entity(We can add to docs) and snapshot Id for Iceberg fits this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be careful not to rush this change. These abstractions are fundamental to OneTable and will affect its future extensibility. My main concern is that we are altering the model because of how one of the components behaves, namely the serialization of the last and pending commits. Let me know if I misunderstood this.
In my opinion, we need a new representation for commits, OneCommit. Just as a table's state is mapped to OneTable, a source client implementation should map the metadata of a table format's commit to OneCommit.
Regarding the tracking of the commit backlog, if OneCommit is available, the serializer or any other component can decide independently what details are best and necessary to serialize. This should cleanly decouple the client's implementation and the behavior of core OneTable components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's work through the round trip conversion testing to see where that leads us. I think that will allow us to have a better vision for where we should be going with respect to the commit representation and serialization
Important Read
What is the purpose of the pull request
Minor change in Iceberg source interface to consider snapshotId as the checkpoint as it aligns with Hudi Instant and Delta version number and has little serialization overhead.
Brief change log
Minor change in Iceberg source interface to consider snapshotId as the checkpoint as it aligns with Hudi Instant and Delta version number and has little serialization overhead.
Verify this pull request
(Please pick either of the following options)
This pull request is already covered by existing tests, such as (please describe tests).