Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behavioral change to Avro codecs and schema handling #3238

Merged
merged 2 commits into from
Aug 24, 2023

Conversation

dlvenable
Copy link
Member

@dlvenable dlvenable commented Aug 24, 2023

Description

This PR makes a fundamental change to how the Avro-based codecs (Avro and Parquet) handle mapping Events to Avro records.

There are now two cases which are handled differently:

  1. If the pipeline author defines a schema, this takes precedence over the incoming data. Thus, extra fields will be ignored.
  2. If the schema is auto-generated, then future events must match the schema. Thus, extra fields will drop the events.

Issues Resolved

N/A

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…hema to get data, rather than iterate over the entire record. Fix Avro arrays which were only supporting arrays of strings previously.

Signed-off-by: David Venable <[email protected]>
…ely on the schema rather than the incoming event. If the schema is auto-generated, then the incoming event data must continue to match.

Signed-off-by: David Venable <[email protected]>
Copy link
Collaborator

@asifsmohammed asifsmohammed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the behavior if the incoming events has less keys than the schema?

@dlvenable
Copy link
Member Author

dlvenable commented Aug 24, 2023

What would be the behavior if the incoming events has less keys than the schema?

It depends on the schema.

If the schema does not allow it to be null, then it will fail to write the Avro record. However, if the schema allows null, it will be saved as null.

This test should cover it: convertEventDataToAvro_skips_non_present_fields.

for (String key : getKeyNames(schema, eventData, codecContext, rootOfData)) {
final Schema.Field field = schema.getField(key);
if (field == null) {
throw new RuntimeException("The event has a key ('" + key + "') which is not included in the schema.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior for this exception? We include it as one of the Events that has to be dropped?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will show in the sampled log as merged in #3242.

@dlvenable dlvenable merged commit f17e833 into opensearch-project:main Aug 24, 2023
26 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 24, 2023
Change the behavior of Avro-based codecs. When a schema is defined, rely on the schema rather than the incoming event. If the schema is auto-generated, then the incoming event data must continue to match. Fix Avro arrays which were only supporting arrays of strings previously.

Signed-off-by: David Venable <[email protected]>
(cherry picked from commit f17e833)
dlvenable added a commit that referenced this pull request Aug 24, 2023
Change the behavior of Avro-based codecs. When a schema is defined, rely on the schema rather than the incoming event. If the schema is auto-generated, then the incoming event data must continue to match. Fix Avro arrays which were only supporting arrays of strings previously.

Signed-off-by: David Venable <[email protected]>
(cherry picked from commit f17e833)

Co-authored-by: David Venable <[email protected]>
@dlvenable dlvenable deleted the avro-schema-first branch November 10, 2023 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants