Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: BigQuery common and provisioner classes #138

Merged
merged 25 commits into from
Apr 2, 2024

Conversation

man8pr
Copy link
Contributor

@man8pr man8pr commented Mar 8, 2024

What this PR changes/adds

Added:

  • extensions/common/gcp/gcp-core/.../bigquery : common classes including BQ DataAddress, service client, and theBigQuery DataAddress Validator Extension to register DataAddress validators
  • extensions/control-plane/provision/provision-bigquery : BQ provisioner for generating access token for data plane sink, including the BigQuery Provision Extension to register the provisioner

Why it does that

The PR adds first basic version of the BigQuery extension to exchange data to and from BigQuery tables.

Further notes

  • To be added in following PR:

    • BigQuery data plane
  • Limitations for this version:

    • provisioner requires destination table to exist, doesn't create new table if it doesn't
    • data is committed to the destination table only if no error occurred during the data transmission
    • no transmission retry in case of error
    • results are fetched via pages (size is fixed, 4 rows), max. size for serialized page data is 16 KiB
    • credentials for BQ services have default 1hr lifetime

Linked Issue(s)

Closes #50

Added:
- extensions/common/gcp/gcp-core/.../bigquery : common classes including BQ DataAddress, service client, and theBigQuery DataAddress Validator Extension to register DataAddress validators
- extensions/control-plane/provision/provision-bigquery : BQ provisioner for generating access token for data plane sink, including the BigQuery Provision Extension to register the provisioner

Limitations:
    - provisioner requires destination table to exist, doesn't create new table if it doesn't
    - data is committed to the destination table only if no error occurred during the data transmission
    - no transmission retry in case of error
    - results are fetched via pages (size is fixed, 4 rows), max. size for serialized page data is 16 KiB
@man8pr man8pr added the enhancement New feature or request label Mar 8, 2024
@ndr-brt ndr-brt self-requested a review March 8, 2024 07:39
@paullatzelsperger
Copy link
Member

@man8pr I added the "Closes" keyword to the PR description, to automatically link the PR with the issue

Copy link
Member

@paullatzelsperger paullatzelsperger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool stuff overall, comments inline

testRunSinkQuery(false);
}

private void testRunSourceQuery(String sinkType) throws InterruptedException, IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: method ordering should be such, that test methods go first, then protected methods then private methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated ordering

}

@Test
void testSinkValidatorShouldNotValidateIncompleteAddresses() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i'd prefer having one Test method per scenario, i.e. every violation should have its own method. That greatly improves maintenance and readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split the method into single violation methods


import static org.assertj.core.api.Assertions.assertThat;

public class PartAssert extends AbstractAssert<PartAssert, Part> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc. and does this have to be public? seems like package-private would be fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


@Override
public @Nullable
ResourceDefinition generate(DataRequest dataRequest, Policy policy) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heads up: the signature for this interface changed from DataRequest -> TransferProcess. you should rebase onto latest main

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, updated interface

Copy link
Member

@ndr-brt ndr-brt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I focused mainly on the service implementation, the big issue here is the thread safety and how this class is used.

…n error fix, ExecutorService for running the query

- Split BigQueryService and BigQueryServiceImpl in 3 parts, provision, source and sink, to address the specific needs
- BigQueryDataAddress not serializable
- BigQuerySourceServiceImpl using ExecutorService to track threading and thread pools
- BigQuerySourceServiceImplTest.java less mocks to test the runSourceQuery function
- BigQueryConsumerResourceDefinitionGenerator and test updated to reflect the new API
Copy link
Member

@ndr-brt ndr-brt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thread-safety issues are still there.

As a general rule, I suggest to remove from this PR all the code that's only used by tests (because it's dead-code) and to put it into a separate pr that will show their usage within an integration/e2e test, that will make the review easier.

@man8pr man8pr requested a review from ndr-brt March 20, 2024 10:29
man8pr and others added 5 commits March 20, 2024 11:58
…h BigQueryFactory

- replaced BigQueryProviderService* classes with BigQueryFactory / BigQuery for simpler implementation
- removed TypeManager member class from Provisioner (not used)
- removed unused classes in the test implementation (will be bundled with the data plane classes)
Comment on lines 95 to 98
var credentialProject = project;
if (credentialProject == null) {
credentialProject = gcpConfiguration.projectId();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this credentialProject could be a field of the service and initiated in the build() method, so there'll be no reason to initialize it twice (line 95 and line 131)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class has been removed

this.monitor = monitor;
}

private void initCredentials() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method could return a GoogleCredentials without setting fields (like it's done at line 111 and 116). Doing this these side effects will disappear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class has been removed

Comment on lines 48 to 53
reset(gcpConfiguration);
reset(monitor);
reset(bigQuery);
reset(table);
reset(typeManager);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reset is not needed here, please take a look at other tests around the EDC repositories to see how we mock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class has been removed

.build();

// New seed for each test.
random = new Random();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class has been removed

}

if (bqFactory == null) {
bqFactory = new BigQueryFactoryImpl();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the factory is instantiated here? I'd say that it should be done in the extension and passed as collaborator of the provisioner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

@Override
public BigQuery createBigQuery(GcpConfiguration gcpConfiguration, Monitor monitor) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both GcpConfiguration and Monitor should be passed in construction, there's no point in passing them here.
Plus, if the method is not depending on "flow data" like "projectId" (as it was in the previous implementation), there's no need at all to have this factory, because the BigQuery instance build by the factory will always be the same, so it could be instantiated by the extension and injected directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, BigQuery instance is different if the transfer request includes the name of a service account to be impersonated for the access

// TODO update target with the generated table name.

try {
String serviceAccountEmail = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this variable is not used, please remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

when(table.exists()).thenReturn(true);
when(bigQuery.getTable(TEST_TARGET.getTableId())).thenReturn(table);

var response = bigQueryProvisioner.provision(resourceDefinition, policy).join();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid using join and use assertJ async capabilities (assertThat(future).succeedsWithin(...))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (thanks for the hint, the succeedsWithin is really good)

Comment on lines 138 to 147
assertThat(response.succeeded()).isTrue();
assertThat(response.getContent().getResource()).isInstanceOfSatisfying(BigQueryProvisionedResource.class, resource -> {
assertThat(resource.getId()).isEqualTo(RESOURCE_ID);
assertThat(resource.getTransferProcessId()).isEqualTo(TRANSFER_ID);
assertThat(resource.getProject()).isEqualTo(TEST_PROJECT);
assertThat(resource.getDataset()).isEqualTo(TEST_DATASET);
assertThat(resource.getTable()).isEqualTo(TEST_TABLE);
assertThat(resource.hasToken()).isTrue();
assertThat(resource.getCustomerName()).isEqualTo(CUSTOMER_NAME);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the token creation logic is not tested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests for testing the creation of the token

@man8pr man8pr requested a review from ndr-brt March 26, 2024 18:51
.build().getService();
}

private BigQuery createBigQuery() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused, please delete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.isEqualTo(true);
}

private class BigQueryFactoryTest implements BigQueryFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use a mockito mock instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return new BigQueryTarget(project, dataset, table);
}

public static class Builder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this builder is not really needed as it's relying on a single newInstance method + build, just use the constructor instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +96 to +103
GcpAccessToken token = null;

if (serviceAccount != null) {
token = iamService.createAccessToken(serviceAccount);
} else {
serviceAccount = ADC_SERVICE_ACCOUNT;
token = iamService.createDefaultAccessToken();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this piece of code is duplicated in the GcsProvisioner, could be the iamService.createAccessToken that detects if the service account passed is a default one and in that case it will return the default access token?
This will simplify things here and give the knowledge about accounts to the iam service (that's the one that should have them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, delegating the responsibility to IAM is good. Since I would change then both BQ and GCS provisioner, along with IAM, does it make sense to have this in a dedicated PR?

}
}

private void provisionSucceeds(String serviceAccountName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a safe way to test, because you pretty much replicated the production code logic, so it's more error prone, please structure the tests in the way that they are understandable from the test method itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good point, as exchanged, I make a first try, by clearly showing the setup using default credentials, and the setup for testing the use of specific service account

@Override
protected void verify() {
super.verify();
// TODO verify required fields.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please solve this TODO as it seems not too hard to do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/**
* Schema for BigQuery service.
*/
public interface BigQueryService {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public interface BigQueryService {
public interface BigQueryServiceSchema {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

… method

- better grouping of test set up when using default credentials and when using specific service account
@man8pr man8pr requested a review from ndr-brt March 28, 2024 13:54
Copy link
Member

@ndr-brt ndr-brt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking forward for the "service account refactoring", please solve DEPENDENCIES and I will merge this

@man8pr
Copy link
Contributor Author

man8pr commented Apr 2, 2024

looking forward for the "service account refactoring", please solve DEPENDENCIES and I will merge this

Sure, on the refactoring now, it shouldn't be a large change

@ndr-brt ndr-brt merged commit aa67ca9 into eclipse-edc:main Apr 2, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BigQuery extension
3 participants