
feat(go/adbc/driver/snowflake): improve GetObjects performance and semantics #2254

Merged: 14 commits merged into main, Oct 17, 2024

Conversation

zeroshade (Member):

Fixes #2171

Improves the channel handling and query building for metadata conversion to Arrow for better performance.

For all cases except when retrieving Column metadata we'll now utilize SHOW queries and build the patterns into those queries. This allows those GetObjects calls with appropriate depths to be called without having to specify a current database or schema.

@github-actions bot added this to the ADBC Libraries 15 milestone Oct 14, 2024
zeroshade (Member, Author):

CC @davidhcoe Can you take a look at this and confirm it fixes your issues with the exception of the All and Columns depths?

Comment on lines +96 to +99
```go
if before[len(before)-1] != '\\' {
	b.WriteByte('\\')
}
```
Member:

I suppose this is to handle pre-escaped characters? But what if the escape is itself escaped? (Or is that not allowed?)

lidavidm (Member), Oct 15, 2024:

I guess not from our own spec 😅, we specify escapes aren't supported at all

Contributor:

> I guess not from our own spec 😅, we specify escapes aren't supported at all

Yeah, I pointed this out in #1508

Member:

Maybe it's time I open a branch for 1.2.0...

Contributor:

> Maybe it's time I open a branch for 1.2.0...

I've started some work on that in conjunction with multiple results sets. https://github.com/CurtHagenlocher/arrow-adbc/tree/MoreResults

zeroshade (Member, Author):

Yea, this is intended to handle pre-escaped characters. The logic is taken from Snowflake's JDBC driver; I figured handling one level of escaping was sufficient given our current spec. 😄
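As a rough illustration of the escaping being discussed, here is a minimal sketch of embedding a caller-supplied pattern into a `SHOW ... LIKE '<pattern>'` clause. Note this is a hypothetical helper, not the driver's actual `escapeSingleQuoteForLike`: it only doubles single quotes so the embedded literal stays valid SQL, while `_` and `%` pass through and keep their wildcard meaning. The pre-escape check from the snippet above (looking for a preceding backslash) would be an additional step.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteForLike doubles single quotes so a user-supplied pattern can be
// embedded safely in a SQL string literal; wildcards _ and % pass through.
// Hypothetical sketch only, not the driver's implementation.
func quoteForLike(pattern string) string {
	return strings.ReplaceAll(pattern, "'", "''")
}

func main() {
	// O'REILLY_% contains both a quote and wildcards.
	fmt.Println("SHOW TERSE DATABASES LIKE '" + quoteForLike("O'REILLY_%") + "'")
}
```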

Member:

> Maybe it's time I open a branch for 1.2.0...
>
> I've started some work on that in conjunction with multiple results sets. https://github.com/CurtHagenlocher/arrow-adbc/tree/MoreResults

Oh this is great!

go/adbc/driver/snowflake/connection.go (outdated; resolved)
Comment on lines 166 to 176
```go
gQueryIDs.Go(func() error {
	return conn.Raw(func(driverConn any) (err error) {
		query := "SHOW TERSE /* ADBC:getObjectsDBSchemas */ DATABASES"
		if catalog != nil && len(*catalog) > 0 && *catalog != "%" && *catalog != ".*" {
			query += " LIKE '" + escapeSingleQuoteForLike(*catalog) + "'"
		}
		query += " IN ACCOUNT"

		terseDbQueryID, err = getQueryID(gQueryIDsCtx, query, driverConn)
		return
	})
```
Member:

Some of these are repeated across cases. Could we extract them out of the switch-case to avoid the duplication?
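One way the suggested deduplication could look, as a sketch: a single helper parameterized by the comment tag, object kind, optional pattern, and scope. The names `buildShowQuery` and the simplified `escapeSingleQuoteForLike` stub here are hypothetical, not the driver's actual code.

```go
package main

import "fmt"

// escapeSingleQuoteForLike is a stand-in stub: it only doubles single quotes
// so the pattern can be embedded in a SQL string literal.
func escapeSingleQuoteForLike(s string) string {
	out := ""
	for _, r := range s {
		if r == '\'' {
			out += "''"
		} else {
			out += string(r)
		}
	}
	return out
}

// buildShowQuery centralizes the SHOW-query construction repeated across the
// switch cases: optional LIKE pattern, optional IN scope. Hypothetical sketch.
func buildShowQuery(tag, objectKind string, pattern *string, scope string) string {
	query := "SHOW TERSE /* " + tag + " */ " + objectKind
	if pattern != nil && len(*pattern) > 0 && *pattern != "%" && *pattern != ".*" {
		query += " LIKE '" + escapeSingleQuoteForLike(*pattern) + "'"
	}
	if scope != "" {
		query += " IN " + scope
	}
	return query
}

func main() {
	cat := "MY_DB"
	fmt.Println(buildShowQuery("ADBC:getObjectsDBSchemas", "DATABASES", &cat, "ACCOUNT"))
}
```

Each case would then call the helper instead of repeating the string concatenation inline.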

joellubi (Member):

Thanks @zeroshade! Any rough performance numbers for using SHOW to get the DB objects rather than information_schema?

zeroshade (Member, Author):

@joellubi Most of the performance gain actually came from the improved handling of the channels rather than the switch to using SHOW, since the SHOW queries only replaced the calls selecting from information_schema.schemata etc.

The way the channels were being handled caused bottlenecks: we weren't using buffered channels, and the record reader was being passed through a channel instead of just being used directly. Switching up the management of the channels led to about a 25% improvement in performance by removing the blocking. My tests showed a drop from ~5s to ~3.5s for a large GetObjects scenario. About 2/3 of the time is the raw Snowflake execution, which for the ADBC account takes a total of around 2-3 seconds, depending on the query, across all of the SHOW queries plus the primary one.

joellubi (Member):

> @joellubi Most of the performance actually came from the improved handling of the channels rather than the switch to using SHOW since they only replaced the calls to selecting from information_schema.schemata etc.
>
> The way the channels were being handled caused bottlenecks since we weren't using buffered channels and the record reader was being passed through a channel instead of just being used directly. Switching up the management of the channels led to about a 25% improvement in performance by removing the blocking. My tests showed a drop from ~5s to ~3.5s for a large GetObjects scenario. About 2/3 of the time is the raw Snowflake execution, which for the ADBC account takes a total of around 2-3 seconds, depending on the query, for all of the SHOW queries plus the primary one.

Ah cool, the record reader handling is much cleaner now. Not sure why I did it that way originally.

Good catch on increasing the buffer size for the channel. I did think that could be a bottleneck, which is why I didn't make it unbuffered, but I didn't think it would be so significant. I also couldn't think of a value to use that didn't feel somewhat arbitrary. Maybe make it configurable or set it to runtime.NumCPU()? Not critical, but could be nice.
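The buffering idea being discussed can be sketched in miniature. This is illustrative only, not the driver's code: sizing the channel by `runtime.NumCPU()` gives the producer goroutine headroom so sends rarely block, instead of stalling on every send as with an unbuffered channel.

```go
package main

import (
	"fmt"
	"runtime"
)

// consume drains a channel of records and aggregates them; the consumer
// reads from the channel directly rather than receiving a reader through
// another channel.
func consume(records <-chan int) int {
	sum := 0
	for r := range records {
		sum += r
	}
	return sum
}

func main() {
	// Buffer sized by CPU count rather than an arbitrary constant.
	records := make(chan int, runtime.NumCPU())
	go func() {
		defer close(records)
		for i := 0; i < 10; i++ {
			records <- i // does not block until the buffer is full
		}
	}()
	fmt.Println(consume(records)) // 0+1+...+9 = 45
}
```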

@zeroshade zeroshade force-pushed the fixup-metadata-getobjects-snowflake branch from e4e38e8 to 55a5c76 Compare October 15, 2024 16:07
zeroshade (Member, Author):

@joellubi I'll switch it to runtime.NumCPU and fix up the failing integration tests tomorrow. I'm partway through; Snowflake is making me sad.

```go
return conn.Raw(func(driverConn any) (err error) {
	query := "SHOW TERSE /* ADBC:getObjectsCatalogs */ DATABASES"
	if catalog != nil && len(*catalog) > 0 && *catalog != "%" && *catalog != ".*" {
		query += " LIKE '" + escapeSingleQuoteForLike(*catalog) + "'"
```
Contributor:

I believe this will be a case-sensitive search (LIKE), and will it also treat names with underscores as wildcards?

zeroshade (Member, Author):

The LIKE keyword in the SHOW commands is actually case-insensitive according to the docs (https://docs.snowflake.com/en/sql-reference/sql/show-tables). It does treat underscores as wildcards, as in a LIKE comparison, though our docs do say that the arguments for "catalog" and such are treated as patterns if they include wildcards like _ and %.
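For callers who want a literal match rather than pattern semantics, the usual approach under LIKE rules is to backslash-escape the wildcards. The helper below is a hypothetical sketch, not a driver function: it escapes `\`, `_`, and `%` so a name like `MY_DB` matches only itself instead of `MY1DB`, `MYXDB`, etc.

```go
package main

import (
	"fmt"
	"strings"
)

// literalLike escapes LIKE wildcards so the name matches literally.
// Backslashes are escaped first so existing ones aren't double-processed.
// Hypothetical sketch only.
func literalLike(name string) string {
	name = strings.ReplaceAll(name, `\`, `\\`)
	name = strings.ReplaceAll(name, `_`, `\_`)
	name = strings.ReplaceAll(name, `%`, `\%`)
	return name
}

func main() {
	fmt.Println(literalLike("MY_DB")) // MY\_DB
}
```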

zeroshade (Member, Author):

Finally got the unit tests and validation tests passing for this. Can I get one more review pass please?

c/validation/adbc_validation_connection.cc (outdated; resolved)
c/validation/adbc_validation_connection.cc (resolved)
c/validation/adbc_validation_connection.cc (outdated; resolved)
```diff
@@ -2180,15 +2180,15 @@ void StatementTest::TestSqlBind() {
   ASSERT_THAT(
       AdbcStatementSetSqlQuery(
-          &statement, "SELECT * FROM bindtest ORDER BY \"col1\" ASC NULLS FIRST", &error),
+          &statement, "SELECT * FROM bindtest ORDER BY col1 ASC NULLS FIRST", &error),
```
Member:

Do we perhaps need a quirk for escaping column names?

(I also wouldn't be opposed to trying to make these tests more data-driven...I should go find time to sketch it out)

zeroshade (Member, Author):

It's more about consistency. Our CREATE TABLE query earlier in this function doesn't quote the column names, so our SELECT statement also needs to leave them unquoted. Almost everywhere else we quote the columns; we just need to be consistent.

That said, I agree it would be awesome for these tests to be more data-driven.

@zeroshade zeroshade force-pushed the fixup-metadata-getobjects-snowflake branch from 50b302e to e26883f Compare October 17, 2024 18:00
@zeroshade zeroshade merged commit 5471d95 into main Oct 17, 2024
96 of 97 checks passed
@zeroshade zeroshade deleted the fixup-metadata-getobjects-snowflake branch October 17, 2024 18:44
Successfully merging this pull request may close these issues.

snowflake: cannot call GetObjects with null catalog
5 participants