Is your feature request related to a problem? Please describe.
Apache Beam is an open-source, unified programming model for batch and streaming data processing pipelines. I'm developing Apache Beam Python I/O connectors that utilise your integrations with SQL and open table format sources (and possibly sinks). Following the I/O connector development guide, I'll be building Splittable DoFn objects, which require a complete set of SQL queries derived from the partition conditions (e.g. partition_col, num_partitions, ...). Those queries are then executed in multiple DoFn instances, and the results are combined before being passed to subsequent tasks.
Taking SQL sources as an example, the physical plan scheduler prints the SQL queries, but they don't seem to be accessible from Python. I might be able to reconstruct them using the SQL scan operator, but that looks incomplete because the pushdowns object isn't directly obtainable in Python either. In this regard, it would be good if the complete SQL queries could be obtained in Python.
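To make the request concrete, here is a minimal sketch (not Daft's actual implementation) of how an engine might derive the per-partition SQL queries from `partition_col` and `num_partitions`, assuming evenly spaced integer bounds; the function name and bounds are illustrative only:

```python
def partition_queries(base_stmt, partition_col, lower, upper, num_partitions):
    """Wrap base_stmt in range-filtered subqueries, one per partition.

    Illustrative sketch only: assumes integer bounds split into
    num_partitions evenly sized half-open ranges [lo, hi).
    """
    step = (upper - lower) // num_partitions
    queries = []
    for i in range(num_partitions):
        lo = lower + i * step
        # The last partition absorbs any remainder up to the upper bound.
        hi = upper if i == num_partitions - 1 else lo + step
        queries.append(
            f"SELECT * FROM ({base_stmt}) AS subquery "
            f"WHERE {partition_col} >= {lo} AND {partition_col} < {hi}"
        )
    return queries

stmt = "SELECT id, first_name, last_name, email FROM staging.users"
for q in partition_queries(stmt, "id", 1, 10000, 9):
    print(q)
```

With a lower bound of 1 and an assumed upper bound of 10000, the first generated query matches the `id >= 1 AND id < 1112` filter shown in the plan output below. These are exactly the strings a Splittable DoFn would need to hand out, one per restriction.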
Describe the solution you'd like
I believe the physical plan scheduler is available for both SQL and open table format sources, and it includes the SQL queries, but they cannot be obtained in Python. It would be good to have a to_dict() method (or similar) that returns the complete SQL queries.
```python
USER_STMT = "SELECT id, first_name, last_name, email FROM staging.users"

## not fully evaluated
df = daft.read_sql(
    sql=USER_STMT, conn=create_connection, partition_col="id", num_partitions=9
)

## physical plan scheduler should exist for sql and open table formats
physical_plan_scheduler = df._builder.to_physical_plan_scheduler(
    daft.context.get_context().daft_execution_config
)

## can we have a to_dict method or another that can obtain the complete sql queries?
physical_plan_scheduler
# * TabularScan:
# |   Num Scan Tasks = 9
# |   Estimated Scan Bytes = 685008
# |   Clustering spec = { Num partitions = 9 }
# |   SQL Queries = [SELECT * FROM (SELECT id, first_name, last_name, email FROM staging.users) AS subquery WHERE id >= 1 AND id < 1112,..]
# |   Schema: {id#Int64, first_name#Utf8, last_name#Utf8, email#Utf8}
# |   Scan Tasks: [
# |     {Database {postgresql+psycopg2://devuser:***@localhost/devdb}}
# |     {Database {postgresql+psycopg2://devuser:***@localhost/devdb}}
# |     {Database {postgresql+psycopg2://devuser:***@localhost/devdb}}
# |     ...
# |     {Database {postgresql+psycopg2://devuser:***@localhost/devdb}}
# |     {Database {postgresql+psycopg2://devuser:***@localhost/devdb}}
# |     {Database {postgresql+psycopg2://devuser:***@localhost/devdb}}
# |   ]
```
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
Hey @jaehyeon-kim! Sorry for the late update. I think the simplest way for us to do this is to serialize the physical plan into JSON; then you would be able to extract the SQL queries from the JSON. PR is here: #3023
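Once the plan is serializable to JSON, pulling the queries out is a small recursive traversal. The JSON shape and the `sql_queries` key below are assumptions for illustration; the actual field names depend on what the PR emits:

```python
import json

def collect_sql_queries(node, found=None):
    """Recursively walk a JSON-decoded plan and gather any 'sql_queries' lists.

    The key name 'sql_queries' is an assumption for illustration; adjust it to
    whatever the serialized physical plan actually uses.
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "sql_queries" and isinstance(value, list):
                found.extend(value)
            else:
                collect_sql_queries(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_sql_queries(item, found)
    return found

# Hypothetical serialized plan; the real string would come from the plan's
# JSON serialization once the PR lands.
plan_json = json.loads("""
{"TabularScan": {"num_scan_tasks": 2,
 "sql_queries": ["SELECT * FROM (...) AS subquery WHERE id >= 1 AND id < 1112",
                 "SELECT * FROM (...) AS subquery WHERE id >= 1112 AND id < 2223"]}}
""")
print(collect_sql_queries(plan_json))
```

A traversal like this keeps the connector code decoupled from the exact nesting of the plan tree, so it keeps working even if scan nodes appear at different depths.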