
[FEAT] read_sql #1943

Merged: 30 commits merged into main on Mar 13, 2024
Conversation

@colin-ho (Contributor) commented on Feb 22, 2024:

Closes #1560

Adds a new read method, read_sql(sql: str, url: str), which executes a given SQL query against a given database URL and creates a DataFrame from the results (a minimal usage sketch follows the feature list below).

Drive-bys:

  • Added a from_arrow cast from PyArrow date64 to Daft timestamp (both are millisecond precision) to allow ConnectorX reads with timestamp columns.

Features:

  • Uses ConnectorX, which reads directly into PyArrow via Rust when supported; otherwise falls back to SQLAlchemy
  • Partitioned reads via limit and offset when supported
  • Integration tests for Postgres, MySQL, Trino
  • Pushdowns into base SQL query
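
A minimal usage sketch, based on the signature described above; the connection URL, query, and column names are illustrative, and any database supported by ConnectorX or SQLAlchemy should work the same way:

```python
import daft

# Hypothetical Postgres connection URL and table.
df = daft.read_sql(
    "SELECT id, name, created_at FROM users",
    "postgresql://username:password@localhost:5432/mydb",
)
df.show()
```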

@colin-ho changed the title from [FEAT] Read_SQL to [FEAT] read_sql on Feb 22, 2024
@github-actions bot added the 'enhancement' (New feature or request) label on Feb 22, 2024
codecov bot commented on Feb 23, 2024:

Codecov Report

Attention: Patch coverage is 31.73077%, with 142 lines in your changes missing coverage. Please review.

Project coverage is 82.70%. Comparing base (bd78fa8) to head (bd65b05).
Report is 9 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1943      +/-   ##
==========================================
- Coverage   84.67%   82.70%   -1.98%     
==========================================
  Files          58       62       +4     
  Lines        6363     6615     +252     
==========================================
+ Hits         5388     5471      +83     
- Misses        975     1144     +169     
Files Coverage Δ
daft/__init__.py 24.24% <ø> (ø)
daft/context.py 76.22% <ø> (ø)
daft/expressions/expressions.py 91.46% <100.00%> (-0.04%) ⬇️
daft/io/__init__.py 95.45% <100.00%> (+0.21%) ⬆️
daft/runners/partitioning.py 82.15% <100.00%> (+0.25%) ⬆️
daft/datatype.py 92.01% <75.00%> (-0.23%) ⬇️
daft/logical/schema.py 91.22% <50.00%> (-0.74%) ⬇️
daft/io/_sql.py 52.94% <52.94%> (ø)
daft/table/table_io.py 89.95% <13.33%> (-5.38%) ⬇️
daft/sql/sql_reader.py 22.41% <22.41%> (ø)
... and 1 more

... and 6 files with indirect coverage changes

(Resolved inline review threads on daft/sql/sql_reader.py, src/daft-dsl/src/expr.rs, src/daft-scan/src/python.rs, and tests/integration/sql/conftest.py.)
@colin-ho requested a review from samster25 on March 1, 2024 at 18:20
@github-actions bot added the 'documentation' (Improvements or additions to documentation) label on Mar 1, 2024
@colin-ho (Contributor, author) commented:
Refactored our creation docs as a drive-by; previously we had Arrow, Pandas, and file paths in a separate in-memory section.

@samster25 (Member) left a review:

Nice work! Just some nits :)

rows = result.fetchall()
pydict = {column_name: [row[i] for row in rows] for i, column_name in enumerate(result.keys())}

return pa.Table.from_pydict(pydict)
samster25 (Member):

You also want to pass the schema into from_pydict; otherwise we are relying on PyArrow's type inference.
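
A minimal sketch of what passing an explicit schema could look like here; the example schema is illustrative and would in practice be derived from the database's type information rather than hard-coded:

```python
import pyarrow as pa

rows = result.fetchall()
pydict = {column_name: [row[i] for row in rows] for i, column_name in enumerate(result.keys())}

# Illustrative schema; deriving it from DBAPI type codes is left as a follow-up.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
return pa.Table.from_pydict(pydict, schema=schema)
```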

colin-ho (Contributor, author):

Made an issue to create type mappings from DBAPI type codes.

pa_table = SQLReader(self.sql, self.url, limit=1).read()
schema = Schema.from_pyarrow_schema(pa_table.schema)
return schema
except Exception:
samster25 (Member):

Avoid catch-all exception handling! What errors are you expecting here?

schema = Schema.from_pyarrow_schema(pa_table.schema)
return schema
except Exception:
# If limit fails, try to read the entire table
samster25 (Member):

Probably should log a warning if running it again.
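
A sketch of what the fallback could look like with a warning; the specific exception type caught here is an assumption rather than what the PR actually does:

```python
import logging

logger = logging.getLogger(__name__)

try:
    # Cheap path: infer the schema from a single row.
    pa_table = SQLReader(self.sql, self.url, limit=1).read()
except RuntimeError as e:  # assumed failure mode when LIMIT is not supported
    logger.warning(
        "LIMIT 1 schema-inference query failed (%s); falling back to reading the entire table", e
    )
    pa_table = SQLReader(self.sql, self.url).read()
return Schema.from_pyarrow_schema(pa_table.schema)
```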

self.url,
projection=["COUNT(*)"],
).read()
return pa_table.column(0)[0].as_py()
samster25 (Member):

You should put in some checks to ensure there is exactly 1 column and 1 row (raising an error otherwise) before indexing into it. Otherwise this leads to hard-to-debug stack traces for the end user!
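
A sketch of the kind of validation being suggested for the COUNT(*) read shown above; the error message wording is illustrative:

```python
pa_table = SQLReader(self.sql, self.url, projection=["COUNT(*)"]).read()

# Validate the shape before indexing so a bad result surfaces as a clear error
# instead of an opaque index error.
if pa_table.num_columns != 1 or pa_table.num_rows != 1:
    raise RuntimeError(
        f"Expected COUNT(*) query to return exactly 1 column and 1 row, "
        f"got {pa_table.num_columns} columns and {pa_table.num_rows} rows"
    )
return pa_table.column(0)[0].as_py()
```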

for i, percentile in enumerate(percentiles)
],
).read()
bounds = [pa_table.column(i)[0].as_py() for i in range(num_scan_tasks - 1)]
samster25 (Member):

Perform checks that raise errors to ensure the expected output before indexing into it.
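
Similarly, a sketch of validating the percentile-bounds result before indexing, assuming the query should return one single-row column per partition boundary (num_scan_tasks - 1 columns):

```python
if pa_table.num_columns != num_scan_tasks - 1 or pa_table.num_rows != 1:
    raise RuntimeError(
        f"Expected percentile query to return {num_scan_tasks - 1} columns and 1 row, "
        f"got {pa_table.num_columns} columns and {pa_table.num_rows} rows"
    )
bounds = [pa_table.column(i)[0].as_py() for i in range(num_scan_tasks - 1)]
```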

Expr::Alias(inner, ..) => to_sql_inner(inner, buffer),
Expr::BinaryOp { op, left, right } => {
to_sql_inner(left, buffer)?;
let op = match op {
samster25 (Member):

You can just write directly into the buffer rather than converting to a string first (which causes a heap allocation) and then writing to the buffer.

to_sql_inner(right, buffer)
}
Expr::Not(inner) => {
write!(buffer, "NOT (")?;
samster25 (Member):

You should be able to collapse this:
write!(buffer, "NOT ({})", to_sql_inner(inner, buffer)?)

if_false,
predicate,
} => {
write!(buffer, "CASE WHEN ")?;
samster25 (Member):

this should be able to be a single write! macro

@@ -212,6 +212,36 @@ impl LiteralValue {
};
result
}

pub fn to_sql(&self) -> Option<String> {
samster25 (Member):

No need to implement it now, but you could have this be display_sql and take in a formatter as well!

size_bytes,
metadata: num_rows.map(|n| TableMetadata { length: n as usize }),
partition_spec: None,
statistics: None,
samster25 (Member):

if the table is partitioned by some column, we could leverage that for statistics

colin-ho (Contributor, author):

Will make an issue for this as a follow-on.

@colin-ho (Contributor, author) replied, quoting the review:
Nice work! Just some nits :)

Thanks! Addressed your feedback in the latest commit. I also added functionality to insert limit pushdowns into the SQL, as well as some general refactoring.
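
For illustration, a limit pushdown means folding the limit into the SQL sent to the database rather than applying it after the read. A hedged sketch of how such a query could be constructed; the subquery-wrapping pattern and helper name are assumptions, not necessarily the PR's exact implementation:

```python
def apply_limit_pushdown(sql: str, limit: int) -> str:
    # Wrap the user's query as a subquery so the LIMIT applies to its full result.
    return f"SELECT * FROM ({sql}) AS subquery LIMIT {limit}"

print(apply_limit_pushdown("SELECT * FROM users", 10))
# SELECT * FROM (SELECT * FROM users) AS subquery LIMIT 10
```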

@samster25 (Member) left a review:

🚀

@colin-ho colin-ho merged commit b6b0a2e into main Mar 13, 2024
30 of 32 checks passed
@colin-ho colin-ho deleted the colin/read-sql branch March 13, 2024 22:37
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)
Linked issue that merging this pull request may close: SQL to Daft
2 participants