Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance windowframe #65

Merged
merged 4 commits into from
Sep 29, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 9 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,10 @@ TidierDB.jl currently supports the following top-level macros:
| **Helper Functions** | `across`, `desc`, `if_else`, `case_when`, `n`, `starts_with`, `ends_with`, `contains`, `as_float`, `as_integer`, `as_string`, `is_missing`, `missing_if`, `replace_missing` |
| **TidierStrings.jl Functions** | `str_detect`, `str_replace`, `str_replace_all`, `str_remove_all`, `str_remove` |
| **TidierDates.jl Functions** | `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`, `difftime` |
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, `cor`, `cov`, `var`,
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, `cor`, `cov`, `var`, all SQL aggregate

`@summarize` supports any SQL aggregate function in addition to the list above. Simply write the function as written in SQL syntax and it will work. |
`@summarize` supports any SQL aggregate function in addition to the list above. Simply write the function as written in SQL syntax and it will work.
`@mutate` supports nearly all functions built into SQL as well.

When using the DuckDB backend, if `db_table` recieves a file path (`.parquet`, `.json`, `.csv`, `iceberg` or `delta`), it does not copy it into memory. This allows for queries on files too big for memory. `db_table` also supports S3 bucket locations via DuckDB.

Expand All @@ -67,7 +68,9 @@ import TidierDB as DB
db = DB.connect(DB.duckdb());
path_or_name = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"

@chain DB.db_table(db, path_or_name) begin
mtcars = DB.db_table(db, path_or_name);

@chain DB.t(mtcars) begin
DB.@filter(!starts_with(model, "M"))
DB.@group_by(cyl)
DB.@summarize(mpg = mean(mpg))
Expand Down Expand Up @@ -97,7 +100,7 @@ end
We cannot do this using TidierDB. However, we can call `@pivot_longer()` from TidierData *after* the result of the query has been instantiated as a DataFrame, like this:

```julia
@chain DB.db_table(db, path_or_name) begin
@chain DB.t(mtcars) begin
DB.@filter(!starts_with(model, "M"))
DB.@group_by(cyl)
DB.@summarize(mpg = mean(mpg))
Expand Down Expand Up @@ -136,7 +139,7 @@ end
We can replace `DB.collect()` with `DB.@show_query` to reveal the underlying SQL query being generated by TidierDB. To handle complex queries, TidierDB makes heavy use of Common Table Expressions (CTE), which are a useful tool to organize long queries.

```julia
@chain DB.db_table(db, path_or_name) begin
@chain DB.t(mtcars) begin
DB.@filter(!starts_with(model, "M"))
DB.@group_by(cyl)
DB.@summarize(mpg = mean(mpg))
Expand Down Expand Up @@ -176,7 +179,7 @@ SELECT *
## TidierDB is already quite fully-featured, supporting advanced TidierData functions like `across()` for multi-column selection.

```julia
@chain DB.db_table(db, path_or_name) begin
@chain DB.t(mtcars) begin
DB.@group_by(cyl)
DB.@summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
DB.@collect
Expand Down
2 changes: 1 addition & 1 deletion docs/examples/UserGuide/from_queryex.jl
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# While using TidierDB, you may need to generate part of a query and reuse it multiple times. `from_query()` enables a query portion to be reused multiple times as shown below.
# While using TidierDB, you may need to generate part of a query and reuse it multiple times. `from_query()` or `t()` enable a query portion to be reused multiple times as shown below.

# ```julia
# import TidierDB as DB
Expand Down
4 changes: 2 additions & 2 deletions docs/examples/UserGuide/getting_started.jl
Original file line number Diff line number Diff line change
Expand Up @@ -53,10 +53,10 @@
# If you are working with a backend where compute cost is important, it will be important to minimize using `db_table` as this will requery for metadata each time.
# Compute costs are relevant to backends such as AWS, databricks and Snowflake.

# To do this, save the results of `db_table` and use them with `from_query`. Using `from_query` pulls the relevant information (metadata, con, etc) from the mutable SQLquery struct, allowing you to repeatedly query and collect the table without requerying for the metadata each time
# To do this, save the results of `db_table` and use them with `t`. Using `t` pulls the relevant information (metadata, con, etc) from the mutable SQLquery struct, allowing you to repeatedly query and collect the table without requerying for the metadata each time
# ```julia
# table = DB.db_table(con, "path")
# @chain DB.from_query(table) begin
# @chain DB.t(table) begin
# ## data wrangling here
# end
# ```
Expand Down
2 changes: 0 additions & 2 deletions docs/examples/UserGuide/ibis_comp.jl
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,6 @@
# ```julia
# using TidierDB
# db = connect(duckdb())
# # This next line is optional, but it will let us avoid writing `db_table` or `from_query` for each query
# t(table) = from_query(table)
# ```
# Of note, TidierDB does not yet have an "interactive mode" so each example result will be collected.

Expand Down
5 changes: 3 additions & 2 deletions docs/src/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,9 +39,10 @@ TidierDB.jl currently supports:
| **Helper Functions** | `across`, `desc`, `if_else`, `case_when`, `n`, `starts_with`, `ends_with`, `contains`, `as_float`, `as_integer`, `as_string`, `is_missing`, `missing_if`, `replace_missing` |
| **TidierStrings.jl Functions** | `str_detect`, `str_replace`, `str_replace_all`, `str_remove_all`, `str_remove` |
| **TidierDates.jl Functions** | `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`, `difftime` |
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, `cor`, `cov`, `var`,
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, `cor`, `cov`, `var`, all SQL aggregate

`@summarize` supports any SQL aggregate function in addition to the list above. Simply write the function as written in SQL syntax and it will work. |
`@summarize` supports any SQL aggregate function in addition to the list above. Simply write the function as written in SQL syntax and it will work.
`@mutate` supports nearly all functions built into SQL as well.


When using the DuckDB backend, if `db_table` recieves a file path ( `.parquet`, `.json`, `.csv`, `iceberg` or `delta`), it does not copy it into memory. This allows for queries on files too big for memory. `db_table` also supports S3 bucket locations via DuckDB.
Expand Down
86 changes: 0 additions & 86 deletions src/TBD_macros.jl
Original file line number Diff line number Diff line change
Expand Up @@ -552,92 +552,6 @@ macro rename(sqlquery, renamings...)
end
end

"""
$docstring_window_order
"""
macro window_order(sqlquery, order_by_expr...)
order_by_expr = parse_interpolation2.(order_by_expr)

return quote
sq = $(esc(sqlquery))
if isa(sq, SQLQuery)
# Convert order_by_expr to SQL order by string
order_by_sql = join([expr_to_sql(expr, sq) for expr in $(esc(order_by_expr))], ", ")

# Update the orderBy field of the SQLQuery instance
sq.window_order = order_by_sql

# If this is the first operation after an aggregation, wrap current state in a CTE
if sq.post_aggregation
sq.post_aggregation = false
cte_name = "cte_" * string(sq.cte_count + 1)
cte_sql = "SELECT * FROM " * sq.from

if !isempty(sq.where)
cte_sql *= " WHERE " * sq.where
end

new_cte = CTE(name=cte_name, select=cte_sql, from=sq.from)
push!(sq.ctes, new_cte)
sq.cte_count += 1

# Reset the from to reference the new CTE
sq.from = cte_name
end

# Note: Actual window functions would be applied in subsequent @mutate calls or similar,
# potentially using the orderBy set here for their OVER() clauses.
else
error("Expected sqlquery to be an instance of SQLQuery")
end
sq
end
end

"""
$docstring_window_frame
"""
macro window_frame(sqlquery, frame_start::Int, frame_end::Int)
return quote
sq = $(esc(sqlquery))
if isa(sq, SQLQuery)
# Validate frame_start and frame_end
if $frame_end < $frame_start
error("frame_end must be greater than or equal to frame_start")
end

# Calculate absolute values for frame_start and frame_end
abs_frame_start = abs($frame_start)
abs_frame_end = abs($frame_end)

# Determine the direction and clause for frame_start
frame_start_clause = if $frame_start < 0
string(abs_frame_start, " PRECEDING")
else
string(abs_frame_start, " FOLLOWING")
end

# Determine the direction and clause for frame_end
frame_end_clause = if $frame_end < 0
string(abs_frame_end, " PRECEDING")
else
string(abs_frame_end, " FOLLOWING")
end

# Construct the window frame clause
frame_clause = string("ROWS BETWEEN ", frame_start_clause, " AND ", frame_end_clause)

# Update the windowFrame field of the SQLQuery instance
sq.windowFrame = frame_clause
else
error("Expected sqlquery to be an instance of SQLQuery")
end
sq
end
end




macro show_query(sqlquery)
return quote
Expand Down
7 changes: 4 additions & 3 deletions src/TidierDB.jl
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,8 @@ using GZip
@distinct, @left_join, @right_join, @inner_join, @count, @window_order, @window_frame, @show_query, @collect, @slice_max,
@slice_min, @slice_sample, @rename, copy_to, duckdb_open, duckdb_connect, @semi_join, @full_join,
@anti_join, connect, from_query, @interpolate, add_interp_parameter!, update_con, @head,
clickhouse, duckdb, sqlite, mysql, mssql, postgres, athena, snowflake, gbq, oracle, databricks, SQLQuery, show_tables, t
clickhouse, duckdb, sqlite, mysql, mssql, postgres, athena, snowflake, gbq, oracle, databricks, SQLQuery, show_tables,
t

abstract type SQLBackend end

Expand Down Expand Up @@ -58,7 +59,7 @@ include("parsing_oracle.jl")
include("parsing_databricks.jl")
include("joins_sq.jl")
include("slices_sq.jl")

include("windows.jl")



Expand Down Expand Up @@ -413,7 +414,7 @@ end
function connect(::duckdb, token::String)
if token == "md:"
return DBInterface.connect(DuckDB.DB, "md:")
elseif endswith(token, ".duckdb")
elseif endswith(token, ".duckdb") || endswith(token, ".duck.db")
drizk1 marked this conversation as resolved.
Show resolved Hide resolved
return DuckDB.DB(token)
else
return DBInterface.connect(DuckDB.DB, "md:$token")
Expand Down
56 changes: 49 additions & 7 deletions src/docstrings.jl
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,9 @@ julia> db = connect(duckdb());

julia> copy_to(db, df, "df_mem");

julia> @chain db_table(db, :df_mem) begin
julia> df_mem = db_table(db, :df_mem);

julia> @chain t(df_mem) begin
@select(groups:percent)
@collect
end
Expand All @@ -39,7 +41,7 @@ julia> @chain db_table(db, :df_mem) begin
9 │ bb 4 0.9
10 │ aa 5 1.0

julia> @chain db_table(db, :df_mem) begin
julia> @chain t(df_mem) begin
@select(contains("e"))
@collect
end
Expand Down Expand Up @@ -925,20 +927,30 @@ julia> df = DataFrame(id = [string('A' + i ÷ 26, 'A' + i % 26) for i in 0:9],
julia> db = connect(duckdb());

julia> copy_to(db, df, "df_mem");

julia> @chain db_table(db, :df_mem) begin
@group_by groups
@window_frame(3)
@window_order(desc(percent))
@mutate(avg = mean(value))
#@show_query
end;
```
"""

const docstring_window_frame =
"""
@window_frame(sql_query, frame_start::Int, frame_end::Int)
@window_frame(sql_query, args...)

Define the window frame for window functions in a SQL query, specifying the range of rows to include in the calculation relative to the current row.

# Arguments
sql_query: The SQL query to operate on, expected to be an instance of SQLQuery.
- `frame_start`: The starting point of the window frame. A positive value indicates the start after the current row (FOLLOWING), a negative value indicates before the current row (PRECEDING), and 0 indicates the current row.
- `frame_end`: The ending point of the window frame. A positive value indicates the end after the current row (FOLLOWING), a negative value indicates before the current row (PRECEDING), and 0 indicates the current row.

- `sqlquery::SQLQuery`: The SQLQuery instance to which the window frame will be applied.
- `args...`: A variable number of arguments specifying the frame boundaries. These can be:
- `from`: The starting point of the frame. Can be a positive or negative integer, 0 or empty. When empty, it will use UNBOUNDED
- `to`: The ending point of the frame. Can be a positive or negative integer, 0 or empty. When empty, it will use UNBOUNDED
- if only one integer is provided without specifying `to` or `from` it will default to from, and to will be UNBOUNDED.
- if no arguments are given, both will be UNBOUNDED
# Examples
```jldoctest
julia> df = DataFrame(id = [string('A' + i ÷ 26, 'A' + i % 26) for i in 0:9],
Expand All @@ -949,6 +961,36 @@ julia> df = DataFrame(id = [string('A' + i ÷ 26, 'A' + i % 26) for i in 0:9],
julia> db = connect(duckdb());

julia> copy_to(db, df, "df_mem");

julia> df_mem = db_table(db, :df_mem);

julia> @chain t(df_mem) begin
@group_by groups
@window_frame(3)
@mutate(avg = mean(percent))
#@show_query
end;

julia> @chain t(df_mem) begin
@group_by groups
@window_frame(-3, 3)
@mutate(avg = mean(percent))
#@show_query
end;

julia> @chain t(df_mem) begin
@group_by groups
@window_frame(to = -3)
@mutate(avg = mean(percent))
#@show_query
end;

julia> @chain t(df_mem) begin
@group_by groups
@window_frame()
@mutate(avg = mean(percent))
#@show_query
end;
```
"""

Expand Down
Loading