Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPC-H query 9 via SQL with DuckDB with Parquet files oddity #65

Open
jonkeane opened this issue Dec 20, 2021 · 0 comments
Open

TPC-H query 9 via SQL with DuckDB with Parquet files oddity #65

jonkeane opened this issue Dec 20, 2021 · 0 comments

Comments

@jonkeane
Copy link
Contributor

When using DuckDB SQL (against Views of parquet files), for TPC-H query 9 at scale factor 1 the ordering is off. USA and UK are interleaved, when they should actually be UK then USA.

result vs answer
                        nation o_year sum_profit
  result[152, ]   SAUDI ARABIA   1994   48181673
  result[153, ]   SAUDI ARABIA   1993   45692556
  result[154, ]   SAUDI ARABIA   1992   48924913
- result[155, ]  UNITED STATES   1998   27826594
+ answer[155, ] UNITED KINGDOM   1998   26366683
- result[156, ] UNITED KINGDOM   1998   26366683
+ answer[156, ] UNITED KINGDOM   1997   44518130
- result[157, ] UNITED KINGDOM   1997   44518130
+ answer[157, ] UNITED KINGDOM   1996   45539730
- result[158, ]  UNITED STATES   1997   46638572
+ answer[158, ] UNITED KINGDOM   1995   46845879
- result[159, ]  UNITED STATES   1996   46688281
+ answer[159, ] UNITED KINGDOM   1994   43081610
- result[160, ] UNITED KINGDOM   1996   45539730
+ answer[160, ] UNITED KINGDOM   1993   44770147
- result[161, ]  UNITED STATES   1995   48951592
+ answer[161, ] UNITED KINGDOM   1992   44123403
- result[162, ] UNITED KINGDOM   1995   46845879
+ answer[162, ]  UNITED STATES   1998   27826594
- result[163, ]  UNITED STATES   1994   45099092
+ answer[163, ]  UNITED STATES   1997   46638572
- result[164, ] UNITED KINGDOM   1994   43081610
+ answer[164, ]  UNITED STATES   1996   46688281
- result[165, ] UNITED KINGDOM   1993   44770147
+ answer[165, ]  UNITED STATES   1995   48951592
- result[166, ]  UNITED STATES   1993   46181601
+ answer[166, ]  UNITED STATES   1994   45099092
- result[167, ]  UNITED STATES   1992   46168214
+ answer[167, ]  UNITED STATES   1993   46181601
- result[168, ] UNITED KINGDOM   1992   44123403
+ answer[168, ]  UNITED STATES   1992   46168214
  result[169, ]        VIETNAM   1998   27281931
  result[170, ]        VIETNAM   1997   48735914
  result[171, ]        VIETNAM   1996   47824596
library(duckdb)
library(arrowbench)

query_id <- 9
format <- "parquet"
scale_factor <- 1

Sys.setenv(ARROWBENCH_LOCAL_DIR = "~/repos/ab_store/het")

tpch_files <- ensure_source("tpch", scale_factor = scale_factor)

tpch_tables_needed <- tables_refed(tpc_h_queries[[query_id]])

con <- dbConnect(duckdb())

# set parallelism for duckdb
DBI::dbExecute(con, paste0("PRAGMA threads=10"))

for (name in tpch_tables_needed) {
  file <- path.expand(tpch_files[[name]])

  # have to create a VIEW in order to reference it by name
  # This view is the most accurate comparison to Arrow, however it will
  # penalize duckdb since AFAICT `parquet_scan` is not parallelized and
  # ends up being the bottleneck
  if (format == "parquet") {
    sql_query <- paste0("CREATE OR REPLACE VIEW ", name, " AS SELECT * FROM parquet_scan('", file, "');")
  } else if (format == "native") {
    sql_query <- paste0("CREATE TABLE IF NOT EXISTS ", name, " AS SELECT * FROM parquet_scan('", file, "');")
  }

  DBI::dbExecute(con, sql_query)
}

tpc_h_queries <- list()
# TODO: should this actually just be query_id?
queries <- 1:22

# get the queries
tpc_h_queries <- lapply(queries, get_sql_query_func, con = con)

names(tpc_h_queries) <- queries

result <- tpc_h_queries[[query_id]](con = con)

result <- dplyr::as_tibble(result)

answer <- tpch_answer(scale_factor, query_id)

waldo::compare(result, answer, x_arg = "result", y_arg = "answer")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant