Skip to content

Commit

Permalink
fix(c/driver/sqlite): Wrap bulk ingests in a single begin/commit txn (#…
Browse files Browse the repository at this point in the history
…910)

Fixes #466 by having a single begin/commit txn for ingesting tables,
instead of committing once per row.

# Testing
## R
I first installed the [SQLite R
driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and
measured the time it takes to bulk ingest the `nycflights13` dataset,
noting the 336776 rowcount:
```r
library(adbcdrivermanager)
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL

stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)

start_time <- Sys.time()
adbc_statement_execute_query(stmt)
end_time <- Sys.time()
[1] 336776
```
As well as the time it takes to execute the query:
```r
end_time - start_time
Time difference of 1.711345 mins
```

As a followup, I noticed it takes significantly longer (~30 minutes) to
execute the query on my XPS 15 9520:
- Ubuntu 22.04.2
- kernel 5.19.0
- i9-12900HK @ 4.90 GHz
- Intel Alder Lake-P
- 64 GB RAM

Compared to my Macbook Air (1.711345 minutes):
- macOS Ventura 13.4.1
- Apple M2
- 16 GB RAM

Both on the same version of R 4.3.1 (Beagle Scout).

After making my changes, I ran `R CMD INSTALL . --preclean` in
`arrow-adbc/r/adbcdrivermanager`. I also installed the following R
packages:
```r
# install.packages("devtools")
# install.packages("pkgbuild")
```

After which I ran the following commands to validate no build / compile
issues showing up as R packaged my changes:
```r
devtools::build()
devtools::check()
devtools::install()
```
Noting that the file I made changes to showed up as a vendored file:
```r
Vendoring files from arrow-adbc to src/:
- ../../adbc.h
- ../../c/driver/sqlite/sqlite.c
- ../../c/driver/sqlite/statement_reader.c
- ../../c/driver/sqlite/statement_reader.h
- ../../c/driver/sqlite/types.h
- ../../c/driver/common/utils.c
- ../../c/driver/common/utils.h
- ../../c/vendor/nanoarrow/nanoarrow.h
- ../../c/vendor/nanoarrow/nanoarrow.c
- ../../c/vendor/sqlite3/sqlite3.h
- ../../c/vendor/sqlite3/sqlite3.c
All files successfully copied to src/
```

After packaging and installing my changes, I ran through the same bulk
ingest commands for the `nycflights13` dataset and verified that the
table contained the same number of rows as the previous run, also noting
the speedup from 1.7 minutes to 0.2 seconds:
```r
...
> start_time <- Sys.time()
adbc_statement_execute_query(stmt)
end_time <- Sys.time()
[1] 336776
> end_time - start_time
Time difference of 0.2236128 secs
```
  • Loading branch information
ywc88 authored Jul 25, 2023
1 parent b8b8e45 commit 4394686
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions c/driver/sqlite/sqlite.c
Original file line number Diff line number Diff line change
Expand Up @@ -1101,7 +1101,10 @@ AdbcStatusCode SqliteStatementExecuteIngest(struct SqliteStatement* stmt,
AdbcStatusCode status = SqliteStatementInitIngest(stmt, &insert, error);

int64_t row_count = 0;
int is_autocommit = sqlite3_get_autocommit(stmt->conn);
if (status == ADBC_STATUS_OK) {
if (is_autocommit) sqlite3_exec(stmt->conn, "BEGIN TRANSACTION", 0, 0, 0);

while (1) {
char finished = 0;
status =
Expand All @@ -1120,6 +1123,8 @@ AdbcStatusCode SqliteStatementExecuteIngest(struct SqliteStatement* stmt,
}
row_count++;
}

if (is_autocommit) sqlite3_exec(stmt->conn, "COMMIT", 0, 0, 0);
}

if (rows_affected) *rows_affected = row_count;
Expand Down

0 comments on commit 4394686

Please sign in to comment.