Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can hyper python API use multi-core? #99

Open
l1t1 opened this issue Sep 1, 2023 · 5 comments
Open

can hyper python API use multi-core? #99

l1t1 opened this issue Sep 1, 2023 · 5 comments

Comments

@l1t1
Copy link

l1t1 commented Sep 1, 2023

from tableauhyperapi import HyperProcess, Telemetry, Connection

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        import time
        t=time.time()
        a=connection.execute_scalar_query("select count(1) from 'd:/yellow_tripdata_2021-06.parquet'")
        print(a,time.time()-t)
        t=time.time()
        a=connection.execute_list_query("select passenger_count,count(1) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1")
        print(a,time.time()-t)
        t=time.time()
        a=connection.execute_list_query("select passenger_count,sum(trip_distance) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1")
        print(a,time.time()-t)

returns

2834264 0.18601059913635254
[[0, 66636], [1, 1968440], [2, 412798], [3, 108634], [4, 40950], [5, 67686], [6, 45562], [7, 11], [8, 5], [9, 4], [None, 123538]] 0.20101165771484375
[[0, 172554.11], [1, 5797179.629999995], [2, 1341309.7100000011], [3, 343928.14999999997], [4, 134748.31000000006], [5, 204493.66000000003], [6, 13989
3.91], [7, 33.44], [8, 9.17], [9, 0.0], [None, 11517949.330000013]] 0.2130122184753418

while duckdb CLI on same machine query same file

D select passenger_count,count(1) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1;
┌─────────────────┬──────────┐
│ passenger_count │ count(1) │
│      int32      │  int64   │
├─────────────────┼──────────┤
│               066636 │
│               11968440 │
│               2412798 │
│               3108634 │
│               440950 │
│               567686 │
│               645562 │
│               711 │
│               85 │
│               94 │
│                 │   123538 │
├─────────────────┴──────────┤
│ 11 rows          2 columns │
└────────────────────────────┘
Run Time (s): real 0.197 user 0.171601 sys 0.000000
D select passenger_count,count(1) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1;
┌─────────────────┬──────────┐
│ passenger_count │ count(1) │
│      int32      │  int64   │
├─────────────────┼──────────┤
│               066636 │
│               11968440 │
│               2412798 │
│               3108634 │
│               440950 │
│               567686 │
│               645562 │
│               711 │
│               85 │
│               94 │
│                 │   123538 │
├─────────────────┴──────────┤
│ 11 rows          2 columns │
└────────────────────────────┘
Run Time (s): real 0.074 user 0.156001 sys 0.046800
D select passenger_count,sum(trip_distance) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1
> ;
┌─────────────────┬────────────────────┐
│ passenger_count │ sum(trip_distance) │
│      int32      │       double       │
├─────────────────┼────────────────────┤
│               0172554.1099999999 │
│               15797179.629999994 │
│               21341309.7100000044 │
│               3343928.15000000084 │
│               4134748.30999999997 │
│               5204493.66000000027 │
│               6139893.91000000006 │
│               733.44 │
│               89.17 │
│               90.0 │
│                 │ 11517949.330000013 │
├─────────────────┴────────────────────┤
│ 11 rows                    2 columns │
└──────────────────────────────────────┘
Run Time (s): real 0.079 user 0.296402 sys 0.140401
@l1t1
Copy link
Author

l1t1 commented Sep 1, 2023

@vogelsgesang
Copy link
Contributor

vogelsgesang commented Sep 2, 2023

Thanks for bringing this up! Yes, Hyper can use multiple cores. In this particular case, the input set is so tiny, that Hyper will not benefit much from multi-threading, though. Hyper's full performance will only be unleashed on much bigger data sets than 17 megabyte. I would recommend testing Hyper with data sizes of at least a couple of gigabytes.

However, I guess your actual question is not about multi-core anyway. I guess you are rather wondering: "Why is Hyper slower than DuckDB on those queries?". Let's take a closer look at this 🙂

The trick is to use CREATE TEMPORARY EXTERNAL TABLE. The difference is:

  • With the adhoc-syntax FROM 'file/path.parquet, Hyper assumes that you will be accessing the file only once. Hence, Hyper will close and open the file multiple times. Each time, Hyper needs to rediscover the metadata on every query.
  • With CREATE EXTERNAL TABLE, you tell Hyper that you are planning to access the table multiple times. Hyper will keep the meta data about your file in cache.

Here is an updated benchmark:

import time
from tableauhyperapi import HyperProcess, Telemetry, Connection

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        a=connection.execute_command("CREATE TEMPORARY EXTERNAL TABLE tripdata FOR './yellow_tripdata_2021-06.parquet'")

        t=time.time()
        a=connection.execute_scalar_query("select count(1) from tripdata")
        print(time.time()-t, ": ", a)

        t=time.time()
        a=connection.execute_scalar_query("select count(1) from tripdata")
        print(time.time()-t, ": ", a)

        t=time.time()
        a=connection.execute_list_query("select passenger_count,count(1) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ": ", a)

        t=time.time()
        a=connection.execute_list_query("select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ": ", a)

Note how I first declared an external table, and then used it in the following queries.

This gives me the following numbers:

  • select count(1) from tripdata (2nd time)`: 0.521
  • select count(1) from tripdata (2nd time): 0.006 seconds
  • select passenger_count,count(1) from tripdata group by passenger_count order by 1: 0.100
  • select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1

Note how the first time we run the first query is rather slow. This is because Hyper computes some statistics on the external table the firs time you access it. Those statistics are important to Hyper's optimizer such that it will pick a good query plan. For the simple queries we are benchmarking here, those statistics won't make much of a difference, but for more complex join queries, those statistics are vital.

The updated performance numbers of Hyper are already much closer to DuckDB. Still slightly slower - we could tune Hyper further but I am not sure this would make sene. Your benchmark data is pretty small and Hyper is more tuned towards larger data sets. I would be interested in which performance your benchmark yields on larger data sets

@l1t1
Copy link
Author

l1t1 commented Sep 2, 2023

@vogelsgesang
thank you for your detailed reply, I learned a lot.
it's unfair to compare the python modules with native binaries too, one more question, does hyper database have the CLI? I found a hyperd server in the tableau desktop, but no client.

@l1t1
Copy link
Author

l1t1 commented Sep 4, 2023

test of python module duckdb

import time
import duckdb

duckdb.sql("CREATE view tripdata as select * from 'd:/yellow_tripdata_2021-06.parquet'")

if 1==1:
        t=time.time()
        a=duckdb.sql("select count(1) from tripdata")
        print(time.time()-t, ":\n", a)

        t=time.time()
        a=duckdb.sql("select count(1) from tripdata")
        print(time.time()-t, ":\n", a)

        t=time.time()
        a=duckdb.sql("select passenger_count,count(1) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ":\n", a)

        t=time.time()
        a=duckdb.sql("select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ":\n", a)

query on CREATE view of duckdb is faster than CREATE TEMPORARY EXTERNAL TABLE of hyper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants