How does Daft do logical optimization? #2626

Unanswered

htcd-subham asked this question in Q&A

htcd-subham
Aug 6, 2024

== Unoptimized Logical Plan ==

Filter: col(country) == lit("Canada")
|
GlobScanOperator
| Glob paths = [s3://daft-public-data/tutorials/10-min/sample-data-dog-owners-
| partitioned.pq/**]
| Coerce int96 timestamp unit = Nanoseconds
| IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000,
| Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry
| mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check
| hostname SSL = true, Requester pays = false, Force Virtual Addressing = false },
| Azure config = { Anonymous = false, Use SSL = true }, GCS config = { Anonymous =
| false }, HTTP config = { user_agent = daft/0.0.1 }
| Use multithreading = true
| File schema = first_name#Utf8, last_name#Utf8, age#Int64, DoB#Date,
| country#Utf8, has_dog#Boolean
| Partitioning keys = []
| Output schema = first_name#Utf8, last_name#Utf8, age#Int64, DoB#Date,
| country#Utf8, has_dog#Boolean

== Optimized Logical Plan ==

GlobScanOperator
| Glob paths = [s3://daft-public-data/tutorials/10-min/sample-data-dog-owners-
| partitioned.pq/**]
| Coerce int96 timestamp unit = Nanoseconds
| IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000,
| Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry
| mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check
| hostname SSL = true, Requester pays = false, Force Virtual Addressing = false },
| Azure config = { Anonymous = false, Use SSL = true }, GCS config = { Anonymous =
| false }, HTTP config = { user_agent = daft/0.0.1 }
| Use multithreading = true
| File schema = first_name#Utf8, last_name#Utf8, age#Int64, DoB#Date,
| country#Utf8, has_dog#Boolean
| Partitioning keys = []
| Filter pushdown = col(country) == lit("Canada")
| Output schema = first_name#Utf8, last_name#Utf8, age#Int64, DoB#Date,
| country#Utf8, has_dog#Boolean

== Physical Plan ==

TabularScan:
| Num Scan Tasks = 1
| Estimated Scan Bytes = 6336
| Clustering spec = { Num partitions = 1 }

How did Daft did this optimization?

Replies: 1 comment

jaychia
Aug 7, 2024
Maintainer

Hi! Daft does filter predicate pushdowns into the scan. This allows us to perform filtering as we read data, making it much more memory efficient :)

0 replies

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment