This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

[Merged by Bors] - Add demo data-warehouse-iceberg-trino-spark #144

Closed
sbernauer wants to merge 31 commits into main from demo-data-warehouse-iceberg-trino-spark

Conversation


@sbernauer sbernauer commented Oct 14, 2022

Description

Needs a larger k8s cluster! I use an IONOS k8s cluster with 9 nodes, each with 4 cores (8 threads), 20 GB RAM and a 30 GB HDD.
Maybe we can also offer a smaller variant later on.

Otherwise business as usual. From the feature branch, run `stackablectl --additional-stacks-file stacks/stacks-v1.yaml --additional-releases-file releases.yaml --additional-demos-file demos/demos-v1.yaml demo install data-warehouse-iceberg-trino-spark`

I'm not happy with some parts, but I think an iterative approach is best:

  • Shared bikes are currently not streamed into Kafka (they are loaded by a one-time job instead).
  • Some high-volume real-time data source would be great. Currently we use the water levels and duplicate them to get higher volumes.
  • Some sort of upsert or deletion use cases would be great, but probably not on the large datasets, for the sake of our wallet ^^
  • Better dashboards. The current ones were thrown together quickly.
  • I would like to partition the water_level measurements by day, but ran into apache/iceberg#5625 (Structured Streaming writes to an Iceberg table with a non-identity partition spec break with the Spark extensions enabled). There might be ways around it by using a dedicated Spark context for compaction, but we can easily adopt the partitioning once the issue gets fixed (see the sketch after this list). Sorting during rewrites cost performance during compaction and did not provide real benefits in my initial measurements, so it is disabled for now.
  • As always, I tracked my findings in [Tracker] Findings of demos demos#15
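
For illustration, a minimal PySpark sketch of the day-partitioning idea mentioned above, assuming the Iceberg Spark SQL extensions are enabled and using hypothetical table and column names (`lakehouse.water_levels.measurements`, `measurement_ts`):

```python
from pyspark.sql import SparkSession

# Requires the Iceberg runtime jar and
# spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# on the Spark session; the table and column names below are hypothetical.
spark = SparkSession.builder.appName("partition-water-levels").getOrCreate()

# Add a hidden day partition on the measurement timestamp. This is the
# non-identity partition spec that currently breaks Structured Streaming
# writes when the Spark extensions are enabled (apache/iceberg#5625).
spark.sql("""
    ALTER TABLE lakehouse.water_levels.measurements
    ADD PARTITION FIELD days(measurement_ts)
""")
```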

To get to the Spark UI: `kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-warehouse-.*-driver') 4040`

Review Checklist

  • Code contains useful comments
  • (Integration-)Test cases added (or not applicable)
  • Documentation added (or not applicable)
  • Changelog updated (or not applicable)
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Once the review is done, comment `bors r+` (or `bors merge`) to merge.

@sbernauer sbernauer self-assigned this Oct 14, 2022

@maltesander maltesander left a comment


Nice. LGTM!

@sbernauer sbernauer commented Oct 17, 2022

It kept running over the weekend. We are now at (sizes in bytes):

|files|total_size     |avg_size   |total_records |avg_records|smallest_file|
|-----|---------------|-----------|--------------|-----------|-------------|
|298  |150,318,197,593|504,423,481|36,430,087,583|122,248,616|7,101,325    |
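
For reference, numbers like these can be pulled from Iceberg's `files` metadata table; a minimal PySpark sketch, again with a hypothetical table name (and assuming `smallest_file` refers to the record count of the smallest file):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iceberg-file-stats").getOrCreate()

# Iceberg exposes one row per data file in the "files" metadata table;
# the table name is a hypothetical placeholder.
files = spark.read.table("lakehouse.water_levels.measurements.files")

files.agg(
    F.count("*").alias("files"),
    F.sum("file_size_in_bytes").alias("total_size"),
    F.avg("file_size_in_bytes").alias("avg_size"),
    F.sum("record_count").alias("total_records"),
    F.avg("record_count").alias("avg_records"),
    F.min("record_count").alias("smallest_file"),
).show(truncate=False)
```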

@sbernauer sbernauer force-pushed the demo-data-warehouse-iceberg-trino-spark branch from bf9d717 to 9538515 on October 28, 2022 08:44
@sbernauer

@maltesander could you please do another round of review on the docs?


@maltesander maltesander left a comment


Just English nitpicking!

@sbernauer sbernauer force-pushed the demo-data-warehouse-iceberg-trino-spark branch from dc3c913 to 958f46f on November 3, 2022 10:46

@maltesander maltesander left a comment


LGTM!

@sbernauer

Many many thanks!

@sbernauer

bors r+

bors bot pushed a commit that referenced this pull request Nov 3, 2022
@bors bors bot commented Nov 3, 2022

Pull request successfully merged into main.

Build succeeded.

@bors bors bot changed the title Add demo data-warehouse-iceberg-trino-spark [Merged by Bors] - Add demo data-warehouse-iceberg-trino-spark Nov 3, 2022
@bors bors bot closed this Nov 3, 2022
@bors bors bot deleted the demo-data-warehouse-iceberg-trino-spark branch November 3, 2022 14:52