This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

[Merged by Bors] - Add demo data-warehouse-iceberg-trino-spark #144

Closed
sbernauer wants to merge 31 commits into main from demo-data-warehouse-iceberg-trino-spark

Conversation


@sbernauer sbernauer commented Oct 14, 2022

Description

Needs a larger k8s cluster! I use an IONOS k8s cluster with 9 nodes, each with 4 cores (8 threads), 20 GB RAM and a 30 GB HDD.
Maybe we can also offer a smaller variant later on.

Otherwise business as usual. From the feature branch, run `stackablectl --additional-stacks-file stacks/stacks-v1.yaml --additional-releases-file releases.yaml --additional-demos-file demos/demos-v1.yaml demo install data-warehouse-iceberg-trino-spark`

I'm not happy with some parts, but I think an iterative approach is best:

  • Shared bikes are currently not streamed into Kafka (they are loaded by a one-time job instead).
  • Some high-volume real-time data source would be great. Currently we use the water levels and duplicate them to get higher volumes.
  • Some sort of upsert or deletion use cases would be great, but probably not on the large datasets, for the sake of our wallet ^^
  • Better dashboards. The current ones were thrown together quickly.
  • I would like to partition the water_level measurements by day, but ran into apache/iceberg#5625 (Structured Streaming writes to an Iceberg table with a non-identity partition spec break with the Spark extensions enabled). There might be ways around it by using a dedicated Spark context for compaction, but we can easily adopt the partitioning once the issue gets fixed (see the sketch after this list). Sorting during rewrites cost performance during compaction and did not provide real benefits in my initial measurements, so it is disabled for now.
  • As always, I tracked my findings in [Tracker] Findings of demos demos#15
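
For illustration, a minimal PySpark sketch of the day-partitioning idea mentioned above, assuming the Iceberg Spark SQL extensions are enabled and using hypothetical table and column names (`lakehouse.water_levels.measurements`, `measurement_ts`):

```python
from pyspark.sql import SparkSession

# Requires the Iceberg runtime jar and
# spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# on the Spark session; the table and column names below are hypothetical.
spark = SparkSession.builder.appName("partition-water-levels").getOrCreate()

# Add a hidden day partition on the measurement timestamp. This is the
# non-identity partition spec that currently breaks Structured Streaming
# writes when the Spark extensions are enabled (apache/iceberg#5625).
spark.sql("""
    ALTER TABLE lakehouse.water_levels.measurements
    ADD PARTITION FIELD days(measurement_ts)
""")
```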

To get to the Spark UI: `kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-warehouse-.*-driver') 4040`

Review Checklist

  • Code contains useful comments
  • (Integration-)Test cases added (or not applicable)
  • Documentation added (or not applicable)
  • Changelog updated (or not applicable)
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Once the review is done, comment `bors r+` (or `bors merge`) to merge.

@sbernauer sbernauer self-assigned this Oct 14, 2022

@maltesander maltesander left a comment


Nice. LGTM!

@sbernauer sbernauer commented Oct 17, 2022

It kept running over the weekend. We are now at (sizes in bytes):

|files|total_size     |avg_size   |total_records |avg_records|smallest_file|
|-----|---------------|-----------|--------------|-----------|-------------|
|298  |150,318,197,593|504,423,481|36,430,087,583|122,248,616|7,101,325    |
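
For reference, numbers like these can be pulled from Iceberg's `files` metadata table; a minimal PySpark sketch, again with a hypothetical table name (and assuming `smallest_file` refers to the record count of the smallest file):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iceberg-file-stats").getOrCreate()

# Iceberg exposes one row per data file in the "files" metadata table;
# the table name is a hypothetical placeholder.
files = spark.read.table("lakehouse.water_levels.measurements.files")

files.agg(
    F.count("*").alias("files"),
    F.sum("file_size_in_bytes").alias("total_size"),
    F.avg("file_size_in_bytes").alias("avg_size"),
    F.sum("record_count").alias("total_records"),
    F.avg("record_count").alias("avg_records"),
    F.min("record_count").alias("smallest_file"),
).show(truncate=False)
```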

@sbernauer sbernauer force-pushed the demo-data-warehouse-iceberg-trino-spark branch from bf9d717 to 9538515 on October 28, 2022 08:44
@sbernauer

@maltesander could you please do another round of review on the docs?


@maltesander maltesander left a comment


Just English nitpicking!

@sbernauer sbernauer force-pushed the demo-data-warehouse-iceberg-trino-spark branch from dc3c913 to 958f46f on November 3, 2022 10:46

@maltesander maltesander left a comment


LGTM!

@sbernauer

Many many thanks!

@sbernauer

bors r+

bors bot pushed a commit that referenced this pull request Nov 3, 2022
@bors bors bot commented Nov 3, 2022

Pull request successfully merged into main.

Build succeeded.

@bors bors bot changed the title Add demo data-warehouse-iceberg-trino-spark [Merged by Bors] - Add demo data-warehouse-iceberg-trino-spark Nov 3, 2022
@bors bors bot closed this Nov 3, 2022
@bors bors bot deleted the demo-data-warehouse-iceberg-trino-spark branch November 3, 2022 14:52