Make segment-cache size configurable and use emptyDir for it #306

Closed
3 tasks
sbernauer opened this issue Sep 28, 2022 · 6 comments
Labels
priority/medium release/2023-01 release-note/action-required Denotes a PR that introduces potentially breaking changes that require user action. size/M type/feature-new

Comments

@sbernauer
Member

sbernauer commented Sep 28, 2022

Currently we have the segment-cache location and size hardcoded to 300 GB:

value: "[{\"path\":\"/stackable/var/druid/segment-cache\",\"maxSize\":\"300g\"}]"

Also, /stackable/var/druid/segment-cache is not mounted as a volume but instead lives in the container's root filesystem.

We could either put the cache on a disk or in a ramdisk (by using the Memory medium for the emptyDir).
My suggestion is putting it on disk, as this matches the Druid docs:

Segments assigned to a Historical process are first stored on the local file system (in a disk cache) and then served by the Historical process

So we need an emptyDir without setting an explicit medium (i.e. using disk). We should also set the sizeLimit to the cache size.

  • segment-cache resides on emptyDir with correct sizeLimit
  • segment-cache size configurable.
  • segment-cache free percentage configurable. We default to a 5% free percentage and set this as the freeSpacePercent Druid attribute.
    CRD proposal:
  historicals:
    roleGroups:
      default:
        replicas: 3
        config:
          resources:
            cpu:
              min: '200m'
              max: '4'
            memory:
              limit: '2Gi'
            storage:
              segmentCache: # Enum called e.g. "StorageVolumeConfig" (new)
                freePercentage: 5 # default: 5
                emptyDir: # struct EmptyDirConfig (new)
                  capacity: 10Gi
                  medium: "" # or "Memory"
                # OR
                pvc: # PvcConfig struct
                  capacity: 10Gi
                  storageClass: "ssd"

UPDATE: 04.11.2022

Change of plan: since the operator framework doesn't support merging enum types currently, the solution above cannot be implemented. In agreement with others, a new temporary solution is proposed: an implementation with support for emptyDir storage will be made in this repository only. Later, when the framework is updated with the enum merging support, the complete solution from above will be implemented. This proposal is forward compatible with the one above from the user's perspective.

The manifest will look just like this (note the missing PVC configuration):

  historicals:
    roleGroups:
      default:
        replicas: 3
        config:
          resources:
            cpu:
              min: '200m'
              max: '4'
            memory:
              limit: '2Gi'
            storage:
              segmentCache:
                freePercentage: 5 # default: 5
                emptyDir:
                  capacity: 10Gi
                  medium: "" # or "Memory"
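
From such a manifest the operator would then be expected to derive the Druid segment cache location, roughly along these lines (a sketch of the intended mapping, not the exact rendered output):

  # Hypothetical runtime.properties entry derived from the manifest above:
  # maxSize comes from emptyDir.capacity, freeSpacePercent from freePercentage
  druid.segmentCache.locations=[{"path":"/stackable/var/druid/segment-cache","maxSize":"10g","freeSpacePercent":"5.0"}]

With a 10Gi cache and a 5% free percentage, Druid would keep roughly 0.5 GiB of the volume free.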

@lfrancke lfrancke changed the title Make segent-cache size configurable and use emptyDir for it Make segment-cache size configurable and use emptyDir for it Sep 30, 2022
@lfrancke
Member

"on disk" -> is this an externally provided PV?

bors bot pushed a commit to stackabletech/stackablectl that referenced this issue Sep 30, 2022
## Description

Run with
`stackablectl --additional-demos-file demos/demos-v1.yaml --additional-stacks-file stacks/stacks-v1.yaml demo install nifi-kafka-druid-water-level-data`

Tested demo with 2.500.000.000 records


Hi all, here is a short summary of the observations from the water-level demo:

NiFi uses a content-repo PVC but keeps it at ~50% usage => Should be fine forever
Actions:
* Increase the content-repo from 5 to 10 GB, better safe than sorry. I was able to crash it by using large queues and stalling processors.

Kafka uses a PVC (currently 15 GB) => Should work fine for ~1 week
Actions:
* Look into retention settings (low priority, as it should work for ~1 week) so that it works forever

Druid uses S3 for deep storage (the S3 bucket has 15 GB). But currently it also caches *everything* locally on the Historical because we set `druid.segmentCache.locations=[{"path"\:"/stackable/var/druid/segment-cache","maxSize"\:"300g"}]` (hardcoded in https://github.com/stackabletech/druid-operator/blob/45525033f5f3f52e0997a9b4d79ebe9090e9e0a0/deploy/config-spec/properties.yaml#L725)
This does *not* really affect the demo, as 100.000.000 records (let's call it ~1 week of data) take up ~400 MB.
I think the main problem with the demo is that queries take > 5 minutes to complete and Superset shows timeouts.
The Historical pod suspiciously uses exactly one CPU core, and the queries are really slow for a "big data" system IMHO.
This could be either because Druid is only using a single core or because we don't set any resources (yet!) and the node does not have more cores available. Going to research that.
Actions:
* Created stackabletech/druid-operator#306
* In the meantime, configure an override in the demo: `druid.segmentCache.locations=[{"path"\:"/stackable/var/druid/segment-cache","maxSize"\:"3g","freeSpacePercent":"5.0"}]` (see the sketch after this list)
* Research slow query performance
* Have a look at the queries the Superset Dashboard executes and optimize them
* Maybe we should bump the druid-operator version in the demo (e.g. create a release 22.09-druid which is basically 22.09 with a newer druid-operator version). That way we get stable resources.
* Enable Druid auto-compaction to reduce the number of segments
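
For the override mentioned in the second action item, a DruidCluster manifest using the generic configOverrides mechanism could look roughly like this (a sketch; the exact key structure and file name are assumptions based on the usual Stackable override pattern):

  historicals:
    configOverrides:
      runtime.properties:
        druid.segmentCache.locations: '[{"path":"/stackable/var/druid/segment-cache","maxSize":"3g","freeSpacePercent":"5.0"}]'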
@sbernauer
Member Author

Nope, it's an emptyDir. Normally that's a spinning disk or SSD on the k8s node. As it's a cache, there is no point in saving it via a PVC.

@soenkeliebau
Member

soenkeliebau commented Oct 31, 2022

The integration test for this failed, so it should be investigated some more.

https://ci.stackable.tech/job/druid-operator-it-custom/32/

@fhennig
Contributor

fhennig commented Oct 31, 2022

Maybe run it on AWS EKS 1.22 (the nightly runs on that) instead of IONOS 1.24.

@fhennig fhennig removed their assignment Nov 2, 2022
@fhennig
Contributor

fhennig commented Nov 2, 2022

I've unassigned myself, since this will go into a bigger phase of "In Progress" again.

@razvan
Member

razvan commented Nov 3, 2022

Blocked by: stackabletech/operator-rs#497

bors bot pushed a commit that referenced this issue Nov 4, 2022
Part of: #306 

This PR has been extracted from #320, which will be closed. The part that was left out is the actual configuration of the segment cache size. That will be implemented in a future PR and will require a new operator-rs release.

:green_circle: CI https://ci.stackable.tech/view/02%20Operator%20Tests%20(custom)/job/druid-operator-it-custom/34/


Co-authored-by: Sebastian Bernauer <[email protected]>
bors bot pushed a commit that referenced this issue Nov 14, 2022
# Description

This doesn't add or change any functionality.

Fixes #335 

Required for #306 

This is based on #333 and has to be merged after that.

:green_circle: CI: https://ci.stackable.tech/view/02%20Operator%20Tests%20(custom)/job/druid-operator-it-custom/39/

## Review Checklist

- [x] Code contains useful comments
- [x] CRD change approved (or not applicable)
- [x] (Integration-)Test cases added (or not applicable)
- [x] Documentation added (or not applicable)
- [x] Changelog updated (or not applicable)
- [x] Cargo.toml only contains references to git tags (not specific commits or branches)
- [x] Helm chart can be installed and the deployed operator works (or not applicable)

Once the review is done, comment `bors r+` (or `bors merge`) to merge. [Further information](https://bors.tech/documentation/getting-started/#reviewing-pull-requests)
@adwk67 adwk67 self-assigned this Nov 16, 2022
@bors bors bot closed this as completed in 1978a8e Nov 16, 2022
@lfrancke lfrancke added the release-note/action-required Denotes a PR that introduces potentially breaking changes that require user action. label Sep 9, 2024